Course: Apache Flink® 101

Intro to Stream Processing with Apache Flink

8 min
David Anderson

Software Practice Lead

Overview

Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.

This video introduces Flink, explains why it's useful, and presents some of the important patterns Flink provides for stream processing.

Topics:

  • What is Stream Processing?
  • Why Flink?
  • Bounded vs. Unbounded Streams
  • Sources and Sinks
  • The Job Graph
  • Operators
  • Parallel Processing
  • Forwarding, Repartitioning, and Rebalancing
  • Using SQL to Describe Stream Processing Pipelines

Resources

Use the promo code FLINK101 to get $25 of free Confluent Cloud usage


Intro to Stream Processing with Apache Flink

Hi, I'm David from Confluent. What is stream processing, why should you care about it, and what makes Flink especially interesting? These are a few of the questions I'll try to answer in this introduction to stream processing with Apache Flink.

Apache Flink is becoming an integral part of how many companies operate their businesses, and the reason is centered on meeting customer expectations. We've all come to expect instant access to our data. If someone uses your credit card to fraudulently rent a car halfway around the world, you want to be alerted about that right away. Similarly, every time you order something online, you expect an accurate estimate of when it will be delivered, and when something happens that's likely to disrupt that delivery, you expect to be alerted about that too. Apache Flink is a powerful framework that can connect, enrich, and process data in real time. Today's businesses are increasingly software-defined, and their business processes are being automated. Real-time business operations like these weren't feasible before event streaming platforms like Kafka and Flink came along. This is having a huge impact on many traditional industries, such as banking, telecommunications, and retail, and event streaming is also enabling new business models, such as ride-sharing.

Before we dive into the details of what Flink does and how it works, I want to draw your attention to some of the reasons to consider Flink if you intend to build real-time data products. Apache Flink has a very active and supportive community; its mailing lists and forums are consistently among the most active of all Apache projects. Flink has been battle-tested and hardened through its use at companies that run it at enormous scale: Netflix, Alibaba, Uber, and Goldman Sachs have all shared their success stories with Flink. Flink offers expressive APIs in Java, Python, and SQL, letting you work in the ecosystem where you will be most productive. And Flink supports both stream and batch processing, making it a flexible framework suited to a wide variety of use cases.

In this course, we are going to do a deep dive into each of four concepts: streaming, state, time, and snapshots. These are the big ideas on which Flink is based, and understanding how they fit together is central to understanding how Flink works. In this video, we're going to focus on streaming.

In technical terms, event streaming is the practice of capturing events in real time as they occur. Whether it's orders and shipments or downloads and clicks, business events can always be streamed. These streams of events can be manipulated, processed, and reacted to in real time, in which case the sequence of events forms an unbounded stream that extends indefinitely into the future. But event streams can also be stored for later retrieval and reprocessing. Reprocessing a batch of historic data is just a special case of streaming, where the stream is bounded by starting and ending timestamps.

Flink applications consume data from one or more event sources and produce data to one or more sinks. These sources and sinks can be messaging systems such as Kafka, or files, or databases, or any service or application that produces and consumes data. Our focus in this course will be on the part in the middle, sitting between the sources and sinks: this is where your stream processing business logic goes.
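To make the source-and-sink idea concrete, here is a minimal sketch of a Flink job that reads an unbounded stream from one Kafka topic and writes to another, using Flink's DataStream API for Java. The topic names, bootstrap servers, and the use of plain strings as the event payload are all assumptions chosen for illustration; they are not from the video.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: an unbounded stream of events from a Kafka topic.
        // (Topic name, group id, and server address are hypothetical.)
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("events")
                .setGroupId("flink-101")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Sink: write the (possibly transformed) stream to another topic.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("results")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        // Your stream processing business logic goes between the source and the sink.
        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
        events.sinkTo(sink);

        env.execute("Kafka source-to-sink sketch");
    }
}
```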
As a Flink developer, you define your business logic using one of Flink's APIs, and Flink executes your code in a Flink cluster. I'll show you how these APIs are organized and what happens inside a Flink cluster a little later. But for now, let's start by establishing some terminology.

A running Flink application is called a job, and the event data streams through a data processing pipeline that we call the job graph. The nodes in the job graph represent the processing steps in your pipeline, and each of these processing steps is executed by an operator. Operators transform event streams; I'll show you some examples of that in just a moment. The operators in the job graph are connected to one another, as shown here by the arrows that form the edges of the graph. Technically speaking, the job graph is always a directed acyclic graph, where the event data always flows from the sources toward the sinks, being processed along the way by the operators.

Stream processing is done in parallel by partitioning event streams into parallel sub-streams, each of which can be independently processed. This is a key point: partitioning into independently processed pipelines is crucial for scalability. The independent parallel operators share nothing and can run at full speed. It's typically the case that the input to a Flink job can be consumed in parallel; often these parallel input streams will have been partitioned upstream, before being ingested by Flink. In this example, some upstream process has produced data that is partitioned by shape.

At each stage of the job graph, your application code uses one of Flink's APIs to specify both what to do in that operator and where that operator should send its results. In this example, the first operator is collecting the input from the source and forwarding it downstream. Forwarding an event stream is the simplest thing an operator can do, and Flink will optimize this type of connection so that it is handled very efficiently. The second operator is filtering the event stream to remove any orange events, and then it is reorganizing the stream so that the events are grouped by color rather than by shape. Sometimes a shuffle, or repartitioning, like this is necessary. This particular shuffle brings together all of the yellow events so that they will be processed by the same node, and similarly, all of the blue events will be processed together. This makes it possible to count all of the yellow events and all of the blue events. Shuffling event streams is much more expensive than forwarding them: each event has to be serialized, and depending on where an event is routed, it may have to be sent over the network to the next operator downstream.

At the right-hand side of this diagram, you see the results of the count operation streaming toward the sink. For the yellow events, a total of one, then two, then three, and now four events have been processed; for the blue events, the count has progressed from one to two and then to three. Instead of having each instance of the count operator forward its results to one of two parallel sinks, I've chosen to change the parallelism so that there's only one sink. This is accomplished by a stream operation called rebalancing, which means that the events are redistributed in a round-robin fashion. Since in this example the parallelism is being reduced from two to one, the streams are being merged at the sink.
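As a rough sketch of the job graph just described, here is how the same shape might be expressed in the DataStream API. The inline list of colors standing in for the partitioned input, and the tuple-based counting, are assumptions for illustration; the point is how filter, keyBy, and rebalance correspond to forwarding, repartitioning, and rebalancing.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ColorCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        // A stand-in for the partitioned input described in the video.
        DataStream<String> colors =
                env.fromElements("yellow", "blue", "orange", "yellow", "blue", "yellow");

        colors
                // Filtering is a simple transformation; this connection can be
                // forwarded (and chained) without a network shuffle.
                .filter(color -> !color.equals("orange"))
                .map(color -> Tuple2.of(color, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // keyBy repartitions the stream so that all events of a given
                // color are processed by the same parallel instance.
                .keyBy(t -> t.f0)
                // A running count per color (state and time are covered later).
                .sum(1)
                // rebalance redistributes events round-robin, here feeding a
                // sink that runs at a lower parallelism than the rest of the job.
                .rebalance()
                .print().setParallelism(1);

        env.execute("Color count sketch");
    }
}
```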
Just to be clear, this sink could, and probably should, operate at the same parallelism as the rest of the job graph; I reduced the parallelism merely for the sake of having an example of a rebalance. The reason you may not want to do this is that rebalancing is expensive: just like the network shuffle we looked at before, rebalancing requires serializing each event and using the network. In addition to forwarding, repartitioning, and rebalancing streams, Flink offers many other patterns for stream processing, including broadcasting streams to distribute data throughout the cluster and joining streams for data enrichment.

The example I've been showing you can be implemented with Flink SQL, and here's what that would look like (a sketch of the statement appears below). Flink SQL can transform a SQL statement like this one into a Flink application. The first line of the statement, INSERT INTO results, sets up the results table as the sink for the application that Flink will create to execute this query. The SELECT, COUNT, and GROUP BY clauses set up the part of the job that repartitions the stream by color and counts the events of each color. The WHERE clause, which excludes orange events, creates the filtering step in the job graph. And finally, the FROM events clause specifies the event source for this streaming application. Later in the course, I'll explain in more detail how Flink SQL relates to stream processing.

If you aren't already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources for continuing your learning journey.
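For reference, here is a sketch of what the SQL statement walked through above might look like, submitted from Java through Flink's Table API. The events and results tables and the color column come from the description in the video; their DDL is omitted here, and the "is not equal to orange" condition is written with SQL's <> operator. Treat this as an illustration under those assumptions, not the exact statement from the video.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ColorCountSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The 'events' and 'results' tables are assumed to have been created
        // elsewhere (e.g., with CREATE TABLE ... WITH ('connector' = 'kafka', ...)).
        tableEnv.executeSql(
                "INSERT INTO results "            // the sink for this query
              + "SELECT color, COUNT(*) AS cnt "  // count events per color
              + "FROM events "                    // the event source
              + "WHERE color <> 'orange' "        // the filtering step
              + "GROUP BY color");                // the repartitioning by color
    }
}
```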