As data streaming applications grow in complexity, consumers often evolve from simple message processing to more advanced stream processing tasks. What might start as basic, stateless transformations, like masking sensitive information or reformatting messages, can quickly expand into stateful operations:

- Aggregations, such as rolling averages computed per key
- Enrichment, joining one stream against lookup tables or other streams
- Time-windowed processing, such as results computed over five-minute windows
Managing these operations with basic consumers is not scalable and introduces challenges like handling out-of-order events, managing state persistence, and ensuring fault tolerance. To address this, the Kafka ecosystem offers two stream processing frameworks: Apache Flink® and Kafka Streams.
Apache Flink is an open source, independent, distributed stream processing framework that integrates seamlessly with Kafka. Flink handles all core stream processing operations—transformations, filtering, joining, and aggregating—without requiring you to build custom processing infrastructure. Flink has become a de facto standard for large-scale stream processing alongside Kafka.
Flink offers three main APIs at different levels of abstraction:

- The DataStream API: the original, lowest-level API; mature, but generally not recommended for new designs
- The Table API: a fluent API for Java and Python whose methods read like SQL operations; the focus of much current development
- Flink SQL: SQL statements that are parsed and run as jobs on the Flink cluster
One of Flink's unique features is its ability to process both streaming and batch data using the same APIs. For example, data stored in S3 can be processed just like a live event stream, making it flexible for both real-time and historical data analysis.
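To make that concrete, here's a minimal Flink SQL sketch of batch execution over files. The bucket path, table, and column names are illustrative placeholders; the runtime-mode setting and filesystem connector are standard Flink.

```sql
-- Run this session in batch mode over bounded data instead of streaming.
SET 'execution.runtime-mode' = 'batch';

-- The same kind of table definition works over files in S3 as over
-- Kafka topics; this path and schema are placeholders.
CREATE TABLE historical_ratings (
    movie_id BIGINT,
    rating   DOUBLE
) WITH (
    'connector' = 'filesystem',
    'path'      = 's3://my-bucket/ratings/',
    'format'    = 'json'
);

-- An ordinary aggregate query, now executed as a finite batch job.
SELECT movie_id, AVG(rating) AS avg_rating
FROM historical_ratings
GROUP BY movie_id;
```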
Kafka Streams is a Java library that is part of Apache Kafka itself. Unlike Flink, it doesn’t require a separate cluster—it's a lightweight client library that turns your consumer applications into powerful stream processors. Kafka Streams provides the same processing primitives as Flink—transform, filter, aggregate, and join—while seamlessly integrating with Kafka topics.
While Flink is often chosen for extreme-scale processing, Kafka Streams excels in small to medium-scale applications where Java is the primary language. It's ideal for microservices that need to process streams with minimal operational overhead.
Flink's SQL API allows you to write SQL queries directly against Kafka topics, treating them as real-time tables. Here’s an example:
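This sketch uses the movie-ratings topic from the walkthrough later in this module; the column names, formats, and connection settings are illustrative.

```sql
-- Expose the raw-ratings Kafka topic as a continuously updating table.
-- Column names and connector settings here are illustrative.
CREATE TABLE ratings (
    movie_id    BIGINT,
    rating      DOUBLE,
    rating_time TIMESTAMP(3),
    WATERMARK FOR rating_time AS rating_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'raw-ratings',
    'properties.bootstrap.servers' = 'broker:9092',
    'format'    = 'json',
    'scan.startup.mode' = 'earliest-offset'
);

-- A continuous query: every new event on the topic flows through it.
SELECT movie_id, rating
FROM ratings
WHERE rating IS NOT NULL;
```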
Flink SQL also supports advanced use cases like deduplication, windowing operations, and user-defined functions (UDFs). You can extend SQL logic with Java or Python for more complex processing without leaving the SQL context.
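For instance, a scalar UDF implemented in Java can be registered and then called like any built-in function. The class name and the signups table below are hypothetical:

```sql
-- Register a scalar function backed by a (hypothetical) Java class
-- that's on the job's classpath.
CREATE TEMPORARY FUNCTION mask_email AS 'com.example.udf.MaskEmail';

-- Call it like any built-in function; the signups table is hypothetical.
SELECT user_id, mask_email(email) AS masked_email
FROM signups;
```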
Of the two frameworks, only Apache Flink is available as a fully managed service in Confluent Cloud, letting you use powerful stream processing without managing infrastructure.
In summary, stream processing is essential for real-time analytics, fraud detection, traffic monitoring, and more. Whether you use Apache Flink or Kafka Streams, these frameworks allow you to transform, enrich, and analyze your Kafka streams reliably and at scale.
This is the stream processing module of our introduction to Apache Kafka. In a growing data streaming application with Apache Kafka as its substrate, consumers tend to grow in complexity. Producers may stay fairly simple, but consumers don't; they grow. What may have started as a simple stateless transformation, like masking personally identifiable information or reformatting a message to conform to some internal format, soon picks up stateful operations: aggregation, enrichment, joining, maybe time-window processing. There's a lot to that.
And maybe your job is detecting fraudulent activity on your company's website. It involves a sequence of interactions, they're all events, and you want to detect a pattern and perform some correlation and all that. That's your job to do. It's not your job to write stream processing infrastructure, and it's very easy to get pulled into that a little bit at a time, just by adding a little more complexity to consumers.
Now, if you recall the consumer code we looked at in the section on consumers, there isn't much built-in support for advanced stream processing features. You go read the docs; it's just not there. So you'd have to write a significant amount of framework code to handle time windows, late-arriving messages, out-of-order events, lookup tables, key-based aggregations, all that. That's not your job. It might sound fun, but you're supposed to be delivering value to the business, not doing that.
And the stateful operations are even harder, right? They introduce new fault tolerance problems to solve. If your application crashes, is your state persisted somewhere? And doing all this at scale is incredibly complex. This is all just my extended plea for you to learn and adopt a stream processing framework for your work with Kafka.
So we're stepping halfway out of the Kafka universe proper here. We're going to talk about two options, Apache Flink and Kafka Streams, but Flink has become a de facto standard way of doing stream processing on Kafka. It's got all the core computational primitives of stream processing: transforming, filtering, joining, aggregating, without requiring you to write extensive framework code. That's just not a thing you want to do. We'll look at Kafka Streams in a moment.
There are three APIs to choose from in Flink. We start at the bottom with the DataStream API. That was the first one out of the gate; the earliest versions of Flink used it. It's very low level. There are folks who have used it for years, are very happy with it, and want to stick with it, but it's probably not recommended for new designs. It's harder to learn, and much of the current research, development, and commit activity is going into improving the Table API rather than the DataStream API.
The Table API is also just much easier to understand. It's not literally SQL, but it's a fluent API in Java and in Python whose methods look like SQL-ish operations. It's very simple stuff to understand, and it's growing and becoming more powerful over time. So that's a great way to go for new designs if you want to write your stream processors in Java or Python. If you'd rather write them in SQL, Flink SQL gets the job done too. That is a live option.
You can write SQL statements, and those are parsed, broken down, and turned into jobs running on the Flink cluster. That can be a great way to get stream processing done with the expressiveness of SQL. Who wouldn't want that? So: one framework to do all the things. Flink by itself can also process bounded data in batch mode, meaning not stream data, but data that you've got lying around in a pile in S3 or something like that. In some cases, the same code can be used to process batch data and streaming data in open source Flink. That's kind of a cool thing.
Now, Flink isn't the only game in town. There's also Kafka Streams, which is part of open source Apache Kafka. It's a Java library that comes with Kafka, a thing you can write code against. There are lots of great tutorials, courses available on Confluent Developer, and great documentation, everything you could need to do this. So if you're a Java shop and you've already got everything set up to deploy Java applications that consume from Kafka, a Kafka Streams app is just an enriched, leveled-up Kafka consumer with all those same stream processing primitives in it.
Now, there is a basic tendency for extreme-scale things to be done with Flink. You don't see that as much with Kafka Streams. Kafka Streams has a great scale story, but as a general, anecdotal community sense, you tend to see the larger-scale things happen in Flink and the small to medium-scale things happen in Kafka Streams. Still, many Java shops are very happy with Kafka Streams, and many people are very happy with Flink. These are both live options for getting this work done.
Lest we focus too much on Java in this course, I'd like to take you through some Flink SQL and give you a quick example of how this might work. I just want to put some code in front of you to do a few things. Now I'm not going to be explaining every last detail here, so don't worry if you don't understand it all. You can always pause and look things up in the docs if you really want to study deeply, but I'm just trying to give you a sense of what's going on.
So we've got a topic, which you can see toward the bottom there: it says connector equals kafka, topic equals raw-ratings. So there's a Kafka topic called raw-ratings that's got movie ratings in it. Somebody deployed an app somewhere where people can rate movies live, and those ratings are streaming into this app. We're going to make a table out of that. You see that table there? It's got a movie ID, a rating, and a time.
All right, what are we going to do with that? Well, let's average them. We're going to create a new table called average_ratings that averages the ratings with a tumbling window over an interval of five minutes, grouped by movie ID. So: the average rating for each movie ID over the last five minutes. That's what's in the average_ratings table.
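Here's a sketch of that statement using Flink's windowing TVF syntax, with names mirroring the illustrative schema from earlier. (Open source Flink would also want connector options on the new table; this is the bare CTAS form.)

```sql
-- Average each movie's ratings over five-minute tumbling windows.
CREATE TABLE average_ratings AS
SELECT
    movie_id,
    window_start,
    window_end,
    AVG(rating) AS avg_rating
FROM TABLE(
    TUMBLE(TABLE ratings, DESCRIPTOR(rating_time), INTERVAL '5' MINUTE))
GROUP BY movie_id, window_start, window_end;
```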
What else do we need? I don't know any of these movie IDs; I'd like movie names. It turns out I've got a topic lying around called movies that maps my movie IDs to movie names. By the way, this is a good example of a compacted topic in Kafka, where the topic is really holding entities rather than events. That's totally cool; a topic can do that. Now I've got a table called movies.
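A sketch of that table: reading a compacted topic through Flink's upsert-kafka connector treats each key's latest record as the current state of an entity. The schema and settings here are illustrative.

```sql
-- Read the compacted movies topic as a table of entities: the latest
-- record per key wins.
CREATE TABLE movies (
    movie_id BIGINT,
    title    STRING,
    PRIMARY KEY (movie_id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic'     = 'movies',
    'properties.bootstrap.servers' = 'broker:9092',
    'key.format'   = 'json',
    'value.format' = 'json'
);
```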
So now I have all the pieces set up to do what I want to do. I've got that average ratings table. I've got the movies table. I join them and look at that. I now have the rated movies. I can see actual movie titles and their ratings. Looks like we have a David Lynch fan here, but not so much into Mulholland Drive. I wonder what happened there. We'll have to look into it.
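The join itself is ordinary SQL over the two tables sketched above:

```sql
-- Replace movie IDs with human-readable titles.
SELECT
    m.title,
    ar.window_start,
    ar.avg_rating
FROM average_ratings AS ar
JOIN movies AS m
  ON ar.movie_id = m.movie_id;
```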
Anyway, that's a quick view of what you can do with Flink SQL on live streaming data. So remember, this all started with this topic called raw ratings. At some volume, maybe extreme volume, ratings are flowing into that topic and being averaged and produced into this table here for our use.
There are other things you can do with Flink SQL. You can detect duplicates; that's a very common use case. It can handle bounded datasets with its batch execution mode, so you don't have to switch tools. And the SQL API isn't just for simple queries: it's got support for user-defined functions. When it's easier for you to implement a function in Python or Java rather than with SQL operators, you can call that UDF from your SQL query.
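Deduplication in Flink SQL uses a documented ROW_NUMBER pattern. Applied to the illustrative ratings schema, and defining a duplicate as a repeated delivery of the same movie and timestamp, it looks roughly like this:

```sql
-- Keep only the first event per (movie_id, rating_time) pair,
-- treating identical key/timestamp pairs as duplicate deliveries.
SELECT movie_id, rating, rating_time
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY movie_id, rating_time
               ORDER BY rating_time ASC) AS row_num
    FROM ratings
)
WHERE row_num = 1;
```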
So whether you use Kafka Streams or Flink, stream processing is a thing you're going to need to do. And Flink, as I said, is an emerging de facto standard: a very powerful framework for real-time stream processing, with the SQL API and the Table API available for your use. It's also available in fully managed form in Confluent Cloud, so if you don't want to deploy your own Flink cluster, you can run it there.