As your Kafka-based applications grow, you'll likely need more than simple message handling. Use cases like real-time analytics, fraud detection, and dynamic personalization require continuous transformations, aggregations, and joins across streams of events—not just isolated message reads.
That’s where stream processing frameworks come in.
Advanced stream processors allow you to perform a wide range of tasks, including:

- Stateless transformations, such as filtering, masking, and reformatting messages
- Stateful aggregations over time windows
- Joins and enrichment across streams and lookup tables
Stream processing frameworks are purpose-built to manage state, ordering, and fault tolerance at scale.
You’ll often reach for one of these two libraries:

- Apache Flink, a de facto standard for stream processing on Kafka, with DataStream, Table, and SQL APIs
- Kafka Streams, a Java stream processing library that ships with Apache Kafka
You don’t need to become a Flink expert overnight. But understanding the “why” behind stream processing helps you design more scalable, reactive systems from the start. In this course, you’ll:

- See why consumers tend to outgrow hand-rolled stream processing code
- Compare Flink’s DataStream, Table, and SQL APIs with Kafka Streams
- Walk through a Flink SQL example that aggregates and joins streaming data
Stream processing unlocks the full power of event-driven architecture—enabling smarter, faster decisions using the data you’re already collecting in Kafka.
This is the introduction to stream processing with Apache Kafka. In a growing data streaming application with Apache Kafka as its substrate, consumers tend to grow in complexity. Producers may stay fairly simple, but consumers don't; they grow. What may have started as a simple stateless transformation, like masking personally identifiable information or reformatting a message to conform to some internal format, soon picks up stateful operations like aggregation, enrichment, joining, and maybe time-window processing. There's a lot to that.
And maybe your job is to detect fraudulent activity on your company's website. It involves a sequence of interactions, which are all events, and you want to detect a pattern and perform some correlation. That's your job to do. It's not your job to write stream processing infrastructure, and it's very easy to get pulled into doing that a little bit at a time, just by adding a little more complexity to your consumers.
Now, if you recall the consumer code we looked at in the section on consumers, there isn't much built-in support for advanced stream processing features. Go read the docs; it's just not there. So you'd have to write a significant amount of framework code to handle time windows, late-arriving messages, out-of-order events, lookup tables, key-based aggregation, all of that. That's not your job. It might sound fun, but you're supposed to be delivering value to the business, not building infrastructure.
And the stateful operations are even harder, because they introduce new fault tolerance problems to solve. If your application crashes, is your state persisted somewhere? Doing all of this at scale is incredibly complex. Consider this my extended plea for you to learn and adopt a stream processing framework for your work with Kafka.
So we're stepping halfway out of the Kafka universe proper here. We're going to talk about two options, Apache Flink and Kafka Streams. Flink has become a de facto standard way of doing stream processing on Kafka. It has all the core computational primitives of stream processing (transforming, filtering, joining, aggregating) without requiring you to write extensive framework code, which is just not a thing you want to do. We'll look at Kafka Streams in a moment.
There are three APIs to choose from in Flink. We start at the bottom with the DataStream API, the first one out of the gate; the earliest versions of Flink used it. It's very low-level, and while there are folks who have used it for years, are very happy with it, and want to stick with it, it's probably not recommended for new designs. It's harder to learn, and much of the current research, development, and commit activity is going into the Table API rather than the DataStream API.
The Table API is also just much easier to understand. It's not literally SQL, but a fluent API in Java and Python whose methods look like SQL-ish operations. It's very simple stuff to understand, and it's growing more powerful over time. So that's a great way to go for new designs if you want to write your stream processors in Java or Python. And if you want to write them in SQL, Flink SQL gets the job done too; that's a live option.
You write SQL statements, and those are parsed, broken down, and turned into jobs running on the Flink cluster. That can be a great way to get stream processing done with the expressiveness of SQL. Who wouldn't want that? So it's one framework to do all the things. Flink can also process bounded data in batch mode: not streaming data, but data you've got lying around in a pile in S3 or something like that. In open-source Flink, the same code can in some cases process batch data as well as streaming data. That's kind of a cool thing.
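For instance, in Flink's SQL client you can flip between the two modes with a single setting. The configuration key is standard Flink; the table and query here are hypothetical:

```sql
-- Run subsequent statements as bounded batch jobs instead of
-- continuous streaming jobs.
SET 'execution.runtime-mode' = 'batch';

-- The same query text works in either mode; 'events' is a
-- hypothetical table, perhaps backed by files in S3.
SELECT user_id, COUNT(*) AS event_count
FROM events
GROUP BY user_id;
```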
Now, Flink isn't the only game in town. There's also Kafka Streams, which is part of open-source Apache Kafka. It's a Java library that comes with Kafka, a thing you can write code against. There are lots of great tutorials, courses available on Confluent Developer, and great documentation; everything you need is there. So if you're a Java shop and you've already got everything set up to deploy Java applications that consume from Kafka, a Kafka Streams app is just an enriched, leveled-up Kafka consumer with all those same stream processing primitives in it.
Now, there is a basic tendency for extreme-scale work to be done with Flink; you don't see that as much with Kafka Streams. Kafka Streams has a great scale story, but as a general, anecdotal community sense, the larger-scale things tend to happen in Flink and the small-to-medium-scale things in Kafka Streams. Still, many Java shops are very happy with Kafka Streams, and many people are very happy with Flink. These are both live options for getting this work done.
Lest we focus too much on Java in this course, I'd like to take you through some Flink SQL and give you a quick example of how this might work. I just want to put some code in front of you. I'm not going to explain every last detail, so don't worry if you don't understand it all; you can always pause and look things up in the docs if you really want to study deeply. I'm just trying to give you a sense of what's going on.
So we've got a Kafka topic called raw-ratings (you can see it toward the bottom of the DDL, where it says connector = kafka and topic = raw-ratings) with movie ratings in it. Somebody deployed an app somewhere where people can rate movies live, and those ratings are streaming in. We're going to make a table out of that, with a movie ID, a rating, and a time.
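Here's a sketch of what that DDL might look like. The column names, broker address, and format are assumptions for illustration:

```sql
-- Expose the raw-ratings Kafka topic as a Flink table.
CREATE TABLE ratings (
  movie_id    INT,
  rating      DOUBLE,
  rating_time TIMESTAMP(3),
  -- Treat rating_time as event time, tolerating slightly late events.
  WATERMARK FOR rating_time AS rating_time - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'raw-ratings',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
```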
All right, what are we going to do with that? Well, let's average the ratings. We're going to create a new table called average_ratings that averages the ratings over a tumbling window with an interval of five minutes, grouped by movie ID. So for each movie ID, you get the average of its ratings over the last five minutes, and that lands in the average_ratings table.
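In recent Flink versions, that aggregation might look something like this, using the TUMBLE windowing table-valued function. CREATE TABLE ... AS is supported in newer Flink releases and in Confluent Cloud; in older open-source versions you'd declare the sink table's connector options separately:

```sql
-- Average ratings per movie over five-minute tumbling windows.
CREATE TABLE average_ratings AS
SELECT
  movie_id,
  window_start,
  AVG(rating) AS avg_rating
FROM TABLE(
  TUMBLE(TABLE ratings, DESCRIPTOR(rating_time), INTERVAL '5' MINUTES))
GROUP BY movie_id, window_start, window_end;
```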
What do we need to do with that? I don't know any of these movie IDs; I'd like movie names. It turns out I've got a topic lying around called movies that maps my movie IDs to movie names. This is, by the way, a good example of a compacted topic in Kafka, where the topic is really holding entities and not so much events. That's totally fine; a topic can do that. Now I've got a table called movies.
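Because the movies topic is compacted and holds the current state per key, Flink's upsert-kafka connector is a natural fit here. Again, the column names and formats are assumptions:

```sql
-- Map the compacted movies topic to a table keyed by movie_id.
CREATE TABLE movies (
  movie_id INT,
  title    STRING,
  PRIMARY KEY (movie_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'movies',
  'properties.bootstrap.servers' = 'broker:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```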
So now I have all the pieces set up to do what I want to do. I've got the average_ratings table and the movies table. I join them, and look at that: I now have rated movies. I can see actual movie titles and their ratings. Looks like we have a David Lynch fan here, but not so much into Mulholland Drive. I wonder what happened there. We'll have to look into it.
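The join itself is a sketch along these lines, reusing the tables defined above:

```sql
-- Join windowed averages with movie titles.
CREATE TABLE rated_movies AS
SELECT
  m.title,
  ar.avg_rating,
  ar.window_start
FROM average_ratings AS ar
JOIN movies AS m
  ON ar.movie_id = m.movie_id;
```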
Anyway, that's a quick view of what you can do with Flink SQL on live streaming data. Remember, this all started with the raw-ratings topic. At some volume, maybe extreme volume, ratings are flowing into that topic, being averaged, and being produced into this table for our use.
There are other things you can do with Flink SQL. You can detect duplicates, which is a very common use case (see the sketch below). It can handle bounded datasets with its batch execution mode, so you don't have to switch tools. And the SQL API isn't just for simple queries: it has support for user-defined functions, so when it's easier for you to implement a function in Python or Java rather than with SQL operators, you can call that UDF from your SQL query.
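Deduplication in Flink SQL typically uses the ROW_NUMBER pattern. Here's a minimal sketch against the ratings table above, assuming an exact duplicate means the same movie, rating, and timestamp:

```sql
-- Keep only the first occurrence of each identical rating event.
SELECT movie_id, rating, rating_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY movie_id, rating, rating_time
           ORDER BY rating_time ASC) AS row_num
  FROM ratings)
WHERE row_num = 1;
```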
So whether you use Kafka Streams or Flink, stream processing is a thing you're going to need to do. And Flink, as I said, is the emerging de facto standard: a very powerful framework for real-time stream processing, with the SQL API and the Table API available for your use. It's also available in fully managed form in Confluent Cloud, so if you don't want to deploy your own Flink cluster, it's a thing you can run there.