VP Developer Relations
So far we've covered the ingest and egress components of streaming pipelines. Let's now look at how we can also process that data as part of the pipeline.
Apache Kafka ships with stream processing capabilities in the form of the Kafka Streams library for Java and Scala, which you can use to build elastic applications that read, write, and process event streams in Kafka. This is great if you're in the Java ecosystem, but plenty of people aren't. The streaming database ksqlDB abstracts away the need to write Java code and instead gives you a lightweight SQL language in which to express the stream processing that you'd like to do (plus many more features, of which we cover only a few in this course).
In our stream of ratings data, there are events generated by a test system that have crept in, which we want to filter out. We can identify these by looking at the channel field. When it comes to processing the events, we only want to include those which are from production sources.
CREATE STREAM RATINGS_LIVE AS
SELECT *
FROM RATINGS
WHERE LCASE(CHANNEL) NOT LIKE '%test%'
EMIT CHANGES;
As each event is read from the source stream, ksqlDB evaluates the predicate against the CHANNEL field. The event is written to the target stream only if the lowercased channel value does not contain the substring test; otherwise it is dropped. We can now run any subsequent ksqlDB queries against the RATINGS_LIVE stream.
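One way to check that the filter behaves as described is to count events per channel on the filtered stream. The following is a minimal sketch, assuming the RATINGS_LIVE stream above has been created and that CHANNEL is a string column; the EVENT_COUNT alias is made up for illustration.

```sql
-- Read from the beginning of the topic, so events that arrived
-- before this query started are included in the counts.
SET 'auto.offset.reset' = 'earliest';

-- Push query: emits an updated count per channel as events arrive.
-- If the filter is working, no channel containing "test" should appear.
SELECT CHANNEL,
       COUNT(*) AS EVENT_COUNT
FROM RATINGS_LIVE
GROUP BY CHANNEL
EMIT CHANGES;
```

Running the same query against RATINGS instead of RATINGS_LIVE should show the test channels that the filter removes, which makes for a quick before-and-after comparison.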
Hi, I'm Tim Berglund with Confluent. This is Data Pipelines, lesson four, Filtering Streams of Data with ksqlDB. Now, filtering's a pretty simple concept, but let's just go through it as our introduction to stream processing. Remember, so far we've covered how to get data in, that's ingest, and how to get data out, that's egress. And that's an important part of a pipeline, moving the data through. It's not much of a pipe if things don't flow, but we want to look at what we can do to add value to that data as it flows through the pipeline.
So Kafka itself, the open source Apache Kafka thing, has this built-in stream processing library called Kafka Streams. It's really cool, really powerful. We've got a whole separate course on it, a bunch of examples and Kafka tutorials. It really is a great thing if you're a part of the Java ecosystem and you're already writing microservices in Java, and you're super good at deploying Java, and that's just a thing that you do absentmindedly when you wake up in the morning. That's great. But there are a lot of people who aren't like that, and you don't want to have to learn all that just to do stream processing. ksqlDB abstracts away the need for you to write Java code, and instead gives you a SQL language, a version of SQL, to express those stream processing programs that you want to make a part of your pipelines.
In our stream of ratings data, and if you've been skipping the examples videos, then you haven't seen ratings, but if you've been watching them and working along then you've seen the ratings data, there are some messages that are generated by a test system and that have mixed in, in that stream, with the production data. Now, this is part of the data generator simulation. Just enter into my story here for a second, if you would. So there's these test messages. We don't want them there. We can identify them by looking at the field called Channel. And when that field, Channel, has the value, Test, we know they're test messages and we want them to go away. That's filtering, right? We want that to happen.
So on the screen here, you see at the bottom, that thing that looks like SQL, that's a ksqlDB query. And you see an animation of messages coming in from the source topic or source stream, which looks like it has three partitions, and that filter doing its work. Sometimes they just kind of fall down and disappear, as if dropping into the bit bucket and going on to their eternal reward. And some of them make it over into one of the partitions of the target topic.
That's filtering. You knew what filtering was, but it helps to start with a really simple concept to just see the ksql and just get this concept into your head. This is stream processing. This is real-time computation, happening on messages in our pipeline. We'll see if we can come up with something a little bit more complex next.