So you've read the docs for ksqlDB, maybe done some Kafka tutorials, but do you really know how it works on the inside? Well, on today's episode of Streaming Audio, I sat down with Michael Drogalis to talk about this stuff as a preview of a new course he's done, a video course on ksqlDB internals. Streaming Audio is brought to you by Confluent Developer. And at the time of this recording, we are relaunching a completely new version of that site with all kinds of great content. You can check it out at developer.confluent.io, that's developer.confluent.io. Let's get to today's show.
Hello, and welcome to another episode of Streaming Audio. I am, as you have become accustomed to hearing me say, your host, Tim Berglund. And I'm really happy to be joined in the virtual studio today by my friend and co-worker Michael Drogalis. Michael, welcome to the show.
Thanks for having me, Tim.
Yeah. I should say welcome back. You've been here before. This is, as I like to say, a triumph of hope over experience, like maybe it'll be better this time so you came back. And okay, so MD, what do you... And by the way, everybody, I just said his name is Michael Drogalis. Internally, he's known as MD. I don't think there's any other part of his life in which he's known as MD. But if you hear me say MD, that's why. Remind us what you do here.
Yeah. So I'm a product manager here at Confluent. I head up our stream processing products. I work on any products that really take data out of Kafka and perform computation over it. So I work mainly on ksqlDB and Kafka Streams right now.
And ksqlDB is what we're going to talk about. You and I, so we're recording this, what day is today? May 25th. Last week, just five, six days ago, we were in the Mountain View headquarters-
We were.
... for the first time.
We were. It was glorious.
It was. It was. Nature is healing. And it also was a little creepy because there weren't very many people there, and we were still being super careful and everything. But we were there to film some video content on ksqlDB internals. And you're the guy who would know that. So I also kind of wanted to talk about it today. So by the time this goes live, I'm not sure if the video course will be up. So I'm not promising you, dear listener, a link to the course in the show notes, but check the show notes because there might be such a link.
If nothing else, it's an appetizer of things to come.
Exactly. This is a little amuse-bouche for you, ksqlDB learner, to think about some internals. A little bit past the how to use it. I mean, obviously, you got to start with how to use it, but it's nice to know how things work on the inside too. So let's kick it off. We started by talking about architecture. Give me a mental block diagram.
Yeah. So the reason we wanted to cover this broad topic is, as you're starting to use Kafka more and more, it becomes a more critical part of your company. And it's really important to understand the way that it's all working. And this applies to stream processing too. If event-driven architecture is really going to become a more integral part of your business, it's quite critical to understand how it behaves as you make various changes to it, as you apply more load to certain places, and that kind of thing.
And so, yeah, we started at the architecture. We wanted to really take an overview of the way that it functions. And ksqlDB is set up pretty classically like a client-server architecture like you might find with MySQL or Postgres. You have a set of servers that cluster together and they execute commands. They could execute queries, they could execute other DDL or DML statements and they communicate with clients. And so if you're new to it, kind of picture you're at a command line, you're issuing commands. They're sent off to a server. What's different about a streaming database like ksqlDB is that many of your commands execute as daemons. They kind of run forever. But periodically you can run other commands from your client, and it will communicate back. And so it's a pretty classic deployment model, even though what's going on in the server is pretty different.
Yeah, it is. I mean, you're at a command line, you're typing SQL, and it's going to some server, and you're getting results back. So that's a pretty comfortable feel. The persistent query thing is where it's different, though. Because normally a SQL query is a completely synchronous affair. You type it in, and you wait. And maybe you wait a few seconds or whatever, and then stuff comes back and it's done. And if you don't get a command prompt back, something's broken. In KSQL, it's that same interaction. You type in some SQL. It goes to the server and things happen, but you may not get a command prompt back depending on what the command is because it's running, and there are going to be new records coming into that topic, and you're going to [crosstalk 00:04:57].
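To make the interaction model concrete, here is a minimal sketch at the ksqlDB CLI. The stream, topic, and column names are hypothetical; the point is the contrast between a statement that returns immediately and a push query that keeps running.

```sql
-- Registers a stream over a Kafka topic and returns a prompt right away,
-- like a classic DDL statement.
CREATE STREAM readings (sensor VARCHAR KEY, reading DOUBLE)
  WITH (KAFKA_TOPIC='readings', VALUE_FORMAT='JSON', PARTITIONS=1);

-- A push query: the prompt does not come back. Rows keep streaming to the
-- client as new records arrive on the topic, until you cancel the query.
SELECT sensor, reading FROM readings EMIT CHANGES;
```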
Yeah, that's right. And that's why this distinction is so useful. Most databases are made to handle data at rest. You put in some data when you write it, and it's kind of boring. The data just gets into your database. It's maybe indexed. But it's boring. There's nothing happening. It's only when you issue a query and you do a read that any of the work happens. And this actually has a lot of downsides: it's mainly human-driven, it's on-demand. And really, we want to get to a world where the data is always in motion.
And so when you write the data into the database, these persistent queries are running all the time. It's not only re-indexing your data, but it's refreshing materialized views for you automatically. And really doing all the work that you used to do inside of your application queries that are very heavy, but taking care of it as soon as events arrive. And the result is that you get these perpetually fresh views that you can query deterministically with low latency.
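As a rough illustration of those perpetually fresh views, here is a hedged ksqlDB sketch with hypothetical names: a persistent query maintains a materialized view as events arrive, and a pull query reads the current value with a quick lookup.

```sql
-- Source stream over a Kafka topic (hypothetical schema).
CREATE STREAM pageviews (user_id VARCHAR KEY, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON', PARTITIONS=1);

-- Persistent query: runs forever, refreshing the view as each event arrives.
CREATE TABLE views_per_user AS
  SELECT user_id, COUNT(*) AS total
  FROM pageviews
  GROUP BY user_id
  EMIT CHANGES;

-- Pull query: returns the current value and completes, like a classic lookup.
SELECT total FROM views_per_user WHERE user_id = 'alice';
```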
Nice, nice. And I guess, ideally, less complexity in your application code at that point. Because a certain amount of that stuff that you would have had to hack up yourself is now living in the server because, well, it's what we call a streaming database. It's not an actual historical classical database.
Yeah. That's right. In the past, the idea of turning the database inside out was really popular and it was very useful. But the reality is when you do that, the guts of your database are kind of splayed onto your application. And what we want to do is actually tuck it all back behind the abstraction of the database again, and this is really one way to do that.
All right. I like that. Now for some reason, I've been addressing the listener directly a little more today than I usually do. I don't normally do that a lot. But dear listener, what we're going to do is we're going to plow through the outline of this course. I'm just going to ask MD some kind of high-level questions, and ask him to riff on them. So again, useful information here, and it is sort of a teaser, compliments of the chefs, for the course itself. So stateless operations, this is kind of the simple stuff. It's always simpler than stateful. Anyway, what's the basic idea?
Yeah. So we kind of start with Hello World: what's the most basic thing that you can do with a ksqlDB program? Let's look at that and then let's dive deep and tinker with it. Look at how it works and what it does. And so a stateless operation, maybe I want to create a stream, or I want to do a basic transformation, take some data from one topic, and move it to another. When you do any sort of persistent query that does work, what actually happens is that when you send that command off to the ksqlDB servers, we've been saying it runs perpetually. Well, what does that mean? And who's doing what? And why does that matter?
The way that it actually works is that you take the SQL statement, and ksqlDB's grammar and parser compile it to a physical execution plan. And it runs it with Kafka Streams. And so it creates a topology inside of Kafka Streams, which is a very popular stream processing library. And that's kind of cool because it's making use of all the lower-level things that a lot of hard work has gone into, a lot of sweat as far as making it performant and free of defects. And so ksqlDB will manage the execution of that topology indefinitely. It will stay up. And so it's actually quite interesting what's going on under the covers. We're constantly taking advantage of things that are going on further down the stack and trying to make use of them.
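For reference, a stateless persistent query of the kind Michael describes might look like the following sketch (hypothetical topics and columns). ksqlDB compiles the SELECT into a Kafka Streams topology and runs it indefinitely.

```sql
-- Source stream over an existing or new topic.
CREATE STREAM clicks (user_id VARCHAR KEY, url VARCHAR)
  WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON', PARTITIONS=1);

-- Stateless transformation: each row is filtered and reshaped independently,
-- and the results are written to another topic.
CREATE STREAM clicks_cleaned
  WITH (KAFKA_TOPIC='clicks_cleaned') AS
  SELECT user_id, LCASE(url) AS url
  FROM clicks
  WHERE url IS NOT NULL
  EMIT CHANGES;
```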
Nice. And that is, I think, an important detail. So if you know Kafka Streams and you're already a Kafka Streams developer, you can think of ksqlDB as a way of having a computer write a Kafka Streams topology for you by expressing the query in SQL, which is pretty dang cool. And there was another thing I was going to... Oh, scale. Scale. Right. So you mentioned that there's a cluster of servers, and how does that all work? Well, you can sort of defer to how Kafka Streams clustering works, and it's the same answer for ksqlDB. So you've got horizontal scale. And I don't think we're going to dive... Do we get into that? Maybe we'll get into that a little bit later.
A little bit. Yeah.
We'll see what time allows us. But yeah, scale is a thing that you can do, and relatively effortlessly compared to doing this all on your own.
Yeah. And really just knowing this one principle, that ksqlDB programs are actually Kafka Streams programs, answers a lot of questions. People often ask me which would be better to use. I could use either, but which would be more performant? And the answer is that either is as good as the other, but if you can choose the better abstraction, go with that one. We're trying to make ksqlDB the most convenient one, but semantically it's the same thing under the covers. And so you reuse your mental model of how do consumers work, how do consumer groups work, how does sharding work, how do keys work? And that all applies here.
Yep. All of it. And fair point. You've got choices. Kafka Streams is a thing, and it's not like ksqlDB has permanently elbowed it in the face or anything like that, and it's just out of the picture. It's the layer beneath in terms of the abstractions, but it's also a perfectly respectable thing to use. And there are people who make that choice. So that's totally cool.
So you'll see this when the course comes out, I'm looking right now at some slides, and MD has made this nifty SVG and JavaScript-based animation framework to kind of show you how some of these stream processing operations work. It's nice for me to say that. Either you're listening to our voices, or if you're watching this on YouTube, you're looking at our faces. And I'm saying, "Well, I'm looking at these slides, and believe me, they're cool." But trust me, they're cool. And we'll see if we can put a link to them in the show notes maybe if that works out because there is some nifty stuff in there. Now, stateful operations. Moving right along. What's different about stateful things?
Yeah. The primary difference is that when you do a stateless operation, it's relatively simple. ksqlDB doesn't really need to remember anything between the rows that it processes. And so you can imagine it's sort of sitting there. It's getting a row one at a time. It's changing the case of a column, or you're doing a comparison and you're producing some new data. But there's really no memory that's involved other than just kind of keeping track of the last place that it processed.
Stateful operations get into a much more difficult part of computer science where you're trying to remember things across records. And so that's challenging for a whole bunch of reasons. I think even in the simplest case you need to come up with a good abstraction for how to manage that intermediate state. In the hardest case, when you're trying to program for fault tolerance and deal with failures, it's a whole challenging area of building a stream processing framework.
And so, yeah, and of course we dive into how this works. And just to touch on it briefly, ksqlDB uses a really interesting fault tolerance mechanism where it kind of backs data up in two different ways. And by doing it these two different ways, you get a more interesting programming model, and then different recovery characteristics. And so we talk a little bit about the notion of how ksqlDB uses Kafka and these changelogs to be able to store unmaterialized state updates. It's almost as if you have an audit trail of everything that's ever happened. And then you also have a more compressed version, a materialized version, that sits on the ksqlDB servers. And so that ends up being useful, both for running performant queries, but also for doing really quick recovery as well.
Right. And I think I'm getting ahead a little bit, but hey. The local state is kept in a local state store. We'll talk about that a little bit more. And that's persisted as a snapshot locally, but also backed up into an internal topic. Or I shouldn't say backed up, but it's also persisted in the Kafka cluster.
Yeah, that's right. So if you want it to, say, take a moving average of what sensor readings are happening hour by hour, you want that query to be pretty performant. Each time you issue it, you don't want to have to go back in time, and then look at all the readings and do the average. Because as you get more and more readings, the query will get slower and slower. And so you do need a place to store that incrementally updated aggregation, and ksqlDB does that right on its servers. And so it's a nifty design how it's able to make that work.
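The sensor example might look roughly like this in ksqlDB SQL (hypothetical stream and columns). The hourly average is maintained incrementally in a state store rather than recomputed from scratch on every query.

```sql
CREATE STREAM sensor_readings (sensor VARCHAR KEY, reading DOUBLE)
  WITH (KAFKA_TOPIC='sensor_readings', VALUE_FORMAT='JSON', PARTITIONS=1);

-- Stateful persistent query: keeps a running average per sensor per hour,
-- updating the aggregate as each reading arrives.
CREATE TABLE hourly_avg AS
  SELECT sensor, AVG(reading) AS avg_reading
  FROM sensor_readings
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY sensor
  EMIT CHANGES;
```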
Nice. Streaming joins. That's the next section. So how do those work? What is the interesting internals behind that?
Yeah. Streaming joins may be the most confusing part of stream processing in general. It's a really tricky thing to explain. When you and I think about joins, we think of the classic, here's a table here on the left and a table here on the right. And the data is at rest. It's frozen in time, and you can take a pencil and draw which rows should connect with the others. And after a little while, you understand the difference between a left outer join and an inner join. And you start to be able to figure out if I have table A and table B, then table C must look like this because you just have these totally frozen snapshots in time.
The problem is really much trickier to think about when you have data in motion. You have these streams of data, and they're constantly moving. There's some data in stream A that should appear in stream B. How do you sort of freeze time and get the data from A to B, and then from B to C? And so that's what streaming joins are all about. There are different flavors of joins. ksqlDB, if you're familiar with its syntax, has streams for modeling immutable data and tables for modeling mutable data. And it supports different joins between all of them. You can join between a stream and a stream, a stream and a table, and a table and a table. And there are different semantics for each.
And so, for example, if you were going to join between a stream and a table, it sort of has to be bounded: when will that join actually fire? And so in the course, we go through the buffering mechanisms ksqlDB uses to store data temporarily and create a window where a stream will join with a table: when that join will fire, when it will miss, when it updates, and what those semantics really are.
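A stream-table join of the kind described above might be sketched like this (hypothetical orders and customers). Each arriving order row is looked up against the current state of the table.

```sql
CREATE TABLE customers (customer_id VARCHAR PRIMARY KEY, name VARCHAR)
  WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', PARTITIONS=1);

CREATE STREAM orders (order_id VARCHAR KEY, customer_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON', PARTITIONS=1);

-- Fires once per order: the order is enriched with whatever the customers
-- table contains at the moment the order is processed.
CREATE STREAM enriched_orders AS
  SELECT o.order_id, c.name, o.amount
  FROM orders o
  LEFT JOIN customers c ON o.customer_id = c.customer_id
  EMIT CHANGES;
```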
Right. It's the stream-stream joins that get weird. I don't know if you can do that without a visualization, but give it a shot. What does that even mean?
This is the one that we actually don't cover in the course. It was a little hard to stuff into an animation. I'd like to come back to it someday. But it is probably the most complex join style, where you have two infinitely sized streams, and data in one is useful with data in the other. And so how do you stitch those two things together? What ksqlDB does is it-
Without infinite time.
Yeah, without infinite time. And what it really does is it creates a buffer on both sides. And it's reliant on you, the programmer, to say how long am I willing to wait? What's my tolerance? How much memory am I willing to use up? And so as data comes in, it creates a match across the two buffers. And so you join stream A to stream B, and you actually get stream C out of that. And you're able to control it with normal SQL syntax: you get to say when to fire, is it a left join, is it an inner join, is it something else? And that buffering design is really key to the way the whole thing works.
Exactly. I think the key thing to say is when you introduce the concept of a stream-stream join, you kind of watch the wheels turn. People are trying to reason about it: I don't have infinity.
Yeah. That's right.
They're kind of wondering how this works. It's always within a window. So that's the short answer. There's a window in which you consider the streams. And so it's this stuff versus this stuff. And that window has new things sliding through it all the time. And so your resulting stream is always fresh.
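In ksqlDB syntax, that window shows up as the WITHIN clause on a stream-stream join. A hedged sketch with hypothetical payments and shipments streams:

```sql
CREATE STREAM payments (order_id VARCHAR KEY, amount DOUBLE)
  WITH (KAFKA_TOPIC='payments', VALUE_FORMAT='JSON', PARTITIONS=1);

CREATE STREAM shipments (order_id VARCHAR KEY, warehouse VARCHAR)
  WITH (KAFKA_TOPIC='shipments', VALUE_FORMAT='JSON', PARTITIONS=1);

-- WITHIN bounds the buffers on both sides: only events whose timestamps are
-- at most one hour apart can match, so the join state never grows forever.
CREATE STREAM shipped_payments AS
  SELECT p.order_id, p.amount, s.warehouse
  FROM payments p
  INNER JOIN shipments s WITHIN 1 HOUR ON p.order_id = s.order_id
  EMIT CHANGES;
```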
And I think this is one of the things that Kafka Streams and ksqlDB really get right in terms of their abstraction. There are lots of frameworks out there and programming models that kind of allow you to do whatever you want. But it turns out, most of the time, if you can do whatever you want, you can probably reliably write a broken program. So ksqlDB is very focused around providing a good interface that forces you to make good decisions and steers you away from things that are just inherently impractical.
Absolutely. Yeah. There's a real tension there. I mean, I think the direction of programming languages in the last 20 years, 30 years has been away from I can do anything I want, in the direction of let me give you a powerful tool because you're a smart person and you also need help.
Yeah. I'm a big fan of Clojure, and that's really the ongoing philosophy of Clojure, which is there are power tools here, but none of the power tools are going to allow you to do something that will just always be wrong, no matter how much you might want it. Infinitely sized buffers are not a thing you can get out of the box in the core library, as a quick example.
They're just not a thing. And buffers that can overflow. That's not a thing you should be allowed to do. It's funny, the DevRel team is creating a bunch of language-specific quick starts right now that we're going to publish on Confluent Developer. So you want to get started with Python. You want to get started with JavaScript. You want to get started with Go. Obviously, Java, we've got resources for already. But we're trying to round out that coverage. And C was one of them. We did one on C.
Brave.
Right. But it's important. I mean, there are people who use C and C++. And so it's not like it's a disreputable language. It's just we all know the warts. And I was talking to the developer advocate who was building those. Chris Jenkins is his name. Also a man with some affection for Clojure. And he commented that the code was so much longer than all the other ones. And the thing that it was doing was memory management. It was always, do I have to free this thing? Okay, I'm going to free this thing. Okay. I'm going to allocate this thing. And that just is the-
Yeah, huge stride forward in the last couple of decades. You don't even think about it now. But when you go back, you're like, wow, how did we do this?
Right. Right. And so taking that principle, and that was a super elemental thing that languages did, and applying it to we want to give you a powerful API, but we actually don't want to let you do anything you could possibly conceive of. Because for certain of those branches in the tree, the bridge is out in that direction. We don't want to let you go there.
Yeah, that's right. That's right.
Scaling and fault tolerance. We kind of hinted at this, but there's much more to say. The animations are super cool, folks. I'm just going to tell you that. I'm not going to show you. But talk to us.
Yeah. I think it's hard enough to get people to understand what's going on in a single-node context. ksqlDB is not only fault-tolerant, but it's distributed and it's performant, and really getting people to understand what happens as you add nodes, that's a whole other layer of the puzzle. And so, yeah, we talk quite a bit about how you deploy these persistent queries. They run on a server. But what happens when you add two or three, or 10 or 15 servers? What's going on under the hood?
And again, we get to fall back on the regular Kafka client model, where these persistent queries are really packages of producers and consumers. They're deployed as a consumer group. And if you add more servers, they'll actually be able to fan across your nodes as a fleet, just as you're used to. So if you had two servers, and maybe you're consuming from four partitions, anyone who's written a Kafka consumer knows that you'll get one consumer with two partitions, the other with two. And this is nice. You get to recycle everything that you know. It's easier to monitor. Just putting aside how to architect it, you can take advantage of a lot of your existing monitoring techniques because you'll sort of know how it's supposed to shard your data, and how it should behave as you continue to add more load.
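To tie that to the SQL, consider a sketch like the following (hypothetical topic with four partitions). The persistent query runs as roughly one Kafka Streams task per input partition, and those tasks are balanced across however many ksqlDB servers join the cluster, much like members of a consumer group.

```sql
-- Four partitions means up to four tasks of parallelism for queries over it.
CREATE STREAM orders (order_id VARCHAR KEY, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON', PARTITIONS=4);

-- With two ksqlDB servers, each typically ends up processing two partitions;
-- add servers and the tasks rebalance across them automatically.
CREATE STREAM large_orders AS
  SELECT order_id, amount
  FROM orders
  WHERE amount > 1000
  EMIT CHANGES;
```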
Kafka is made out of Kafka, I like to say. And so ksqlDB is made out of Kafka Streams, which itself is made out of consumers and producers. And so all those scaling properties just propagate up that stack of things.
Yeah, I think it's one of the few examples of software out there that's really gotten this right. You want to be able to reuse everything in the layers underneath you. And as you add more layers, this becomes a really difficult balancing act. But over the decades, I think the whole Kafka ecosystem has really gotten this correct.
I could not agree more. Even in places where it's not literal, like task assignment to workers in Kafka Connect, those aren't literally consumer groups, but there's a lot of borrowing there. And if you needed to dig into the internals and you knew consumer groups, and now you're digging into Connect, it's going to feel like home. The French toast is made like mom makes it, and the coffee is familiar, and all that kind of stuff.
Yeah, absolutely.
I say, mom. Growing up, my kids, I was the one who made the French toast. Why did I have to say that? I've totally cut myself off of 20 years of French toast cooking. And how about state? And we're perilously close to time. We might run a little long this time. But we talked about state. And when you're scaling out, how does that work? There's going to be state aggregated on a node. It's going to be in memory. And there's this amorphous thing where it's stored on the server. And this wasn't in the course, but how do you move that from one node to another when you're going from a four-node cluster to a six-node cluster?
This is one of the cool things where it ties back to something we learned earlier. And so in the course, we look at changelogs and what function they play. As you have these stateful operations, and you knock out a node or you add a new one, those changelogs are what allows state to kind of teleport between different places. You take the unmaterialized data, and then you load it into a state store. And what's nice is that as your compute cluster comes and goes, your storage remains. You have these brokers that are running in a fault-tolerant manner. You can always recover the data from there.
And so we do actually look a little bit in this course at how this works for a few minutes. And then we close it out by looking at high availability. With these stateful operations, we look extensively at changelogs and the role that they play in a cold recovery: you lose a node, you have nothing, and you need to kind of rebuild the world as you come back online. But what about these more partial failures, where you have a five-node cluster and one node fails, or three nodes do? If you're able to replicate your state, you can actually fail over in a very small amount of time. And we step through how that works exactly. And again, it builds on concepts that we've already looked at, which is what makes the whole thing so elegant to me.
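For the partial-failure case, ksqlDB exposes a handful of server settings that keep warm standby replicas of state and let pull queries read from them during failover. The following ksql-server.properties fragment is a hedged sketch; the key names are taken from ksqlDB's high-availability documentation, but verify them against the docs for your version.

```properties
# Keep one standby copy of each state store on another server.
ksql.streams.num.standby.replicas=1

# Allow pull queries to be served from (possibly slightly stale) standbys.
ksql.query.pull.enable.standby.reads=true

# Let servers track each other's liveness and replication lag for routing.
ksql.heartbeat.enable=true
ksql.lag.reporting.enable=true
```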
Yeah. And that is kind of the last section that you covered, which was high availability. I mean, the way scale works and the way HA works in KSQL all rely on the same mechanisms.
Yeah, that's right.
Yeah. Cool. Well, I liked the technology. I liked the course. There are some great animations. And I do need to say I think your delivery was excellent.
Thank you.
And it's also free. Like I said, by the time this podcast runs, I don't know if this will line up with the availability of the course. We may try to time that, so check the show notes. And if the course is up when we run this, then there will be a link to it. And it really is worth watching. My guess is it's going to be 30 to 45 minutes of finished video content, and just some good stuff to really... This was the quick version, but you'll get a slightly more detailed and more visual version. And it's worth it.
We've got a course also on simply how to use KSQL, which maybe is where you want to start. But once you get comfortable with syntax and just the basic semantics of the thing, if you're like me, you do want to know what's going on under the hood. You need one layer down so that it's not magical. And so that when something doesn't look right, you're able to reason about it because you've got tools and enough knowledge of the internals. That's really why we did this because we want you to be able to use this thing effectively and think about it as the smart person that you are.
Well put.
I get that a lot. We're not going to edit that out. Anyway, what's next that you can talk about? What's coming up that you're able to talk about in the world of ksqlDB that you're excited about?
Yeah. I think the roadmap is really interesting. We're working on a couple of different fronts, if I could just go through them briefly. Much of the emphasis in the course is around the query layer. You create these tables of state. How do you ask questions about them? And so we're continually making this more and more robust. The pull query layer is getting more expressive. We're putting a lot of investment into the push query layer to make it highly scalable. We have a goal for clients to be able to execute thousands of concurrent subscriptions to your data.
And so think about a SQL query that gives you results on demand as the result changes. And so that's kind of what we're after. We're putting a lot of investment into what would really be the API. What are the kinds of programs that you can run? How do you work more with time? How do you work with more data formats? And then the last thing is we put a ton of effort into making Confluent Cloud the best place that you can run ksqlDB. I think cloud computing is the future for all kinds of infrastructure, and stream processing isn't an exception. So we want to make this the best place that you can possibly do stream processing.
My guest today has been Michael Drogalis. Michael, thanks for being a part of Streaming Audio.
Thanks for having me.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer. That's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening, or reach out in our community Slack or Forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
ksqlDB makes it easy to read, write, process, and transform data on Apache Kafka®, the de facto event streaming platform. With simple SQL syntax, pre-built connectors, and materialized views, ksqlDB’s powerful stream processing capabilities enable you to quickly start processing real-time data at scale. But how does ksqlDB work? In this episode, Michael Drogalis (Principal Product Manager, Product Management, Confluent) previews an all-new Confluent Developer course: Inside ksqlDB, where he provides a full overview of ksqlDB’s internal architecture and delves into advanced ksqlDB features.
When it comes to ksqlDB or Kafka Streams, there’s one principle to keep in mind: ksqlDB and Kafka Streams share a runtime. ksqlDB runs its SQL queries by dynamically writing Kafka Streams topologies. Leveraging Confluent Cloud makes it even easier to use ksqlDB.
Once you are familiar with ksqlDB’s basic design, you’ll be able to troubleshoot problems and build real-time applications more effectively.
The Inside ksqlDB course is designed to help you advance in ksqlDB and Kafka. Paired with hands-on exercises and ready-to-use code, the course covers topics including ksqlDB’s architecture, stateless and stateful operations, streaming joins, scaling, fault tolerance, and high availability.
Michael also sheds light on ksqlDB’s roadmap: a more expressive pull query layer, a push query layer scalable enough for thousands of concurrent subscriptions, broader support for time handling and data formats, and making Confluent Cloud the best place to run ksqlDB.
Tune in to this episode to find out more about the Inside ksqlDB course on Confluent Developer. The all-new website provides diverse and comprehensive resources for developers looking to learn about Kafka and Confluent. You’ll find free courses, tutorials, getting started guides, quick starts for 60+ event streaming patterns, and more—all in a single destination.
EPISODE LINKS