It seems like there's no bottom to the topic of how to make Kafka cloud native. There's always some cool new thing to explore. Today I talk to Prachetaa Raghavan, an engineer who works on Confluent Cloud, about a recent blog post he wrote that's really about how cloud-native Kafka works. The control plane, things with Kubernetes, tiered storage, all the goodies. We're going to go through them all on this episode.
Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. That's a learning website dedicated to teaching you about Kafka and Confluent Cloud. We've got video courses, patterns for event-driven architecture, executable tutorials, all kinds of things, really everything you should need to get started on your Kafka journey. When you sign up, you'll want to do examples and labs and things like that in the cloud, so use the code PODCAST100 to get an extra $100 of free usage. Now let's get to the show.
Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined in the virtual studio today by Prachetaa Raghavan. Prachetaa is a staff engineer here at Confluent. He's a coworker of mine, and he works on the cloud control plane team. So all kinds of cool things to talk about. He's also the author of a relatively recent blog post; we're recording this in the middle of November 2021, and the blog post was mid-summer, end of July, I think. It was called Making Apache Kafka Serverless: Lessons from Confluent Cloud. Prachetaa, welcome to the show.
Thanks so much, Tim, for having me here.
You got it. So I want to talk through that blog post. There's a lot of really cool stuff in it. And you being an engineer on the control plane, you know things, and there are things that I think are just interesting to people who want to think about cloud services, people who want to deep dive on Kafka, there's just a lot to learn. But first, before we dig into that, tell us a little bit about you. You're a staff engineer here, but how did you get into this role? What got you interested in this? Yeah, where are you coming from?
Yeah, absolutely. And it's an interesting journey. So prior to Confluent I was at a company called Delphix. It was a relatively small company at that time, probably very similar in size to Confluent. And when I moved to Confluent, I had no clue about Kafka or Kubernetes or-
That's okay, me too.
... any of the cool things that Confluent is doing. So initially I started out on the operator team at Confluent, which dealt with how you provision, de-provision, or upgrade Kafka clusters in Confluent Cloud. And from there the journey took me toward different pieces of Confluent Cloud. I ended up being one of the people who worked on Kafka Expansions, which shipped probably a year ago, and then Kafka Shrink, which is coming out soon. And that's where all of these lessons about how you deploy and run Kafka in Confluent Cloud came from; that's how the experience grew.
Cool. Yeah. Shrink we added later. We made it easier to make clusters bigger, because that's usually what people want to do. And then later on, if you need to make them smaller, that's okay too. It's always harder, when you talk to product management, to prioritize things like that for some reason; we're not sure why. But it's just about GA now, I think.
Yes, that's right.
Anyway, making Kafka serverless. I remember the first time I read the phrase serverless Kafka; it was when we started using that phrase here at Confluent. My first reaction was like, what? Who came up with that? Serverless is about functions, serverless is AWS Lambda. And that's, I think, the programming that a lot of us come to the table with. So what does serverless mean, if it's a thing that can encompass Kafka? And by the way, anytime I talk to anybody on the show about something serverless, I ask them this question, and I think everybody has a different answer, but tell me, what is serverless?
Yeah, for me serverless is basically: as a user, I do not have to care about how you're going to run my application. All I have to care about is that I have this piece of application or code that I want to run, and the infrastructure and all of the niceties are already taken care of. And I as a user also get billed based on how much I use, not like, "Hey, I have this much bandwidth or this number of servers that I have to keep running for my application." It's like, if a request comes along, you spin up, you run the code, and then spin it down. That's serverless to me.
Yep. Absolutely. And that was my first introduction to serverless as well, which was AWS Lambdas, and that was the perfect abstraction for serverless that I know of.
Yeah. But like a Kafka topic in the sky where you don't know what a broker is, you don't know about scaling. These are not things you think about, that's the serverless Kafka idea.
Yeah, absolutely. And on the Kafka front, it's pretty much a lot of configuration, and it's a distributed system that you need to run. Running distributed systems, especially at scale, is pretty hard, and it's not just Kafka; there's ZooKeeper as well that you need to run, which is going away, but right now you do need to run both of them together. And if you can abstract all of the complexities of running Kafka and the infrastructure, that's the serverless event-driven platform that Confluent Cloud offers.
Got you. And it is pretty cool, because like I said, I remember the first time I heard the phrase, I was skeptical. It sounded like somebody was trying too hard. But then once you think about it and you free your mind from serverless equals cloud functions, you realize, no, there is a core abstraction in whatever the service is. It's a database, it's a distributed log, it's an application server. It's something. And just exposing that, as it were, as an API with no infrastructure concerns visible to you at all is pretty powerful.
Now talk about how you do this. In the blog post, you talk about four components: the control plane, Kubernetes operators, self-balancing clusters, and infinite storage. Dive in. I mean, maybe we can even just back up and start with Kubernetes. This is all based on Kubernetes, and massive numbers of Kubernetes robots behind the scenes doing things. How does that help?
Yeah. So Kubernetes is very critical to what Confluent Cloud offers. Let's maybe take the concept of a broker, and say you're running it on either bare metal or even EC2 instances for that matter. If the broker goes down for whatever reason, you need to have a whole alerting system, and you need to have everything in place to make sure you're able to recover from that scenario. Kubernetes is a very simple container orchestration platform, but it's very pluggable and portable.
You can run Kubernetes on different machines, you can run it in your cloud; AWS, GCP, and Azure all offer it natively, and they also offer a lot of niceties where you can easily scale up and scale down. And Kubernetes itself is a reconciliation system, wherein basically if an issue happens, it will try to recover the state automatically. It's pretty much like you have a specification, you say this is my desired state. You're specifying the what and not the how. As long as the what is specified, Kubernetes takes care of realizing that state to its end goal. And if there are any issues that it encounters, it'll fix them itself.
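The reconciliation idea Prachetaa describes, declare the what and let the system converge, can be sketched in a few lines of Python. Everything here (the resource names, the replica counts) is hypothetical; it just illustrates the shape of the control loop Kubernetes runs:

```python
# Minimal sketch of a Kubernetes-style reconciliation loop.
# The "desired" spec says WHAT we want; the loop decides HOW to get there.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Compare desired state to actual state and return repair actions."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} replica(s) of {name}")
        elif have > want:
            actions.append(f"stop {have - want} replica(s) of {name}")
    return actions

# A broker pod crashed: actual has 2 replicas, the desired state says 3.
desired = {"kafka-broker": 3}
actual = {"kafka-broker": 2}
print(reconcile(desired, actual))  # -> ['start 1 replica(s) of kafka-broker']
```

In a real controller this loop runs continuously, so a crashed broker pod gets replaced without anyone paging an operator.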
And of course, there are lots of different kinds of whats that you might specify. And that makes the hows interesting. Because for a stateless application server type of component, you can stand those up and shut them down and move them around and they're fine. But things with state get harder, and there are lots of custom ways of doing the how.
So this is a dorky analogy. Everybody listening to this knows what Kubernetes is, but for infrastructure like this, for a cloud service like this, it's the distributed equivalent of an operating system. It's the set of primitives that you use to spread your state and compute around, to allocate computing resources to them.
I'm sure that analogy's been made before. Now that I'm thinking about it, I like it. X is the operating system of Y is not always my favorite analogy, but that one's working. I'll think about that later. Anyway, the control plane. Moving on to the control plane. That was, I think, the first thing on the list in the blog post. Tell us what that means. If you're new to the term, what's a control plane, and walk us through the control plane in Confluent Cloud.
Yeah. So the control plane in Confluent Cloud, you can think of it as the central nervous system. It has a lot of things that it does internally. For example, one of the jobs of the control plane is to expose all of the internal as well as external-facing APIs for the customer. So a customer comes along and says, hey, I want a 4-CKU Kafka cluster.
A CKU is just a unit of how big it's going to be for a dedicated cluster.
It's like saying four brokers, but without actually saying four brokers.
Yep. So it's pretty much the job of the control plane to realize that state. If you say, I want a 4-CKU Kafka cluster in AWS in us-east-1 or something of that sort, the control plane takes that request, and then, in Kubernetes terms, it's basically going to realize that state. Some of the operations in the control plane are also the exposed APIs, like if you want to create an API key or delete an API key; all of those pieces are also handled and taken care of by the control plane. And beyond that, there are certain internal things the control plane does, like placement and capacity management, which are transparent to the customer or the user. They don't know where we're placing things or how our capacity is managed internally. That's all an internal function within the control plane.
Got it. And so bringing Kubernetes back into the picture, Confluent Cloud exposes a user interface. That's something like, "Hey, I want to create a new cluster." "What type of cluster would you like?" "I'd like a dedicated cluster." "Well, how many CKUs would you like?" So you're saying how big you want the cluster in some amorphous unit. That's a nice user interface for a fully managed Kafka, right? Writing YAML would be an example of a very, very bad user interface for a fully managed Kafka, because it wouldn't be all that dang managed. So you're saying it's the control plane that takes what I see in the user interface, just thinking of the case of cluster creation, and turns that into Kubernetes API calls, basically.
Yeah, pretty much. So the control plane has to decide a few things. If the customer says you can place my Kafka cluster in this region, the control plane has to also decide what the size of the cluster is, convert the whole CKU definition into internal broker definitions, and configure the whole Kafka cluster. And all of this is done through a YAML specification. I know we dissed YAML, but internally with Kubernetes, you're specifying the what, not the how. So you're specifying: this is my Kafka cluster, these are my placements, this is the size of the storage, and so on. And that gets passed on to the Kubernetes operator and gets realized into the Kafka cluster itself.
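A rough sketch in Python of the kind of translation the control plane does. To be clear, the CKU-to-broker mapping, the resource numbers, and the spec shape below are all invented for illustration; they are not Confluent Cloud's actual sizing rules:

```python
# Hypothetical translation of a user-facing cluster request into an
# internal, Kubernetes-style spec (the "what" the operator will realize).

def cluster_spec(ckus: int, cloud: str, region: str) -> dict:
    """Turn a CKU request into a (made-up) internal broker specification."""
    brokers = ckus  # assumption: one broker per CKU, purely illustrative
    return {
        "kind": "KafkaCluster",
        "spec": {
            "cloud": cloud,
            "region": region,
            "brokers": brokers,
            # Fixed per-broker resources, again just placeholder values.
            "resources": {"cpu": "4", "memory": "16Gi", "storage": "1Ti"},
        },
    }

spec = cluster_spec(4, "aws", "us-east-1")
print(spec["spec"]["brokers"])  # -> 4
```

The dict plays the role of the YAML custom resource: the user said "4 CKUs in us-east-1," and the control plane produced a declarative spec the operator can realize.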
Now you talk about how the outbox pattern shows up in this process. That's a helpful pattern. So number one, tell us, for anybody who doesn't know, what's the outbox pattern, and how does it show up in this flow?
Yeah, absolutely. So the outbox pattern is... Maybe we can take an example here. So let's say you're ordering something from Amazon or something of that sort.
That's unlikely to happen. Well, unlikely to happen while I'm recording this podcast, but a daily activity. Go on.
Yep. So as soon as you make the order, there's basically an order entry that gets created. But at the same time, you also need to update your inventory with whatever, X minus one, because someone has ordered this product. And then you have to also ship the product, and so on. So the outbox pattern is basically: as soon as the order comes in, you update the order table by saying, this is the order and so on, but you also make an entry into a table called the outbox table, which is then consumed as events and propagated to other services. You do it in a single transaction. You make an entry into your order table, and you also make an entry into your outbox table. And then you have change data capture, which looks at the entries in the outbox table, or you can think of it as an events table, and stores them into Kafka. And then you have consumers from Kafka which consume these events and process new results out of them.
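The outbox pattern described above can be sketched with SQLite standing in for the order database. The table names are made up, and the `relay_outbox` function is a stand-in for what would really be a CDC tool producing to a Kafka topic:

```python
import sqlite3

# Outbox pattern sketch: write the order row AND the outbox event in
# ONE transaction, then a relay (CDC stand-in) drains the outbox.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT)")

def place_order(item: str) -> None:
    with db:  # single transaction: both inserts commit, or neither does
        db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        db.execute("INSERT INTO outbox (event) VALUES (?)", (f"ordered:{item}",))

def relay_outbox() -> list[str]:
    """Stand-in for CDC: read outbox rows and 'publish' them downstream."""
    with db:
        events = [row[0] for row in db.execute("SELECT event FROM outbox ORDER BY id")]
        db.execute("DELETE FROM outbox")
    return events  # in real life these would be produced to a Kafka topic

place_order("book")
print(relay_outbox())  # -> ['ordered:book']
```

The key property is atomicity: because the order row and the outbox row commit together, downstream consumers can never see an order without its event, or vice versa.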
So you're transactionally changing state and creating an event that lets other consumers react, have something to react to, right?
Yeah, that's right.
Got it. And how does that show up in the control plane?
Yeah, absolutely. So the control plane, like we mentioned, translates the CKUs into a Kubernetes CRD. A CRD is a custom resource definition, and that gets stored into the database in the control plane. And then we have a similar pattern: we have change data capture that looks at these tables and pushes those events into Kafka. So we dogfood Kafka internally as well. And then we have consumers in the data plane, which is where the customer clusters are running, listening for these events. As soon as one sees there's a CRD it needs to look at, it takes that and stores it into K8s, which is our data plane. It's not the exact outbox pattern, but it's very similar to it.
Cool. Now let's see what's next on our list. You've got the control plane, the operator. So what's a Kubernetes operator? I have a sense of how that's going to plug in here, but tell us what one is and where it fits into Confluent Cloud.
Yep. So a Kubernetes operator is a very application-specific controller whose main functionality is to extend the Kubernetes API in terms of creating, configuring, and managing instances of your custom application on behalf of the Kubernetes user. In this case, we're talking about Kafka. So the Kubernetes operator is responsible for realizing the state of the Kafka cluster. We spoke about how the CDC pipeline and everything gets us the Kubernetes CRD, or the CR to be specific, the custom resource, which is an instance of the CRD.
So once you get that, the goal of the operator is to realize the Kafka state. The CR, for example, says: I want Kafka brokers of X size. It has CPU specifications, memory, storage, and so on. It also covers, for customers who use their own keys to encrypt the storage, how you configure that storage. The operator is responsible for doing all of that. And on top of that, it's also responsible for placement. If you create a multi-zone cluster, you want to make sure the brokers are spread across zones and your replica placement is also taken care of. Our Confluent Cloud Kubernetes operator is responsible for realizing all of this state.
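The zone-spreading placement mentioned above can be sketched as a simple round-robin assignment. The zone names and broker count are hypothetical, and real placement logic weighs much more than this, but the goal is the same: a single zone failure should take out as few brokers as possible.

```python
# Sketch of spreading brokers across availability zones round-robin,
# so losing one zone loses at most ceil(brokers / zones) brokers.

def place_brokers(broker_count: int, zones: list[str]) -> dict[str, str]:
    """Assign each broker to a zone in round-robin order."""
    return {f"broker-{i}": zones[i % len(zones)] for i in range(broker_count)}

placement = place_brokers(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(placement["broker-3"])  # -> us-east-1a (wraps around after one pass)
```

With six brokers over three zones, each zone gets exactly two brokers, which is what you want before replica placement constraints are layered on top.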
Got it. So Kubernetes, as this system that maps a declarative description of what you'd like onto actual computing resources, needs the operator to know how to take this new resource and make it a thing out in the world.
Nice. Which, this message was brought to you by Confluent Platform. Confluent for Kubernetes is what used to be called Confluent Operator; it's that whole thing we had to build for Confluent Cloud, extracted and productized. So if you run on-prem, you can take some of this stuff that Prachetaa and team have built and use it in your own Kubernetes cluster. Now, self-balancing clusters. That was the third. I think we might get through this whole list; I'm feeling good about this. So far we've got the control plane, the Kubernetes operator, and now self-balancing clusters, which sounds like a thing that would just be a feature of the operator, but tell me what they are and how they work.
Yeah. So self-balancing clusters is basically a feature of Kafka. When Kafka realizes, okay, I have new brokers added to this cluster, it basically does a self-balancing call, in the sense that it sees that the new brokers don't have any leaders or any replicas on them. So it knows it needs to rebalance the cluster so that the load is spread out across all brokers. However, it's a very complex feature.
It's not just about, for example, storage or replicas or something of that sort. It has a lot of other dimensions to it. It thinks in terms of leader counts, disk usage, and network usage, and it also looks at other constraints, like the amount of available disk and network capacity, when it makes these balancing decisions. Prior to SBC, though (SBC is self-balancing clusters), the operator was the one that handled this. When the operator saw these new brokers come in, it used to tell Kafka, hey, you have new brokers; initiate a rebalance to these new brokers. But since then Kafka has this built-in feature, so the operator doesn't have to care about it anymore.
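A toy version of that multi-dimensional balancing decision is sketched below. The metric names and weights are invented, and real self-balancing considers far more (constraints, headroom, replica movement cost), but it shows the idea of collapsing several load dimensions into one comparable score:

```python
# Toy rebalancing decision: pick the least-loaded broker for the next
# replica by scoring several dimensions at once. Weights are made up.

WEIGHTS = {"leaders": 1.0, "disk_gb": 0.01, "net_mbps": 0.05}

def load_score(broker: dict) -> float:
    """Collapse multiple load dimensions into one comparable number."""
    return sum(WEIGHTS[k] * broker[k] for k in WEIGHTS)

def pick_target(brokers: dict) -> str:
    """Choose the broker that should receive the next replica."""
    return min(brokers, key=lambda name: load_score(brokers[name]))

brokers = {
    "b1": {"leaders": 40, "disk_gb": 900, "net_mbps": 300},
    "b2": {"leaders": 10, "disk_gb": 200, "net_mbps": 50},   # newly added
    "b3": {"leaders": 35, "disk_gb": 800, "net_mbps": 250},
}
print(pick_target(brokers))  # -> b2
```

The newly added broker scores lowest across every dimension, so a greedy balancer would route new replicas (and leadership) toward it until the scores even out.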
How much actual copying of log files is done? Because, I mean, in principle, the task in keeping a cluster balanced is to move a partition replica from one broker to another. So how much actual bulk moving of files over network connections is done?
Yeah. So that depends on how the cluster is configured. Most of our clusters in Confluent Cloud have this feature called infinite storage, so the amount of data that's actually on those brokers is a subset of the overall data. The shuffling is pretty small in terms of how much data has to move across the brokers.
And that actually is the next thing I wanted to ask you about. That's good, by the way, because the bulk moving of data is expensive, right?
There's all kinds of just literally physical reasons why it costs energy and time to make that happen. We'd rather not do it if we can avoid it. So tiering is a nice way to keep the amount of data that you have to move smaller. But walk us through tiered storage. It's a super obvious idea, but it's only obvious after you've thought of it. So tell us what it is and how it works.
Yeah. So tiered storage basically means you have a second layer of storage for the Kafka brokers. There's a hot set, which is configurable, and the hot set defines how much of the data is actually retained by the brokers themselves. The rest of the data then gets tiered to this secondary storage. For the most canonical messaging use of the system, you're basically going to read your own writes as soon as they're written. But for the other use cases where you have to read historical data, that's fine too; you just get those reads from the second tier of storage. That's basically what tiered storage is in its entirety.
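That read path can be sketched as a two-tier lookup. `HOT_SET` and `OBJECT_STORE` below are plain dicts standing in for broker disks and S3-style object storage, with made-up offsets; the point is that the consumer calls one function either way:

```python
# Tiered-storage read sketch: serve recent offsets from the broker's
# local "hot set", fall back to object storage for historical offsets.

HOT_SET = {101: b"new-1", 102: b"new-2"}      # recent data on broker disk
OBJECT_STORE = {1: b"old-1", 2: b"old-2"}     # tiered historical segments

def read(offset: int) -> tuple[bytes, str]:
    """Same consumer API either way; only the latency differs."""
    if offset in HOT_SET:
        return HOT_SET[offset], "hot"       # fast local read
    return OBJECT_STORE[offset], "tiered"   # slower object-store read

print(read(102))  # -> (b'new-2', 'hot')
print(read(2))    # -> (b'old-2', 'tiered')
```

Because tail reads dominate most workloads, nearly all requests hit the hot branch, and only historical replays pay the object-store latency.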
So the hot set, or the first tier, is log files on brokers. I mean, in serverless Kafka we don't think of such things, but in fact, in the system you run, there are brokers and they have storage, and there are partitions that are kept by those brokers. It's Kafka. So that hot set, or the hot tier, the fast tier, is that regular thing, and the other one is something like S3 or Azure Blob Storage.
Yes, that's right.
Okay. And as a user, this being, again, a serverless-ish topic, I don't need to worry, I don't need to make a separate API call to go to the blob storage. It's just higher latency.
Yep. As far as you're concerned, you're just producing and consuming data, and that's pretty much what you see as a user. Internally, Confluent does all of this magic, and it's just transparent to everyone.
Yeah. Confluent Cloud's doing a bunch of stuff. You're just using the same old producer and consumer APIs. And like you said, that latency hit isn't going to happen much, because you're usually reading the hot set, unless you have some process that needs to go back in time, in which case... And I've made this point recently: we've always done this. You've got processors that have a register file and cache, and another tier of cache, and then main memory, and then a disk, and then some tape thing, or even S3. S3 has tiers. There's the fast tier, and there's the Glacier tier that's probably on some tape that some robot goes and picks out of a wall. We always do this. There's a trade-off between latency and cost, and we're happy to make that trade-off.
Absolutely. And I think the other piece that goes unnoticed in here is also how it helps shrink overall. Because in shrink, you're removing brokers. And then that means basically you have to also rebalance the data out of those brokers that are being removed to the remaining set. And since you only have a subset of data in the brokers, you can pretty much add and remove brokers very quickly. So that's also a huge advantage towards this whole serverless Kafka ecosystem.
Yep. And once it's in the cloud, then it's in the cloud. You might as well take advantage of these things that are effectively infinite. You're not going to run out. I was talking to Gwen Shapira; in fact, that episode may have aired in the last week or two before this one, or maybe not yet. The order shuffles a little bit between recording and when episodes go out. But recently Gwen was on the podcast, and she made the point: you're not going to run out of S3. You're just not. Maybe you're big and cool, but that's not going to happen. So what do you see for the future? Where does this all go? This seems pretty complete to me, but you talked about the future a little bit in the blog post. So where do you go?
Yeah. I think one of the things with serverless is, if you look at, for example, AWS Lambdas, they basically shut down when you don't need them, and then they scale up when you do need them. We envision Kafka being on the same journey. We're not there yet. Customers can scale up and scale down now, but you're not really scaling Kafka down to zero. There's still a computer there that's running, and you still have the data in there. We want to see how we can realize that vision. Can we scale it down to zero, with resources dynamically allocated based on workloads and policies? That's where we want to take the journey: users only need to think about their client application logic, and Kafka resources magically arrive when needed and disappear when you're not using them.
There's some non-obvious product thinking that has to go on there. There are interesting decisions, particularly about data, because it's hard for that to scale to zero when you've got things stored. Where does that go? There are a bunch of trade-offs. It'll be interesting to see how the team tackles that, and I'd like that fully managed Kafka even better.
Yeah, absolutely. And it's going to be an interesting journey. We'll see how we get there.
My guest today has been Prachetaa Raghavan. Prachetaa, thanks for being a part of Streaming Audio.
Absolutely. Thanks, Tim for having me again.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and the core Kafka APIs. There's even an index of episodes of this podcast. If you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
You might call building and operating Apache Kafka® as a cloud-native data service synonymous with a serverless experience. Prachetaa Raghavan (Staff Software Developer I, Confluent) spends his days focused on this very thing. In this podcast, he shares his learnings from implementing a serverless architecture on Confluent Cloud using Kubernetes Operator.
Serverless is a cloud execution model that abstracts away server management, letting you run code on a pay-per-use basis without infrastructure concerns. Confluent Cloud's major design goal was to create a serverless Kafka solution, including handling its distributed state, meeting its performance requirements, and seamlessly operating and scaling the Kafka brokers and ZooKeeper. The serverless offering is built on top of an event-driven microservices architecture that allows services to be deployed independently, with their own release cadence, and maintained at the team level.
There are four subjects that help create the serverless event streaming experience with Kafka: the control plane, Kubernetes operators, self-balancing clusters, and infinite storage.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.