We talk sometimes on this show about how making a service like Confluent Cloud run fast isn't just a matter of throwing Kafka on some really big cloud instances and pressing play. There's actually a lot of engineering to do. So I talked today to Alok Nikhil and Adithya Chandra about their work. They're both engineers on the Cloud team here at Confluent, and their work is optimizing Kafka in Confluent Cloud.
We get into some pretty cool stuff. Before we get there, this is your reminder that Streaming Audio is brought to you by Confluent Developer—that's developer.confluent.io, a website with everything you need to get started, learning Kafka, Confluent Cloud, whatever it is you might want, it's probably there. We've got video courses, event-driven design patterns, executable tutorials, all kinds of cool stuff.
You'll probably want to end up using Confluent Cloud to do exercises and some of the Kafka tutorials. If you do, when you sign up, use the PODCAST100 code for an extra hundred dollars of free usage credit. Now let's get to the show.
Hello and welcome to another episode of Streaming Audio. I am, once again, your host, Tim Berglund. It's like I'm just never stopping being your host, Tim Berglund. I'm joined in the virtual studio today by Alok Nikhil and Adithya Chandra. They are both software engineers on the Cloud team here at Confluent. Alok and Adithya, welcome to the show.
Thanks Tim. Can I just start by saying, I'm bummed out that you're actually leaving this, and I'm also honored that we are probably one of the last few guests, as you said, that you'll be signing off with. It was super awesome. We've worked together once in the past. Thank you so much, congratulations on your career here at Confluent, and good luck with what you do next.
Thank you. Thank you. I wasn't going to mention that on the show today, but in a few episodes it will be the end of the Tim-hosted podcasts, and there will be new hosts. Streaming Audio isn't going anywhere.
Yeah, yeah. I'm excited to be on this as well. Yeah. I really enjoyed the previous podcast that we did and I think, let's get started.
Welcome back. It's always, I think, as I say, the triumph of hope over experience for folks to return to the show. Maybe next time it'll be better. Anyway, you guys have a blog post that was published a while back. This is the middle of November 2021 that we're recording this, and it might even be out in the new year, I'm not sure when. But you had a blog post in the middle of the summer called Speed, Scale, Storage: Our Journey From Apache Kafka to Performance in Confluent Cloud. I'm literally reading that right off my screen right now. And honestly, this is the podcast version of the blog post.
So the blog post will be linked in the show notes, but I always think it's fun to talk through this stuff, because if you didn't read the blog post and if you're listening to Streaming Audio while you're walking or working out or taking care of your caterpillar farm, whatever. Whenever it is that you listen to Streaming Audio, this can be a good time to go through these things.
So honestly, I just wanted to talk through some Kafka performance and Cloud performance things with you. So kick us off. What was your general approach? And this will be a jump ball. You guys can... either one of you can take this, but what are the important topics in performance? Kafka performance in the Cloud.
Yeah, for sure. I think I can start briefly and then Adithya can jump in any time. Basically, what we were trying to focus on, particularly with Confluent Cloud, is setting it apart from the open source version of Kafka. When you manage Kafka on your own, there are real struggles with trying to tune it, and Confluent Cloud is trying to wrap all of that into a single package, right? And performance itself, when you think about it, has multiple dimensions. So we tried to restrict this to some extent, in the sense that we were using a few principles that we thought were very relevant for us to achieve these goals. Particularly, we had three principles in mind. The first one: we need to understand our workload. I mean, there's nothing worse than trying to tune for something that you'll probably never run or host.
So that's where we started. We tried to understand our users better, and to understand our workloads better. We had metrics in place that allowed us to see how some of these workloads were using Kafka, particularly in the dimensions that were relevant to performance, for example the number of partitions, or the number of connections. These are just examples of what this entailed, right? The second principle was more about looking at infrastructure, and where infrastructure helps us tie these things together. Because, again, starting from the workload, it's probably not intuitive what kind of infrastructure you need to support a given workload, but at the same time, it doesn't make sense to just generalize everything and run everything on one big, massive machine. It doesn't work economically.
It doesn't work... Even more than economically, I think it just doesn't work if you don't use those additional cores, for example, or the additional IOPS [inaudible 00:05:37] provision. So this is just another example of where we needed to consider another principle. The third principle, finally, would be more about observability. Which is: now that you've put things in place, now that you have some of these patterns in place, are we doing good? Are we still achieving the goal that we need? Are we falling short somewhere? Did we regress somewhere else? So I think that's the idea of how you tune something in the Cloud.
So you're saying a good, economical, performant, fully managed Kafka isn't just running the Kafka process on a bunch of instances of some type and putting a sign-up form in front of it.
It's a little harder than that to make this work well. Performance is what I call a rugged landscape kind of problem. Let me illustrate that, in case you're not familiar with the analogy. There are some problems that are like Mount Fuji, near Tokyo, or like Mount Rainier. If you see them from a distance, it's just this one mountain, right? There's nothing around it, and then there's this single peak.
And there are problems like that, right? If you wonder in your IDE what the best font size is, that's a Mount Fuji problem. And that changes with time, of course. I just took my glasses off before we started rolling here; that's a thing for me. Your vision changes, so maybe at some point you want a bigger font, but you can't just make it huge, because then you can't see as much code on the screen at once, and you're dealing with short-term memory, or working memory, and all that stuff. So there's an optimal font size. You can actually figure out what that number is; it's just optimization. With performance, there's a bunch of different things you can change, a bunch of dimensions, and instead of a Mount Fuji, you want to think of a rugged mountain range. I live by the Rocky Mountains. You look at them, you fly over them.
And there's just all these peaks and you kind of look across and you're like, "I don't know what the highest one is." And when you're walking in the mountains, it's really deceptive because you think, "Okay, I'm almost at the peak of this hill," and you get there and you're like, "Oh no, there's this other peak." And then you get to a peak and you might think I'm at the top, but it's hard to tell. It's hard to tell when you're at the top without searching all the spaces and traversing the different dimensions of that landscape. Changing different parameters in the system is just going to change the elevation of things. So you're making this point that you actually have to observe. You have to understand and you have to observe, because it's a rugged landscape problem, you have to take all these measurements.
That's enough from me; this is your podcast. I don't know why I felt like that illustration was important. Let's dive into knowing your users. That was principle one. What do you do there?
Cool. Adithya, do you want to just take that? There's some work that Adithya has been doing on this too, so.
Yeah, sure. So, the thing about Confluent is it's a managed service, and we have all these different customers who are using it in very different ways. What we get to see is what's happening on their Kafka brokers; that's what we have visibility into. And based on that, we want to tune, or improve, performance in areas that would be beneficial to a large number of our users. To understand that better, we focus on different aspects. The third principle ties into it as well, which is observing. So we learn from our fleet first. When we're learning from our fleet, we have a bunch of things. We have historical metrics, so we get to see what's happening to a workload over time. Do they have patterns?
Do they have a lot of partitions? What are their message sizes; are they sending small messages? How big are their consumer groups? What's happening with their offset commits; are there times when there are more offset commits? Are there rebalances? There are a host of different things that we look at. And then what we want is a reproducible workload that we can run before we make our changes. This helps us be reasonably confident that there are no performance regressions, and when we improve something, that it's a net benefit, a net improvement, and it does not spoil performance in other areas. So that's our main objective there. So yeah, we look at these metrics, and like you mentioned, it's a very broad space, and it's hard for us to know exactly which areas will be beneficial.
Some workloads are CPU bound, so we know we want to improve CPU performance. There might be a better algorithm, you might cache something, and that will improve performance a lot in that particular area. But they may also be disk bound, IO bound, in which case we want to do something else. Or if they're memory bound, we may want to change other things. So that's what we're thinking about when we create these workloads. But we also have to keep in mind that we cannot run every possible workload, so what we try to do is create these profiles and create our workloads to match them.
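As a rough sketch of the kind of triage Adithya is describing, here's a minimal, hypothetical example of classifying a workload profile by its dominant bottleneck so the right benchmark can be replayed. The metric names and thresholds are invented for illustration, not Confluent's actual ones.

```python
# Hypothetical sketch: classify a sampled workload profile so the matching
# benchmark (CPU-, IO-, or memory-bound) can be replayed before a change.
# Field names and thresholds are illustrative only.

def classify_workload(metrics: dict) -> str:
    """Return the dominant bottleneck for a sampled workload profile."""
    if metrics["cpu_util"] > 0.80:
        return "cpu-bound"
    if metrics["disk_io_util"] > 0.80:
        return "io-bound"
    if metrics["heap_used_frac"] > 0.85:
        return "memory-bound"
    return "balanced"

profile = {"cpu_util": 0.55, "disk_io_util": 0.92, "heap_used_frac": 0.40}
print(classify_workload(profile))  # io-bound
```

In practice the profile would be built from fleet metrics over time (partitions, message sizes, consumer group sizes, offset-commit rates) rather than three instantaneous gauges, but the shape of the decision is the same.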
Also, when we're creating the workloads, they're related to Confluent Cloud. You have this concept of CKUs, which have limits, and we have to be confident that we can hit those limits in most cases. And the limits are, again, in a lot of different dimensions: a max limit in dimension A, a max limit in dimension B. But if you push A, B, and C all together, you may not be able to hit all of them. So these are some of the things that we need to keep in mind when we're understanding our users. And yeah, maybe Alok has more to talk about.
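The multi-dimensional limits Adithya mentions can be sketched like this: a workload has to fit within every dimension at once, and adding CKUs scales all the limits together. The limit numbers below are made up for illustration, not Confluent's published CKU limits.

```python
# Hypothetical sketch: a CKU grants headroom in several dimensions at once,
# and a workload must fit within ALL of them, not just one.
# Limit values are invented for illustration.

CKU_LIMITS = {
    "ingress_mbps": 50,
    "egress_mbps": 150,
    "partitions": 4500,
    "connections": 9000,
}

def fits(workload: dict, ckus: int) -> bool:
    """True if the workload stays within every scaled dimension."""
    return all(workload[dim] <= limit * ckus for dim, limit in CKU_LIMITS.items())

wl = {"ingress_mbps": 80, "egress_mbps": 200, "partitions": 6000, "connections": 500}
print(fits(wl, 1))  # False: ingress alone exceeds one CKU
print(fits(wl, 2))  # True: every dimension fits at two CKUs
```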
Yeah, I completely echo that. Particularly, with some of these workloads, it's a problem of finding whether you want to optimize for the local minima or the global minima. These are the kinds of things that become tricky at large scale, right? Sometimes you could overfit to a workload that you think is generic, but then maybe a new customer signs on, they start a workload, and we're completely off the mark. That's a potential regression. So that's where it becomes a little tricky. It's a balance of how much we can extract versus just leaving things at the defaults in some cases, because that just works.
Right. Right. So you, Adithya, checking my notes here, described a six- or seven-dimensional mountain range. With my rugged landscape analogy, we can only visualize three-dimensional mountain ranges, where you can move in latitude and longitude, and then you've got performance as the elevation. You want the peak. And Alok, really good point about the local max... You said minimum, but maximum, whatever.
They're the same problem. Finding the local maximum versus the global maximum. Obviously, you want the global maximum, but you can't tell when you're there, because it's too expensive to search the whole space. And the solution... I mean, I guess this whole discussion is sort of a love note to simulated annealing, the algorithm that gives you... what's the right word? Good confidence that you're probably somewhere near the global max.
It's a probabilistic thing. You're probably close. Don't worry about it, it's fine. And that's how it works. It's too expensive to search the whole space, so that's how you get there. But what we have working for us here is that you guys don't just optimize on one workload; you have to introduce randomness in solving problems like this.
And that's all the different customer workloads.
So a new customer has some new and idiosyncratic way of doing things, and that knocks you off the peak of what you thought was good. You're back to the grind, trying to figure out, "Oh wait, that wasn't the peak, I have to climb back up here now," you know? So, that whole thing. It's a good visual. It's a good visual. Think of mountains.
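The search strategy being gestured at here can be sketched as simulated annealing: a random walk over the parameter landscape that always accepts uphill moves and sometimes accepts downhill ones, with a probability that decays over time, so it can escape local peaks. The one-dimensional "rugged landscape" objective below is invented purely to illustrate the loop.

```python
import math
import random

# A minimal simulated-annealing sketch over an invented 1-D "rugged landscape".
# Real tuning searches over many dimensions (instance type, thread counts,
# flush settings, ...); the accept/cool loop is the part that carries over.

def rugged(x: float) -> float:
    """An invented bumpy objective: two sine waves plus a gentle quadratic penalty."""
    return math.sin(3 * x) + 0.5 * math.sin(17 * x) - 0.05 * (x - 2) ** 2

def anneal(steps=5000, temp=2.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(-5, 5)
    best_x, best_y = x, rugged(x)
    for _ in range(steps):
        cand = x + rng.gauss(0, 0.3)
        delta = rugged(cand) - rugged(x)
        # Always accept uphill moves; accept downhill moves with decaying
        # probability, which is what lets the walk escape a local peak.
        if delta > 0 or rng.random() < math.exp(delta / temp):
            x = cand
            if rugged(x) > best_y:
                best_x, best_y = x, rugged(x)
        temp *= cooling
    return best_x, best_y

print(anneal())
```

The result is probabilistic, exactly as described above: you get good confidence of being near the global maximum without paying to search the whole space.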
Yep. Yep. Yep.
It's like an arduous mountain climb in inclement weather.
Exactly. Inclement weather is the new workload. Yep.
But yeah, I think you said message characteristics, topic partition distribution, number of connections, which I guess... Oh, consumer group size and the number of consumers and producers. There were a couple of other things you said that I can't remember, but that's a lot to understand, and we have the benefit, being a Cloud service, of just being able to see all that.
Yep. A hundred percent.
Nice. All right. Gets us to infrastructure. Now the simplest understanding of that and, Adithya, you were talking about if a workload is memory bound or CPU bound or IO bound, there can be different kinds of instances and it's not just instance type, right. There's other stuff. So walk us through infrastructure choices and how those are informed by knowing your customers.
Right. So, one thing is that hardware is still getting better. So even if we just keep moving to the latest hardware behind the scenes, you get benefits. In Confluent Cloud, you buy a CKU and then you're using your Kafka instance, and if we change the hardware behind the scenes, you automatically get improvements. So understanding hardware is pretty important for a managed service. And there are different dimensions, like you said. For CPU, the new thing is these Arm instances, which have a very different price-performance profile compared to x86 instances. But at the same time, we have to make sure that there are no performance regressions, and it's not just Kafka.
We also have all these other things that run alongside Kafka, and we have to make sure all of them run reasonably well. We have things like SSL; we have to make sure all of that works well. And this is where knowing your users and building that simulation framework comes in, because, like you said, it's a search problem, and there will always be cases that are unique. If we change something behind the scenes while people are using the product, we have to be very confident that they will not notice it, that performance will only get better and there won't be a regression. That's why, before making any of these changes behind the scenes, the first principle is so important: we have to understand the landscape to a reasonable extent.
And then we can mitigate anything we've missed with a slow rollout and better observability, which is the third principle, which we'll get to. And the hardware itself... CPU is one of the things I mentioned, but there's a huge range of instance types. You have CPU-optimized instances, memory-optimized ones. Do you get more memory? What's the ratio of memory to CPU? What's the ratio of storage?
Do you go with larger disks? And then you have all kinds of disks. Even if you use network-attached disks, do you go with higher-IOPS disks or with cheaper storage? What will give you the best price-performance ratio? In our case, we want it across the board. And then there are also network-optimized instances, where you get the higher bandwidth they're optimized for. In addition to all this, infrastructure is not just the hardware we pick. Once we get the hardware, what are the other things that we need to change or tune? If we have more cores, we'd want more threads in Kafka to utilize those cores.
Kafka tuning, not just infrastructure choices, but-
So we have to tune Kafka to run well on that particular infrastructure, and Alok has been working on this closely, so I think you have some more thoughts.
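The thread settings alluded to here are real Kafka broker configurations that are commonly scaled with core count. A sketch of what that tuning might touch in `server.properties` (the values below are illustrative, not Confluent's production settings):

```
# Illustrative values only; scale with the cores available on the instance.
num.network.threads=8      # threads handling network requests
num.io.threads=16          # threads handling request processing, including disk I/O
num.replica.fetchers=4     # parallelism for follower replication fetches
```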
The Kafka tuning part.
Yeah, a hundred percent. Just to extend that thought: even with hardware, we're constantly in this dance of finding the best hardware that's also widely available, because capacity is a big issue in the Cloud too. Some regions just don't have enough. Let's say you just decided to choose a new instance type, a large size in a network-optimized family.
Which is the crazy thing that, if you're not an engineer building a Cloud service at scale, you don't know. You don't realize you can run out of the Cloud sometimes.
Like S3. You're not going to run out of S3, at least not in a global sense. But within a region? Yeah. There are computers there, and there are only so many of them.
Exactly, exactly. And it's funny, because of some of the supply chain issues that we've been seeing due to the pandemic, it's interesting that it actually shows up in the Cloud too, because some of the newest AMD chips that we want to try and use are just not available yet.
That's the worst.
Yeah, so it is interesting. As I said, it kind of holds us back from pushing all the way through, and sometimes we kind of have to re-organize our plans around that. But yeah, just to continue the thought that Adithya was talking about: the OS [inaudible 00:21:15], definitely the operating system and the JVM. All of those, up the stack, need to be tuned just as much as the hardware to make full use of the system. One example that we recently encountered was with the page cache, when we try to flush it to disk, right?
The page cache operation itself is controlled completely by the operating system. Now, if you just go with the defaults, it works great for general use cases, but for some of the high-throughput and high-IOPS use cases that we've seen in the Cloud, this can cause a significant amount of thrashing. Particularly because you either flush too often, or you don't flush enough, and then when you do flush, you flush for long periods of time, completely blocking all of the IO. Some of that starts bubbling up the stack, and then customers start seeing impacts all the way up in their applications. So these cascading effects are pretty significant, and some of these issues eventually show up as poor performance. And then when you actually-
Somebody files a ticket.
Yeah, exactly. Somebody files a ticket, and then you're looking at it and you're like, "Oh wait, this never showed up in any of the workloads and tests that we did, but now it is." So that goes back to the feedback loop: take all of this, go back to know-your-users, principle one, and see how you can actually fix it there. Yeah. It's an endless cycle of constantly going through this.
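The flush behavior Alok describes is governed by a handful of Linux virtual-memory sysctls. A sketch of the knobs involved (the values here are purely illustrative, not a recommendation):

```
# Linux page-cache writeback knobs (illustrative values).
vm.dirty_background_ratio = 5     # % of memory dirty before background writeback starts
vm.dirty_ratio = 40               # % dirty before writing processes are blocked to flush
vm.dirty_expire_centisecs = 3000  # age at which dirty pages must be written out
```

Set too loosely, flushes are rare but huge and block IO for long stretches; set too aggressively, the system flushes constantly. Either extreme produces the thrashing described above.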
Is JVM tuning... GC tuning isn't what it was 10 years ago. The modern garbage collectors hate you less. But do you still spend time on that?
Yeah. Adithya's the resident expert in GC tuning.
Yeah. Kafka has been great for us. It uses memory really well. So we haven't spent a lot of time on tuning. We've made sure that we give it a limited amount of memory, like four gigs to six gigs, and that's worked really well in production for us.
But there are these exceptional situations where things might not work as we anticipated. There are certain cases where Kafka will use more memory than usual for a particular workload, if you're creating a lot of new transactions, for example, and it's important for us to be able to identify these cases and then fix them. So again, this is where monitoring helps us. We keep track of GC time across our fleet; we have dashboards for it.
And whenever we see full GCs, which we don't expect, we go in and investigate. We have tools to take memory dumps, because this is pretty rare; it does not happen often. Most of the time things are going well, but in the rare cases where it does happen, for a particular workload in a particular scenario, we take the memory dumps. We've been able to identify a lot of issues using these dumps. Like I mentioned, a recent one that was fixed in Apache Kafka was related to a lot of user IDs using up memory. It's an operational muscle that we've developed at Confluent Cloud: whenever we see an issue, everybody knows how to take a memory dump, and because we have these smaller heaps, it's easier for us to download those, analyze them using these tools, and find out where the allocations are and how we can improve that.
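The trigger for those investigations, unexpected full GCs showing up in fleet metrics, can be sketched as a simple scan over JVM GC logs. The log lines below use the JVM's unified-logging style, simplified, and the whole example is an illustration rather than Confluent's actual tooling.

```python
import re

# Hypothetical sketch: scan JVM GC log lines and pull out full-GC pause times,
# the signal that would kick off a heap-dump investigation.
# Log format is the simplified unified-logging style: "[gc] GC(7) Pause Full ... 812.4ms"

FULL_GC = re.compile(r"Pause Full.*?(\d+\.\d+)ms")

def full_gc_pauses(log_lines):
    """Return the pause times (ms) of any full GCs seen in the log."""
    pauses = []
    for line in log_lines:
        m = FULL_GC.search(line)
        if m:
            pauses.append(float(m.group(1)))
    return pauses

log = [
    "[12.345s][info][gc] GC(6) Pause Young (Normal) 23.1ms",
    "[13.001s][info][gc] GC(7) Pause Full (G1 Compaction Pause) 812.4ms",
]
print(full_gc_pauses(log))  # [812.4]
```

In a fleet setting the same signal would come from exported GC metrics rather than raw logs, but the idea is identical: young-generation pauses are expected, full GCs are the anomaly worth a memory dump.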
Which gets us to principle three, observability. We've done a show fairly recently in the last few months on observability in Confluent Cloud. But tell us how that intersects with your performance optimization work.
Yeah, yeah, that's true. We actually have a challenge there, especially when I reflect on how I've worked on distributed systems in the past. I used to work on MySQL, developing a DB engine, and we actually lacked the metrics back then to dive deep enough into the system to understand how things work at scale. Kafka actually has significantly more metrics, and it's sometimes too much; that's where it becomes a problem of over-indexing on things, right? So what we try to focus on there is, of course, the basics first: CPU usage, memory usage, network packets per second, throughput, all of the things that allow us to understand how the system performs just by looking at the basic system stats, right?
And beyond that, of course, we have specific component-wise metrics, some on the replication side, some on the storage side, things that dive deep enough into Kafka to allow us, for example during on-call, to understand where a problem could be. But beyond metrics, we also have a significant amount of logging, request tracing, and profiling, all of which come together as a suite of tools that allow us to better understand performance at scale. One example of a tool that we use is async-profiler. This is a widely used JVM tool that allows us to collect real-time profiles of the different stack traces and function calls, without actually going in and running a Java agent to instrument the Java process, which typically slows things down.
So this has been significantly useful for us in understanding, for example, lock contention, or anywhere we may have a bug or a regression. It's particularly useful for some of the nightly tests that we run, tying back to principle one. When one of these continuous tests flags something that's off the baseline, we go back to the profiles that we collect automatically, look at them, and we might actually see that there's a recent change that, instead of indexing into the map, goes through all the keys in the map. That's something that happened recently. So this is the kind of thing that allows us to catch issues early on.
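async-profiler can emit "collapsed stack" output, one line per unique stack with frames joined by semicolons followed by a sample count, which is what flame graph tooling consumes. A hypothetical sketch of aggregating that output to find the hottest frames (the stack lines below are invented examples):

```python
from collections import Counter

# Hypothetical sketch: aggregate collapsed-stack profiler output
# ("frame;frame;frame count" per line) to find which frames burned
# the most CPU samples across all stacks.

def hottest_frames(collapsed_lines, top=3):
    """Sum samples per frame across all stacks and return the hottest."""
    counts = Counter()
    for line in collapsed_lines:
        stack, _, samples = line.rpartition(" ")
        for frame in stack.split(";"):
            counts[frame] += int(samples)
    return counts.most_common(top)

lines = [
    "java/lang/Thread.run;kafka/network/Processor.run;sun/nio/ch/SelectorImpl.select 120",
    "java/lang/Thread.run;kafka/log/Log.append;java/util/HashMap.get 300",
]
print(hottest_frames(lines, top=1))  # [('java/lang/Thread.run', 420)]
```

This is the kind of diff you'd run between a baseline profile and a nightly-test profile to spot a regression like the iterate-all-keys change described above.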
And then it's not just about what the issue is; it's also about finding where the issue is. So these tools have been really, really useful. We also internally use Druid and BigQuery. I believe there was also a blog post recently on how we use Druid at Confluent Cloud for some of the metrics APIs that we expose to customers. But we also use Druid internally to access those metrics, which allows us, for example, to dive in on a per-topic or per-partition basis: how the usage was for a given customer or a given cluster. Overall, without these tools, it's basically flying blind, right? You never have enough feedback to actually understand the system at that point.
My guests today, have been Alok Nikhil and Adithya Chandra. Alok and Adithya, thanks for being a part of Streaming Audio.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
Maximizing cloud Apache Kafka® performance isn’t just about running data processes on cloud instances. There is a lot of engineering work required to set and maintain a high-performance standard for speed and availability.
Alok Nikhil (Senior Software Engineer, Confluent) and Adithya Chandra (Staff Software Engineer II, Confluent) share how they optimize Kafka on Confluent Cloud and the three guiding principles they follow, whether you are self-managing Kafka or working on a cloud-native system:
A large part of setting and achieving performance standards is about understanding that workloads vary and come with unique requirements. There are different dimensions for performance, such as the number of partitions and the number of connections. Alok and Adithya suggest starting by identifying the workload patterns that are the most important to your business objectives for simulation, reproduction, and using the results to optimize the software.
When identifying workloads, it’s essential to determine the infrastructure that you’ll need to support the given workload economically. Infrastructure optimization is as important as performance optimization. It's best practice to know the infrastructure that you have available to you and choose the appropriate hardware, operating system, and JVM to allocate the processes so that workloads run efficiently.
With the necessary infrastructure patterns in place, it’s crucial to monitor metrics to ensure that your application is consistently running as expected with every release. Having the right observability metrics and logs allows you to identify and troubleshoot issues relatively quickly. Profiling and request sampling also help you dive deeper into performance issues, particularly during incidents. Alok and Adithya’s team uses tooling such as async-profiler for profiling CPU cycles, heap allocations, and lock contention.
Alok and Adithya summarize their learnings and processes used for optimizing managed Kafka as a service, which can be applicable to your own cloud-native applications. You can also read more about their journey on the Confluent blog.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.