May 25, 2021 | Episode 160

Running Apache Kafka Efficiently on the Cloud ft. Adithya Chandra

  • Transcript
  • Notes

Tim Berglund:

In our continuing series on distributed systems engineering, I talk to my colleague Adithya Chandra about performance optimization in distributed systems. There's some really new things that he's gotten into and he shares them with us on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Tim Berglund:

Hello, and welcome to another episode of Streaming Audio. I am as always your host, Tim Berglund and I'm joined in the virtual studio today, which is a virtual video studio also, just as a reminder, these are available on YouTube as well as in audio format. Anyway, I'm joined in the virtual studio today by my coworker, Adithya Chandra. Adithya, welcome Streaming Audio.

Adithya Chandra:

Yeah, thanks, Tim. Thanks for inviting me. And it's great to be on the show and talk about distributed systems.

Tim Berglund:

Cool. Now you are an engineer in the Kafka performance team here at Confluent, and this is one of our episodes in a series where we're talking about distributed systems engineering. So I want to talk a little bit about what you actually do on the Kafka performance team, but also about how you got into that kind of work and just what separates it from other kinds of engineering.

Adithya Chandra:

Yeah, that sounds great. And in the Kafka performance team, we are looking mostly at improving Kafka performance in terms of scalability, in terms of cost and a big chunk of it is like running Kafka super efficiently on Confluent Cloud so that we get the biggest bang for the book when we are running it there. And also like scalability in terms of we have lots of customers and all of them have completely different use cases. How do we run Kafka efficiently for each one of them? And especially there are these use cases with like there'll be some customers who are really large, some customers are really small. Some customers have large message sizes, some of them are small message sizes.

Adithya Chandra:

It's the whole bunch of things. And so we will really want to focus on that to improve it. And like have a platform that you like our customers don't have to worry about performance at all. They can throw any kind of workload and we should be able to handle it. So that's kind of what excites me about working on this and how I got here, basically, I've been working... I was at AWS for seven years before this, pretty long time working on the cloud as a group, there was tremendous growth. And I worked on Aurora on the storage layer when I did perform... I did cluster management, load balancing, et cetera. And I also want-

Tim Berglund:

Did you remind us what AWS Aurora is?

Adithya Chandra:

So yeah, AWS Aurora, it's a MySQL and Postgres compatible database, but the key innovation was this separate storage. The incredible thing was like if you have these block storage layers where you're writing like you have elastic block storage, EBS, in AWS where you're writing like booking blocks to a risk, like what you can do in the Aurora storage engine instead of entire blocks, you can write devs, it can be small devs, it can be a small number of bytes, it can be large number of bytes. So for a storage engine that is on the network, it's kind of like using Kafka, right? It's something on the network. If you just write the changes instead of writing an entire page each time, I mean like large number of blocks, which is a collection of pages, what you get is the amount of... If network is your bottleneck, the amount of stuff you have to write on a network goes down significantly.

Adithya Chandra:

So then on our storage side, what we will do is depending on the database that you have, we could apply those changes. And then we could support a ton of features because of that key change. So that was a really interesting innovation that was done at AWS and the storage engine had like tons of features because of that like a typical thing that happens in a database is when you go down and then you have to recover, you have to replay all these changes.

Adithya Chandra:

So if you have a normal MySQL, all the changes get replayed on a single instance. But in this case, what happens is you have all these, you have this big storage cluster, but you're not ready to a single storage instance installation, your database is now writing to a sequel 100, 20, 50, depends on how big our storage clusters. So let's say 100 and all of these can replay the changes because they've got a subset of these changes and they can come back, and they're almost like when your database restarts it's catching up really quickly can just say, "Hey, give me the latest version of the page." And they would like replay it and give you the latest. So that was amazing, so the restart times were fast and we could do availability, right? Instead of you having to worry, we would have in these instances in different Azs. So you wouldn't lose data if any one of them went loose, it was much highly available. And we could also do backup online when, again, your main instance doesn't have to worry about backup.

Adithya Chandra:

You have these other, like the storage servers, which are collecting all these changes, logs, and they are actually working through them and assisting them on this. But they're also like have these backup things which are like pulling this and persisting it lets say in S3 or like something that is a cheaper storage. So you get online backups which are always up to date. You don't have to worry about it. These were some of the features, and then there were a bunch of other things that got built on top of it, like you could clone your storage volume so pretty quickly because now it's on the cloud, you have the storage clusters basically. And then if you spin up a new database cluster, which needs to a new database instance from your current one, which needs to point to this, then you can just have like it can be the same storage nodes that it talks, it just needs to decide from what point you would have cloned and then only writes from there would be a written separately, so that was an interesting feature.

Adithya Chandra:

And also becomes very interesting in terms of replicas.You have all these replicas sitting around in your database, your masters writing and the replicas can now separately read from the storage. So it's kind of like separating storage compute, and it gives you all these advantages because you have these replicas that can directly read from these storage clusters. They can ask, hey, give me this page or that page, whichever page they're interested in. And they can decide how much they'll cache. They don't have to cache everything from the master. They still have to do for transactions, et cetera. There are a bunch of like who wants us to this, but this is the high level thing of how it was beneficial. And that was-

Tim Berglund:

Seven years. Oh, go ahead.

Adithya Chandra:

So that was just one part of it after that I wanted to talk about performance, but that's another story. So yeah-

Tim Berglund:

It sounds like its good preparation. Seven years of that was good training for more distributed systems work, right? You spent a lot of time building really a pretty interesting product. That sounds like there's some sort of log like concepts in there that enabled some of that coolness if I heard you-

Adithya Chandra:

Yes, definitely. It was basically a log [crosstalk 00:08:08] Yes, definitely Yeah.

Tim Berglund:

Cool. So you came there -

Adithya Chandra:

It was bit more-

Tim Berglund:

[inaudible 00:08:21] performance engineering.

Adithya Chandra:

Right. So, there was a segue after that. So I went off talking about Aurora for a long time. But the other thing that I did was I was also a part of the Elasticsearch team at AWS. And a key problem there was a slightly different problem, was we had all these instances of clusters of Elasticsearch running and managing it was a pretty significant problem because there were so many of these. And to do a really good job of managing it, you have to understand the performance and what is actually happening in each of them. So we build something called Performance Analyzer that you look at where for each particular cluster, where is your CPU? Where are your resources being used? Like who's using your CPU? Who's using, where is my memory being allocated? Who are the top consumers? So that we could then take decisions on better load balancing and better audit unit.

Adithya Chandra:

So recently Amazon launched Amazon Autotune in their elastic search service which utilizes there's a lot of these things to automatically tune things like how much memory to allocate, where to allocate memory. And like you can use that for a lot of things that we do on Kafka as well, like have the cells balancing clusters, which can automatically move stuff around and load balanced. So that's kind of how I ended up in this team.

Tim Berglund:

That's their Autotune not to be confused with the voice processing Autotune service completely different thing, right?

Adithya Chandra:

Yeah.

Tim Berglund:

Yeah. That some of us when we sing we could use that. So tell us about how then you came into the Kafka performance team? Because I always like to know when I talk to people who do the kind of work that you do, how did you get there? And it seems like seven years building Aurora and Elasticsearch Autotune things would be pretty good preparation, but tell us a little bit about what you do now.

Adithya Chandra:

Sure, yeah. So now I think the first project I looked at was how can we improve costs Confluent cloud? How can we actually take Kafka and run it with like half the memory? So that was very interesting to me, like having worked at Aurora which was logs, but the abstraction for the user was not logs. And Kafka just takes logs and the abstraction as logs, I think that's super powerful. So a lot of the things that we built on the storage engine was actually logs, right? Like how do we take this with the incoming queue? Like how do you persistent on this? A lot of it was that. And if you can move some parts of it and you can... Anybody can consume the log, like I could clearly tell that this is super powerful and it enables you to build a lot of interesting software on top of it like Aurora, you could have all these databases that use this.

Adithya Chandra:

So I was overall interested in this and at Confluent so the first thing I worked on was this like basically run it with half the memory and the interesting thing there was I had spend a bunch of time in Elasticsearch where memory was a big problem. So you were understand where it's memory actually being allocated. But in Kafka it's like super good at memory, right? If you use it. I was surprised that it uses such a small amount of memory, it's a storage system. And it runs with like six gigabytes of the JV & P. So for context, heap is where we allocate most of the objects in Java, when you create a new object, it's the allocated on the heap and the references to it are held in the stack. And so you need a larger heap if you have a lot of these lying objects that you are constantly like using.

Adithya Chandra:

And Kafka because it persists things to the disk. And it also uses the operating system page cache for more efficient I/O, doesn't really keep too much of these live objects at any point in time, the working set is pretty smart. So that was interesting. I got an opportunity to look at all the other production clusters in Confluent that are running generate a lot of synthetic workloads to understand for different use cases where you have high and different message sizes where you have consumers lagging, consumers who are consuming immediately after a producer, et cetera. And yeah, we were able to sub conquer really well with even just six gigabytes of heap. But what we saw is for a lot of cases, you don't even need six gigabytes. You can go to like four gigabytes and still get fantastic performance, that's pretty small and the working set size or like gigabyte of memory, like the allocation rate was also not so high that we needed too much more.

Adithya Chandra:

And that was interesting. So there was a bunch of things around, how do you measure this? What do we do if there are certain cases where when you exceed this total heap utilization. So those are some of the interesting problems. And the other interesting thing was the page cache itself, which is used to improve I/O reads and I/O write. And also I got to mentioned to set the context, why was this possible? How can you alter the blue shrink memory by 50%? why was it not possible before? I think the key thing was we started using much better disks.

Adithya Chandra:

And how did we start using much better disks? The thing to keep in mind is we have all these SSDs now, which are super good in terms of bandwidth. So they are very fast. You can do random reads, et cetera, but they're very expensive in terms of per gigabyte cost. Like if you now have, let's say 10 gigabyte, like you should have a petabyte of an SSD, it's pretty expensive. But a Confluent launched a tier storage which allows us to move all this data which is a slightly older to a different tier, a cheaper tier, which where the cost per gigabyte is much smaller, but bandwidth is a lot more expensive, slower, so throughput can suffer. So what we did is we moved to SSDs for the tier immediately to what is attached to the broker. And there, we got much higher bandwidth.

Adithya Chandra:

So what happens when you shrink your page cache is that your writes can get smaller, your reads can get a little more frequent, but this SSD can do bandwidth really well. It's more expensive if you want to provision a lot, we do need to provision a lot anymore because we have infinite storage, when we go to S3 or we go to a different object storage in the cloud. And so that was what that really enabled us. And we were able to shrink the page cache significantly. We had to tune parts of it. We've found interesting tuning lessons as part of this exercise. So there are two main things that you need to understand about the page cache. Basic thing is when after writes to the disk. So you have, let's say you have a file, but partitioning Kafka and you wrote all these records to it.

Adithya Chandra:

And these records are not immediately written to the disk because let's say you have a lot of partitions, let's say 1000 partitions, 1000 files and they're all small updates that are going in each file. So what Kafka does is it just tells the operating system, "Hey, I want to write all these things. And all these things are kept in memory and not immediately flushed to the disk. And then we have these called figs where the operating system has these flusher threads that make up every once in a while depending on your conflict and then write all these to disk. When they're writing it-

Tim Berglund:

These are operating system level flusher threads, this is not in-

Adithya Chandra:

These are operating system levels flusher threads. And the thing that you can do here, the operating system does the drivers, like for example, if we take EBS, which is the AWS, what it can do is it can take 32K sequential reads, let's say you have multiple updates that are next to each other, we can combine them into a single 256K, right? I can write it in one write and that is a lot more efficient. So you get a lot a bit much higher bandwidth by doing this buffering. And so your disk throughput is that you can get much higher throughput with the same disk. So that's the advantage of a bigger cache. So you can keep things longer in memory and you can also let the other way it works is if you're exchanging the same piece of information multiple times, and then rewrite it to disk only once. This doesn't happen very frequently with Kafka, but there are some metadata files where this can happen.

Adithya Chandra:

So that can also reduce the amount of I/O that you want. So that's how it's efficient. And on the read side, how it's efficient is basically when you're making these changes, they're all kept in RAM. So if your consumers come out and read immediately a producer, which is what we see in most of the use cases, all the other consumer laggers in milliseconds in most cases. People come up and read what's written almost instantaneously. Then you can service it directly from your memory, you don't have to go to the disk at all. You have all the things that have been written and you've kept it in this past but transient storage in main memory, and you just give the responses back. So this is another thing.

Adithya Chandra:

So what happens here is if you shrink the cache then you have lesser of amount of time that you'd want to keep that. And so that was pretty interesting. And some of the things that we have advantages in the future that we should look at is like when do we do out of band reads? For example, one thing that comes up and does read is the tiered archiver where we're actually moving stuff to a different tier, and for that, we want to read in bigger batches.

Tim Berglund:

Ah, right. And those are older messages by definition.

Adithya Chandra:

Yeah, right. And those are older messages by definition.

Tim Berglund:

Apologies to the page cache.

Adithya Chandra:

So yeah, that is not going to heap. So if your goal is everything has to be in the page cache that's not going to happen, but that's where SSD help. So it's good for them to read. But the thing that we want to do is each would probably directly read from the disk. It doesn't have to come to the cache. Because once it writes to SSD, the other optimization we have is once it's written to an object store, if you read something from the object store, we service it directly. It doesn't go to the cache anymore because we think this is historical data, and then it may not be read again. So we can directly send it to the consumers that are requesting it. So that way it does not make you not use the cache for all these new writes and new reads that are happening inside the... so that is one.

Adithya Chandra:

And in terms of improvements, what we can do is when it starts moving, we can skip the cache. Another thing to keep in mind, this was very interesting, is how the page cache itself works, right? Like when those flusher threads come up and write to disk. The one thing we tried is we actually configured those flusher threads to come in very late so that you have as much, let's say you have a lot of writes that are in memory, and then they come up very late and they write. And then we notice this interesting thing. We actually got worse, more writes, more smaller writes with a large flusher tier in large, if they came up much later versus something that came up earlier, and this was very hard to understand so we had to look at a bunch of traces, et cetera, from the operating system.

Adithya Chandra:

But from what we understand of what is happening here is that if you wait too long there are all these other consumers. So you have to understand how the read cache made of the page cache works. It's basically LRU cache so these recently used, so whoever is reading, they go to the top of the cache so anything that was updated slightly, or it starts going to the end of it and you have a fixed amount of space. So when you run out of space, everything at the end gets thrown away. And in case it's not written, if it's a written write buffer, then it gets written to disk.

Adithya Chandra:

So now what happens is if you delay the flusher threads long enough, then the actual memory pressure kicks in, it goes to the end of the LRU and gets written to disk. So you get again, small writes that can actually get written. So the main advantage we got with the flusher thread was they would come up and write in these batches where they would take a bunch of guys and then put these sequential writes, instead of we made it large enough, then these writes would reach them the end of the LRU and then instead get written like in that past, that was also interesting-

Tim Berglund:

Do you still get a big gain on SSDs doing larger sequential writes? Does that matter as much?

Adithya Chandra:

No. Like it's not much larger sequential writes. For example, like on EBS, like the max for a hard disk is a megabyte in terms of writes, for an SSD is just 256K. But with 256K you do get a benefit in terms of... Another thing to keep in mind is that this is not just SSD, right? You also have these cloud provider things whether it's going over the network. So there are other things in between when a bigger batch or does have. So if you're applying, it's still faster in terms of 4k. And I think it's not just probably sequential, but in this case it's a single call and so that's what we've seen from our experiments that we could actually bring it down by like a factor of two or three by setting tuning this buffer size write.

Tim Berglund:

Oh, wow.

Adithya Chandra:

But they're very fast. So the final thing is these SSDs are so fast that it actually doesn't matter from where you sit, like where you were staying in most cases, it doesn't matter unless you have particular use cases where you have thousands of partitions, you have thousands or they're all writing different places, they're small, but it's still pretty hard to exhaust IOPS on some of these modern...

Tim Berglund:

Now, to some degree, performance optimization is everybody's problem, right? Whether you work in distributed systems or not, you're going to at some point, you're going to do performance optimization. Now, we always talk about doing that prematurely and how it's sort of fun and we easily get sucked into it even before it's economical useful to do, but let's just agree that it's everyone's task. How do you think, and I've heard you say all this so I think the answer is in the previous 10 minutes, but I wonder how you would describe the difference between performance optimization as a distributed systems engineer relative to something else like I know there's a lot of different kinds of things one can build that aren't distributed systems, but what makes this different in your mind?

Adithya Chandra:

Yeah, that's pretty interesting, right? The first thing that changes, like you said, you have multiple things, but let's say when we say distributed and it's something with the network and you have two different people who are communicating with each other. A lot of the things that we see actually comes from this particular thing. So the main thing that we've focused on is and the thing we see over and over again is that if you batch better, you get a good performance in terms of lower CPU utilization and higher throughput, but you get worse performance in terms of layout. And that is kind of what we are always fighting against. And that's where a lot of these improvements in our team that we're working on. We are also looking at replication improvement where if you have... Like, again, there's the system part, like you mentioned, if you have 1000 partitions that kind of can be done on another software that's working on a single like instance.

Adithya Chandra:

But you also have 1000 partitions. And then in Kafka, what we have is we have three replicas. And then if you have replicas, so that is like 3000 replicas and now that this particular day let's say 1000 readers, and then you have like 2000 other followers that have to replicate from this. And if these are all in say a different host, for example, let's say you have 100 broker clusters, then what'll happen is all the 99 will have to talk with this particular broker. And all 99 will have to replicate from this because that's how you're spreading. And then you'll have the smaller writes and you'll have a larger overhead versus like, for example, if you brought them closer, if you had these things distributed around, let's say just 20 brokers or 15 brokers now only 14 of them.

Adithya Chandra:

And if you have this single broker is getting like, say 10 megabytes per second, then it only gets divided among them. And we're just replicating like you're just doing I/O of like say a megabyte, probably even lower, but we can get much better performance. So these are the kinds of things that come up only in a distributed systems environment, because it's a question of how are you distributing work across all these guys? How are they talking to each other? Which parts are they? And the other thing is, this is, again, not even this in distributed systems, right? If you have 1000 different partitions and then two brokers are talking to each other, and then what they are looking at is there are a bunch of changes. And then the second broker gets all the updates on the first broker, and it has to let's say go through each one of them, it goes in a loop and it looks at, "Hey, do you have changes? Do you have changes? You have changes." And that starts spending CPU.

Adithya Chandra:

And what you can do better is if you can have a notification mechanism that this other broker knows, "Hey, these are the only ones." Most of the times we don't expect changes because we're constantly replicating. And these are the only partitions that have data then you can again do... It can be more efficient when you are exchanging or replicating. And this is another place which probably only comes up in this distributed system space. And there's also the whole load balancing question which is very interesting, like how do you balance across all these guys. The resources are the same, do balance for CPU, do you balance for memory, do you balance for network, do you balance...

Adithya Chandra:

And then you have these interesting like heuristics, basically you can look at number of partitions, number of readers and just do balancing on that so that you don't like it's a little confusing to look at all these others, if you just balance that, that access a pretty good proxy for all these other resources utilization that I spoke about. And this only comes up in distributed systems because you have these so many small ones and now you have this opportunity to assign work to each of them and you get to decide over time how things are changing. And then you have this load balancing opportunity. If it was a single instance, you wouldn't have it. And it's also opens up a bunch of things around sizing. It opens up a bunch of stuff around what kind of instances should you run? Right? Because we can like split up work in these different ways, we have the opportunity to either use small instances or large instances. And then we want to use the ones where we have perfect performance.

Adithya Chandra:

So if you can think about it, if you use a large instance depending on your workload, the advantages that you get is when you have very high throughput, certainly the number of other brokers that it needs to replicate from will be smaller because all of them are bigger. So you get this performance improvement in terms of better batching. But the big downside is now you have these large instances when you don't have traffic, you're losing a lot of money that you're leaving on the table, they're sitting around doing nothing. And you also, it may be harder to get them. And then if one broker one profit goes down, there's a bigger blast radius, probably.

Adithya Chandra:

So there's this other advantage of going with these smaller ones. And then from finance's perspective, you need to look at, hey, how small can you go? How wide can you go? And there are all these scalability bottlenecks that will start coming up in your metadata. If you go to really small centers, what happens when you go to 100 brokers? Where exactly do you hit bottlenecks? Should we go to 100 small brokers or should we go to 10 large brokers? That becomes a completely different way to look at it, which only comes up with distributed systems because now we have the ability to easily load balance. We have the ability to move things around. And so if we can start thinking about these things.

Adithya Chandra:

And another area that comes up is isolation, we want to have guaranteed performance. You want your P 99 latency, et cetera, to not be affected. Let's say you have some clients which misbehave a little bit with some code pod doesn't have that doesn't have the right quotas. And suddenly, it uses a lot more memory than what is expected and then it can fail. And when it fails, it can take down, it can affect performance for everyone else, right? So in addition to all these things, we can take decisions on how we would have distributed these in terms of topics or tenants whoever you want to look at it and that gives you the opportunity to have fault tolerance. If you have a large cluster of 100 brokers or larger, you can say, hey, these 10 are only for this kind of workload. So that if it goes down, I know these are the topics that we'll get checked and I'm sure that everything else would probably keep running pretty well even though this is kind of affected. So that's the other angle to look at so, yeah.

Tim Berglund:

So coming up against time here, I want to ask one more question. If you had one word of advice to give to an engineer who was not currently working in distributed systems, but was listening to this and thinking, well, there's a lot of performance optimization stuff there, I kind of know that, but there's these additional aspects of like the network and things that are interesting and just, it seems fun to some people, right? This is a really appealing kind of work. What would your advice be to somebody who is a software developer, who isn't working in distributed systems engineering right now, but wants to be. How do you get started? Short answer.

Adithya Chandra:

How do you get... I would say it depends on the area you're starting from. I would say a good starting point is start using these systems like Apache Kafka, et cetera, like install it, start using it. And also start from something very basic, don't worry. Like performance optimization probably happens a little later, but just to play around with, you can start different processes, figured out how they can communicate with each other. There are standard things that you can use, let's say GRPC, or you could make SATP calls, set that up, look at how things would talk to each other. And then this basic thing around how things talk to each other, even if you don't go to all the levels of details around like TCP, and then where does time get spent? Even just having a high level understanding of making calls with two different processes that can run at the same time.

Adithya Chandra:

And then now it's very easy to spin up VMs and to spin up things on different computers. So then if you can actually bring these up on different computers, you'll go through all the basic things that come up over and over again, which is basically discovery. How do these two guys talk to each other? How do they know where they are? And then if you are interested in performance then you can start asking if I send a single request, how much time is it granted? And if I say keep that where do I put those instances? If I put them close to each other in the same building, what kind of latency do I expect? If I put them far away and you can do this with the cloud providers today because they allow you different AZs, different zones, different regions, and then you start seeing, hey, this is what I expect.

Adithya Chandra:

And then you get a feel for what is happening. And when you don't see that, when you don't, when you say like you know that an instruction, a computer it's a two gigahertz processor takes so much time. And then you'll start seeing patterns around, hey, this doesn't seem right. This should be taking like a millisecond, it's taking like 10, 20, 100 milliseconds then there must be an opportunity here, where this time going? And I think that's a pretty good way to get started, get excited about it. And you would start from the basics, right? You would start from the fundamentals. There's a lot of details and complexity as you start looking at it, but fundamentally it's pretty simple. Like it's like two people talking and you kind of understand, hey, these are the guys, this is what you must be doing need for latency you should be able to process messages when that person's messages come, otherwise it's going to be slow.

Adithya Chandra:

So yeah, Building something like very basic exchanged some basic messages, then you start to develop a pretty good understanding of whether you like this field, why it's exciting, which are the parts things can go wrong. And then there's also failure. So we do a bunch of things around things going wrong. This is the next level where you can inject issues. What happens is there's some issue with these two. You can talk what happens there's suddenly a bunch of messages. What happens there's a large message, small message.

Adithya Chandra:

So yeah, with a very small set up, you can start playing with all these different things to start building your intuition around what is happening and what is possible today. So you can work with hardware that is possible today. And as new hardware and new things come up, all of this will change. So then that will give you a feel for how things will change over time. As something might become faster or something might become cheaper. I guess, as these get cheaper memory, doesn't get so cheap. So you will shrink memory and you'll get you'd pay a little more for SSDs. And that gives you a big benefit in cost, but tomorrow memory might become cheaper or something else might change. And Yeah, I think that's how it'll evolve over time.

Tim Berglund:

My guest today has been Adithya Chandra. Adithya, thanks for being a part of Streaming Audio.

Adithya Chandra:

Excellent. Thanks a lot, was great to be here.

Tim Berglund:

And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST—that's 60PDCAST—to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes. If you'd like to sign up and while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So thanks for your support, and we'll see you next time.

Focused on optimizing Apache Kafka® performance with maximized efficiency, Confluent’s Product Infrastructure team has been actively exploring opportunities for scaling out Kafka clusters. They are able to run Kafka workloads with half the typical memory usage while saving infrastructure costs, which they have tested and now safely rolled out across Confluent Cloud. 

After spending seven years at Amazon Web Services (AWS) working on search services and Amazon Aurora as a software engineer, Adithya Chandra decided to apply his expertise in cluster management, load balancing, elasticity, and performance of search and storage clusters to the Confluent team.

Last year, Confluent shipped Tiered Storage, which moves eligible data to remote storage from a Kafka broker. As most of the data moves to remote storage, we can upgrade to better storage volumes backed by solid-state drives (SSDs). SSDs are capable of higher throughput compared to hard disk drives (HDDs), capable of fast, random IO, yet more expensive per provisioned gigabyte. Given that SSDs are useful at random IO and can support higher throughput, Confluent started investigating whether it was possible to run Kafka with lesser RAM, which is comparatively much more expensive per gigabyte compared to SSD. Instance types in the cloud had the same CPU but half the memory was 20% cheaper.

In this episode, Adithya covers how to run Kafka more efficiently on Confluent Cloud and dives into the following:

  • Memory allocation on an instance running Kafka
  • What is a JVM heap? Why should it be sized? How much is enough? What are the downsides of a small heap?
  • Memory usage of Datadog, Kubernetes, and other processes, and allocating memory correctly
  • What is the ideal page cache size? What is a page cache used for? Are there any parameters that can be tuned? How does Kafka use the page cache?
  • Testing via the simulation of a variety of workloads using Trogdor
  • High-throughput, high-connection, and high-partition tests and their results
  • Available cloud hardware and finding the best fit, including choosing the number of instance types, migrating from one instance to another, and using nodepools to migrate brokers safely, one by one
  • What do you do when your preferred hardware is not available? Can you run hybrid Kafka clusters if the preferred instance is not widely available?
  • Building infrastructure that allows you to perform testing easily and that can support newer hardware faster (ARM processors, SSDs, etc.)

Continue Listening

Episode 161June 8, 2021 | 32 min

Adopting OpenTelemetry in Confluent and Beyond ft. Xavier Léauté

Collecting internal, operational telemetry from Confluent Cloud services and thousands of clusters is no small feat. Traditionally, this data needs to be collected in multiple ways to satisfy all the different requirements. However, this sometimes leads to discrepancies between various systems. With OpenTelemetry, we can collect data in a vendor-agnostic way. Many vendors already integrate with OpenTelemetry, which gives us the flexibility to try out different observability solutions with minimal effort, without the need to rewrite applications or deploy new agents.

Episode 162June 10, 2021 | 9 min

Confluent Platform 6.2 | What’s New in This Release + Updates

Based on Apache Kafka® 2.8, Confluent Platform 6.2 introduces Health+, which offers intelligent alerting, cloud-based monitoring tools, and accelerated support so that you can get notified of potential issues before they manifest as critical problems that lead to downtime and business disruption.

Episode 163June 15, 2021 | 25 min

Boosting Security for Apache Kafka with Confluent Cloud Private Link ft. Dan LaMotte

Enabling private links on the cloud is increasingly important for security across networks and even the reliability of stream processing. Staff Software Engineer II Dan LaMotte and his team focus on making secure connections for customers to utilize Confluent Cloud. With the option of private links, you can now also build microservices that use new functionality that wasn’t available in the past. You no longer need to segment your workflow, thanks to completely secure connections between teams that are otherwise disconnected from one another.

Got questions?

If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.

Email Us

Never miss an episode!

Confluent Cloud is a fully managed Apache Kafka service available on all three major clouds. Try it for free today.

Try it for free