Hello, you're listening to Streaming Audio. In this week's episode, I sat down with Nikoleta Verbeck to talk about performance and tuning, monitoring, and how Kafka works under the hood to run efficiently. All those things that either you know before you deploy, or you're forced to learn really quickly after you've deployed. And frankly, we couldn't have gotten a better guest for this. Nikoleta is absolutely amazing. She's a gold mine of information, and I just got to spend this episode hitting her with an ax and collecting valuable nuggets.
That's probably not the best metaphor I could have gone with, but you get my point, right? This is one episode where I can guarantee you you're going to learn something. And you're going to learn something you didn't know, and you're going to learn something you didn't know you needed to know. Let's just skip my usual introduction and dive straight in. Streaming Audio is brought to you by Confluent Developer. And I will tell you all about that at the end. Grab your notebooks folks because here we go.
My guest today is Nikoleta Verbeck, who is a principal solutions architect at Confluent. Nikoleta, welcome to the show.
Hey, nice to be here.
Good to have you. My first question is what's a principal solutions architect? What do you actually do?
Yeah. Our professional services team kind of handles any troubleshooting, any problem solving, any architecture design, helping you devise a solution to the use cases you have. And so that's kind of what I do all day: help customers realize their use of Kafka and event streaming.
Is that mostly troubleshooting, or figuring out how they're going to tackle new projects? Is it a mix?
A little of both actually. We spend a lot of time problem-solving issues they're currently having, maybe they've grown bigger and now they're trying to figure out how to scale where they've already built or this is their first time with Kafka and they want to do it right, out of the box. And so we'll help them all along the way.
Okay. You must have some stories.
I have quite a few.
Okay. Well, we're going to get into those, but so we got in touch and said, do you have any suggestions for how to run Kafka properly? Right. And [inaudible 00:02:32]. And wow, you have suggestions. You've sent us a laundry list. We're going to go through this and I know I'm going to learn a lot, so I'm sure everyone else will. Let me pick one at random. Let's start with producers. One of the things you said is, "It's a problem to use multiple producers in a single service," which took me by surprise.
Yeah. Yeah. This tends to be a lot of people migrating to Kafka from things like RabbitMQ or a lot of your traditional message queue systems, where they push a model of just creating new clients for every single topic or queue that you want to write to. While in Kafka, our Kafka producer is a thread-safe producer, especially in the Java realm, and it's actually able to talk to many topics with one instance. We treat the topic as metadata. We treat the partition as the actual shard of the data set. And we do a lot of interesting things behind the scenes to improve that throughput. The biggest of those is this notion of batching. We'll actually batch multiple records together as they target a particular topic and partition.
But we're also able to batch those batches up together as they target a particular broker in the cluster. If we're able to share topics, whether it's a single topic or multiple topics, we can put them together and get a little bit better throughput, because now we're bundling every partition that was on that node together, targeting that particular broker, sending it across, and we're reducing our broker workload. Because a typical thing most people think about is records per second. Well, the broker doesn't measure its performance in that; it measures its performance in requests per second, not records. And so if we can get more batching in each of those two different segments, the batch of records per partition and the batch of records per node, we get higher throughput, we get fewer requests per second, and we get less load on the broker to perform all that.
Right. And you don't normally have that much control from a programmer's perspective over which broker you're connected to. Right?
Not typically. You can kind of manipulate where partitions live and try to isolate common partitions and topics together on common nodes to help that. You can also override your own partitioner if you choose, but there's not too much control. At least having them together saves you a ton of performance potential.
Is this fair: rather than trying to control which broker you're connecting to, you're better off having a single producer that's going to manage that for you?
Yeah. Yeah. Yeah. Because now we're aggregating topics, we're aggregating partitions, we're aggregating connections. I mean, you look at Confluent Cloud and what we're measuring you on, your performance and CKU values, and that's connections, that's bandwidth, that's requests per second load on the backside and stuff. This helps minimize that cost margin and increase your performance capabilities.
See, I'm often interested in things like throughput for real-time systems. Batching is great, but sorry, what about latency? That's what I'm trying to get to: what's the trade-off between throughput and latency with batching? How do you decide what to do there?
Yeah. That comes down to a few adjustments. A lot of the time, it's adjusting your linger.ms and your size boundaries, because in all things Kafka, we typically have the two boundaries: time and size. You're manipulating those, but there's always a base cost. It's going to cost me X amount of time, because typically we think of latency as the time to send a message from your producer to your broker and your broker to your consumer, and that's just going to have a flat cost.
And so it's finding what that cost is and adjusting those thresholds to meet it either on par or halfway, because we are asynchronous, so we can have more requests in flight. It's adjusting that to where we have enough requests in flight, but not too many at the same time, and we're taking advantage of the fact that it's going to cost us, well, 30 milliseconds. We're now able to say, "Well, if my baseline is 30 milliseconds, maybe I set my linger at 15. Maybe I adjust for that aspect." And now-
I think we should pause there. You should explain to me and all of us what linger.ms is, because not everyone's going to know that.
Yeah. Linger.ms is one of the primary Kafka producer settings, and it controls the time boundary of batching. The Kafka producer always attempts to batch records by topic and partition before sending them off. We have the size boundary, but linger.ms tends to be the time boundary, and most people run into it first because, again, they're thinking of latency and time: let's adjust our time. But it controls the notion of, when we open a new batch, how long am I willing to leave that batch open before I consider it closed and ready to send?
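The two batching boundaries Nikoleta describes can be sketched as a simple decision rule. This is illustrative only, not the real client's sender loop; the defaults below mirror the actual producer settings (linger.ms defaults to 0, batch.size to 16384 bytes), but the function itself is a stand-in.

```python
# Sketch of the producer's batch-close decision (illustrative, not the
# actual Kafka client code). A batch is sent when it fills up to
# batch.size OR the linger.ms timer expires, whichever comes first.

def batch_ready(batch_bytes, batch_opened_at_ms, now_ms,
                linger_ms=15, batch_size=16_384):
    size_full = batch_bytes >= batch_size
    linger_expired = (now_ms - batch_opened_at_ms) >= linger_ms
    return size_full or linger_expired

# With linger.ms=0 (the default), the batch is always "ready", so records
# effectively go out as fast as the sender can form requests:
print(batch_ready(batch_bytes=100, batch_opened_at_ms=0, now_ms=0, linger_ms=0))  # True
# With linger.ms=15, a small batch waits for more records to arrive:
print(batch_ready(batch_bytes=100, batch_opened_at_ms=0, now_ms=5))   # False
print(batch_ready(batch_bytes=100, batch_opened_at_ms=0, now_ms=20))  # True
```

A full batch never waits, which is why raising linger.ms costs latency only when traffic is too light to fill batches anyway.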
Right. Say I decide I'm going to set linger.ms to the very minimum, and that way it will send every record immediately, which sounds great. Is it a bad idea?
Yeah. It can be. I mean, there are some use cases out there where doing that is perfectly fine. Typically those are seen in our IoT solutions, things like that, where you're trying to get the minimal latency possible. You're willing to have massive Kafka clusters to do so, even though you might not be sending a whole lot of data in terms of size, but you want that minimal latency, and you don't care about ordering or a number of other things you give up to get that latency. But in 90% of the cases out there, you're going to want to set that linger. You're going to want to balance it with your cost.
What's it going to cost me to send across my network, things like that? And get that batching, because that batching is going to improve the performance and workload on your brokers. Not only is it going to help the producer, it's going to help the load the brokers are carrying, because now I'm getting more records into a single batch, and that single batch into a request, or multiple batches into a request to that broker. And now I'm not sending one request for every single record, which is the worst case for the broker.
Yeah. Right. That's almost analogous to the transactional overhead that you're paying for each batch. Okay. Okay.
That makes sense to me. Before we get off that, are there any signs I should be looking for that I've set linger.ms badly? What's the symptom?
Usually the symptom, a lot of the time, is, kind of to the point earlier, that a lot of people don't even know linger.ms exists to begin with. We're used to plugging in the Kafka client and just rolling with it, and typically out of the box it's fine; a lot of use cases just run on the defaults just fine. But if you're still running on the defaults, you probably need to go and evaluate it from that perspective. That's usually our first look: let's go look at these properties you're setting, are you adjusting them? The other is looking at the metrics the producer reports, and this is one of the big things too: make sure you're monitoring everything. Kafka, in our Java realm, kicks out a ton of metrics into MBeans and JMX, same with the brokers.
And some of those metrics are things like your records per request in terms of counts. You can back that out to a one-minute rate and stuff like that, and actually see: am I getting a good records-to-requests ratio? Am I doing well there? Because that boils down to really looking at your brokers, and primarily at two metrics that come out of there. One is the request handler idle thread percentage, and the other is the request handler queue, right? Because these are the threads, and this is the queue, that handle the requests that the producers and consumers and admin clients and everything talking to Kafka have to pass through. And if that queue is staying full and those threads aren't very idle, that means you're potentially overworking your brokers, and let's see if we can tune things and get more out of those brokers at their current size.
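The arithmetic behind that records-to-requests ratio is worth seeing once. A rough sketch, with made-up example numbers rather than anything measured from a real cluster:

```python
import math

# Back-of-the-envelope view of why batching lowers broker load.
# The real numbers come from the client and broker JMX metrics;
# these figures are purely illustrative.

def requests_per_second(records_per_sec, avg_records_per_request):
    return math.ceil(records_per_sec / avg_records_per_request)

# 50,000 records/sec with no batching means 50,000 requests/sec
# hammering the request handler threads...
print(requests_per_second(50_000, 1))    # 50000
# ...but at ~200 records per request, the brokers see only 250/sec.
print(requests_per_second(50_000, 200))  # 250
```

Same record throughput, two hundred times fewer requests, which is exactly the quantity the request handler pool is sized around.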
My first port of call would be to do larger batches so that the queue deals with more in each step. Is that right? Okay.
Exactly. Yeah. That's usually the biggest starting area: let's get the batches better and keep adjusting. Over time we might figure out what a good batch size is at the time, but a year from now maybe you've doubled or tripled or quadrupled the number of records you're trying to produce, so let's reevaluate it. It's something that's never going to stop changing, and you should keep looking at it over time.
I mean, it seems like one of those things where, in a lot of cases, people writing software that produces to Kafka don't have a natural batching in their domain. Do you think-
Not typically. A lot of them tend not to think about the batching; they're just trying to get the records across. They might be thinking of it in terms of an order by the key, things like that. So it's kind of an abstracted-away thing that we put into these smart Kafka producers, and it's just a matter of adjusting and tuning for it.
Yeah. So you got to start thinking of it as you grow, is that fair to say?
Well, that's fair. I expect to learn more about the internals of a system as I grow with it.
Yeah. That's what makes learning new tech fun: you start out with it, you get to start playing with it. And then as you start getting more workloads on it, you get to discover all those little intricacies that you didn't expect to discover when you first started out. How does this work under the covers? I want to go adjust the engine and play. It's kind of the heart of being an engineer: getting to play and learn and experiment.
Yeah. You've always got to know one step below the covers to do a really good job I find. Yeah. Okay. I'm going back to your laundry list of things that can go wrong. And so you said not enabling compression, which sounds fair to me, but where does compression matter? And there are choices of compression. Which kind of compression do you enable?
Yeah. Kind of to that earlier point, it's one of those settings that out of the box is not enabled. We don't have compression on by default, but we do offer four different kinds of compression, and it's something you should turn on. That compression happens at the batching layer: all those records that you get into that batch targeting that partition, we're able to compress down at that time, before they're transmitted across the wire. That alone reduces our network overhead, our network consumption, which is important, especially in clouds where you pay for that network, for how many bytes you're sending across. But there's a counterpart that goes unnoticed by a lot of people, which is the consumer. The producer's compressing it, but the consumer is the one who actually gets to deflate it. Now not only am I saving bandwidth going into Kafka, I'm also saving bandwidth coming out of Kafka.
I'm also saving my disk consumption. Now I'm able to fit more into a smaller space on the disks of every broker. And I'm able to handle things a little smoother if you've got SSL turned on. We all know Kafka and its zero-copy capabilities; well, once TLS is there, zero-copy kind of goes out the window. Compression's also going to help there: that's less that I have to copy, less that I have to move, less that I have to stream to disk. So you get some performance gains there. But when setting it, there are some trade-offs. We're going to have CPU cost. The producer's going to take the brunt of the compression cost, and the consumer's going to take the brunt of the decompression cost. And that's why we offer the options we do for compression: what sacrifice are you willing to make?
And we have things like LZ4, which gives you really fast compression at a reduced CPU cost. Then you have all the way up to something like gzip, which is going to get you the best compression you can, but with a lot more CPU cost. And then there are some of the newer ones: with a lot of the newer Kafka clients, you now have access to zstd, which is kind of a balancing act between those two. You're figuring out, well, I do want the utmost compression I can get, but I'm willing to give a little more CPU to my producer to account for that. And in the grand scheme of things, I can typically scale producers a lot cheaper than I can scale my brokers, so that's something to keep in mind while you're thinking through those cost measurements.
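One reason compressing at the batch layer pays off so well is that similar records share a lot of structure, and a codec working over the whole batch can exploit that. A quick stdlib sketch, using zlib as a stand-in for Kafka's actual codecs (gzip, snappy, lz4, zstd) and made-up JSON records:

```python
import json
import zlib

# 100 structurally similar records, like a stream of page-view events.
records = [
    json.dumps({"user_id": i, "event": "page_view", "page": "/home"}).encode()
    for i in range(100)
]

raw = sum(len(r) for r in records)

# Compressing each record on its own: the codec never sees the
# redundancy *between* records, and pays per-message header overhead.
per_record = sum(len(zlib.compress(r)) for r in records)

# Compressing the whole batch at once, the way the producer does it:
whole_batch = len(zlib.compress(b"".join(records)))

print(raw, per_record, whole_batch)
assert whole_batch < per_record  # batch-level compression wins
```

The exact ratios depend on the data, but for repetitive event streams the batch-level number is usually dramatically smaller, which is bandwidth and disk saved on every hop.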
You are offloading that compute cost to the producer side and the consumer side?
Yeah. And I can see how that would free up the broker. So that brings up two questions, but let's start by clarifying the first one. When you enable compression, you are compressing a batch of records at the producer side, the compressed data goes over the wire, and it's actually stored as a compressed batch of records on the topic.
And it will be shipped back out as the same batch and not actually decompress till it reaches the consumer side?
Correct. Yeah. The producer's going to put the whole batch, or attempt to put the whole batch, in the request, depending again on size boundary constraints. That's going to get it over there, and we're going to write it to disk. Part of that payload is a bunch of metadata: I'm going to have my compressed records, but then I have metadata that says, "Okay, here's the number of records that are in this batch. Here's any extra information I need to know about it, the size, things like that." That gets onto the broker, and that's how the broker knows how to handle it on the consumption side.
And on the consumption side, that batch might come back in one fetch request, or it might take two fetch requests. Again, that all depends on size tunings. Those are some things you can start messing with when you get into deep performance tuning: "Well, I know my average batch size is this, so let me go adjust my fetch size on the other side to account for that and try to minimize splitting." But that's deep performance work, really trying to squeak out as much as possible.
That sounds very deep in the weeds. Just to be clear, this isn't a day one concern, right?
This is, I've bought million dollar hardware and I really want to see where I can take it.
Right. Yeah. Because you get to work with clients of all those sizes?
Oh yeah. I mean-
Which must be fun.
... between customers that are running in cloud and customers that are running on hardware with very expensive RAID arrays, or going straight JBOD on NVMe drives. There's all kinds out there.
I can believe it. The other question that came to mind from that is, actually, two questions. Oh, God, you're making me think of so many questions. That's my job, though; I'm allowed to ask you as many questions as I like. The first one is: if you are compressing on the producer side, and maybe you don't know the answer to this, but there are plans to have schema validation from Schema Registry on the broker side, right?
If it's compressed, how's that going to play in?
It gets a little more complicated in those. The broker, for the most part, is scanning through those records. For those that don't know how Schema Registry works: one of the things we put in there is this magic byte. It's attached to every single record at the beginning of its serialized data set. And so we're able to scan through that real quickly, because part of that metadata is the offsets and where things are at. So we're actually able to pick that up, get that metadata, and go, "Oh, this is its ID.
I know that, based on this topic, its subject naming strategy is this." Now I can go look that up and evaluate: okay, is this schema ID a correct and valid schema ID for this topic, in its order and so on? Because in Schema Registry, a subject contains many schemas as the version history grows, and the schema ID is what we put in the metadata alongside that magic byte. I'm sure there are a lot more details I'm skimming over and simplifying on the broker side, but in layman's terms, that's what's happening.
Okay. We'll also pull people in from the schema registry team for the podcast at some point. But the other thing I was going to ask you, and this is really in the weeds, but I'm going to ask you anyway. Because you've got an overhead, you've got computation overhead for compression, you've also got one for encryption. Right?
Do you know what the balance of that tends to be?
It's one of those where your mileage is going to vary, and, to my earlier point, you really want to monitor everything. Even if you think you don't need it right now, you'll need it eventually. And it's better to have it now and be able to see that history when it comes up. Otherwise it's going to be, "Well, now I'm scrambling. I'm trying to get some stuff out. I didn't monitor this, now I've got to go get it in there, and I don't have any history; I just have now." That's one of the things that's going to help you with that balancing act of, "When I turn these things on, how did my CPU profiles change? How did my load profile change? Do I need to add more nodes to my producer pool or my consumer pool?" You won't know if you don't measure it.
Yeah. Yeah. I've definitely been in that situation where you think you've got a performance problem and you think, ah, these look like things I should monitor, they're important, I'll start monitoring those. And they're no help at all for a couple of weeks, until you've actually got some data in.
Oh yeah. Yeah. To that point, if you ever talk to any of our people at Confluent, we do have sample dashboards for Grafana and things like that, and Prometheus configurations and YAML configs that come with our Ansible, that can drastically help get you started with best practices on the monitoring aspect with minimal effort.
Yep. Yeah. There's probably something out there for whatever tool stack you use to monitor at the moment, right?
We're trying to build out quite a bit. I know we've built a lot around Prometheus and Grafana, because those are kind of the de facto standard anymore, and then there's your Elasticsearch and your Splunks and AppDynamics and New Relic and Datadog. So there are dashboards for a lot of them.
Yeah. Cool. Another one from your producer list, which, like compression, is a "yeah, of course," but give me the details, Nikoleta: not defining a producer callback. Because when I produce the record... Explain what it is first for the people that don't know, and then tell me how you should do it properly.
Yeah. When you use the Kafka producer, you can send just your Kafka record in it. That's typically where most people start, that's your key value, headers, timestamp, all that kind of stuff. But there's a second attribute you can optionally give. And that is the producer callback. And while the Kafka producer is going to try to get that record over there, as best as it can, it's going to use all of its little retry policies and attempts and stuff that it has baked into itself. There's inevitably sometimes where that's just not going to happen. Whether that is, brokers are down for a long period of time and the retries are exhausted or you get into this concept of back pressure. And it's not something most people get into right off the bat. And that's, you need to have a way to handle that.
And that's this callback. That callback allows you to subscribe to your record and know: was it successful or was it not? You get that through a typical callback structure; in Java it's an interface definition with a single function that gets back the record you produced with some extra metadata, like which partition it actually wrote to, which node, things like that. But it also comes back with a potential, optional exception, and you want to evaluate that exception looking for: well, what happened? This can come into play if you've set your linger too low or you've lowered your retries too much, and you're actually erroring out. Because if I don't capture that, well, then I don't know my record didn't actually make it.
I assumed it did, but I didn't know. I exhausted retries. The producer tried to tell me, but I gave it nothing to be able to tell me with.
Again, we need to get away from ship and hope.
Yeah, exactly. But it's also a way to get that signal back from the broker side that says, "Hey, you're sending me too much. Everybody's sending me too much. I need to push back a little bit." The request queue's full, the request handlers are doing as much work as they can take on, maybe the network threads are maxed out; something on the brokers is slow. And that callback gives you that extra signal on top of just failure: "Hey, you timed out. We timed out for whatever reason. I need you to pause and handle it gracefully on your side."
Whether that's maybe auto-scaling your cluster out a little bit further to handle this pause in the cycle a little more, or pushing further upstream from your application and saying, "Hey, can you slow it down a little bit? You're spamming the system." It's a good measure to make sure you're doing both: capture that exception and make sure you're retrying. And maybe the error is, to your earlier point, that the schema you are using is wrong and actually not valid, and we are trying to tell you that this record will never get produced because it is invalid.
And you want to capture that and be able to handle it, whether that's through a policy of a dead letter queue, which, for those that don't know, means sending the record off to a secondary topic to be evaluated at a later time, whether through an automated system or by feeding it into an Elasticsearch log somewhere to search against. Or just saying, "Hey, I got this, I need to back pressure." So there's a lot that can be captured out of that callback.
Yes. Yeah. I've definitely got it in my mind for error handling, but I'm definitely guilty of not considering it for things like back pressure.
Yeah. Back pressure is usually a concept that goes overlooked, or just isn't heard of or known until it becomes an issue. It is something I would encourage everyone to learn and understand, because as you grow, it is a concept you will run into, and you'll run into it a lot in distributed systems.
Yeah. Yeah. Do you have any tips? Because it feels like, you produce a record, you send it off, and eventually, at some point, asynchronously, you get this thing back saying, "Whoa, slow down." And it often doesn't feel like a natural thing to put in the programming model, to say, "I'm going to slow down on the new stuff because the old thing told me to slow down." Do you have any tips on how to actually program that?
It depends on the use case. I've seen certain cases where they absolutely cannot pause the upstream for whatever reason, maybe SLAs to third-party customers of the Kafka cluster or whatnot. And so they'll just start buffering to disk: you use something to buffer it to disk and hold it there until things catch back up. Or, if this is happening frequently, this is probably a good case for, let's adjust our linger, let's increase that batching, let's try to get more out of it so we handle it a little more gracefully.
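The buffer-and-drain pattern is simple enough to sketch. This is an illustration, not a library: `FullError` stands in for the client signaling that its in-flight buffer is full (the confluent-kafka Python client raises `BufferError` for this), and the spill store here is an in-memory deque where a real system would use a disk-backed log:

```python
import collections

class FullError(Exception):
    """Stand-in for the producer rejecting a record because it's backed up."""

def produce_with_spill(producer, record, spill):
    try:
        producer(record)
    except FullError:
        spill.append(record)  # hold it locally; drain later when things catch up

def drain_spill(producer, spill):
    while spill:
        try:
            producer(spill[0])
        except FullError:
            return            # still backed up; try again later
        spill.popleft()

# Demo with a fake producer that is initially saturated:
sent, spill = [], collections.deque()
full = True
def producer(rec):
    if full:
        raise FullError()
    sent.append(rec)

produce_with_spill(producer, "a", spill)  # cluster busy: "a" is spilled
full = False
produce_with_spill(producer, "b", spill)  # goes straight through
drain_spill(producer, spill)              # backlog flushed afterwards
print(sent, list(spill))                  # ['b', 'a'] []
```

Note the trade-off visible in the output: "b" lands before the spilled "a", so this pattern sacrifices ordering, which is exactly why the spill-cluster variant described next aggregates "on the backside" rather than pretending nothing happened.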
I've also seen in instances where you absolutely can't even do the buffer to disk you need guarantees. They'll just have a secondary Kafka cluster that's available off to the side and that's the spill cluster effectively.
My first one's overtaken, let me spill over to this secondary cluster. I'll start writing it there and we'll aggregate it on the backside when it comes to it.
And this happens with things like launch events, think video games launching. I need this guarantee: I've got my Kafka cluster, and while we can scale that up reasonably fast, it still takes time. I'm going to pay the cost; let me have a secondary cluster off to the side, just in case, so you have this spillover. Because you don't know. Sometimes your launch event goes really well. Sometimes it doesn't go as well as you thought it would. And sometimes it goes-
I've definitely seen that in the world.
Yeah. Sometimes it goes beyond what you could even measure.
Yeah. And that's a nice problem to have, but you still want to solve it, right?
Yeah. Okay. So to summarize that then you are saying, you might have a programmatic solution to back pressure, but at the very least you want to be aware that it's a requirement to tune for?
Yeah. Okay. Let's move on in your... We should publish this list as a handy reference guide to people. Maybe we'll do that in the show notes, but moving on, on your list to consumer side, right?
And it feels like this is going to be a parallel to the producer side. You've said to me, a problem with using one consumer per thread in a service.
Yeah, it's kind of the same as on the producer side. I mean, we try to take advantage of those fetches, we try to take advantage of in-memory buffers, things like that. And while the consumer itself, out of the box, isn't a thread-safe thing, a multi-threaded model around it is doable, and that's usually where we encourage it. A lot of the time, maybe the evaluation of a single record takes a while. Your first inclination, and it's kind of always been the push in Kafka, is, "Well, let me just add more partitions, let me add more consumers, and I'll just scale it that way." But you might still be single-threaded microservices, things like that. That's not always the best approach, because each partition has a cost.
Each consumer has a cost too, and you have fault tolerances to deal with and such. So moving to a multi-threaded model can help at a certain point. You don't want too many consumers or too many partitions out there, because it's going to affect some things, and I'll get into that in a second. But you can move to a model where I have one consumer, and I thread that off: records come out of the consumer into a thread pool, and I manage that. This usually means you're starting to get a little more into the weeds of the Kafka consumer and how you tune and manipulate it.
You're also going to have to start managing your own offsets at this point, because now you've got to do thread coordination. It's no longer safe to say, "Well, if I got to my next poll, that most likely means I can commit all the records that were just handled," because out of the box, that's what the consumer assumes. You're going to need to start managing your own offsets and keeping track of that. And when you do do that, make sure you start using the rebalance callback, which is on my list of other things.
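Managing your own offsets with a thread pool usually comes down to one rule: you can only commit an offset once every record below it has finished, even if later records complete first. A sketch of that bookkeeping (this is an illustration of the idea, not code from any Kafka client; a real implementation would hook `safe_commit()` into `commit()` and reset state in the rebalance listener):

```python
import heapq

class OffsetTracker:
    """Track out-of-order completions for one partition and report
    the highest offset that is safe to commit."""

    def __init__(self, start_offset=0):
        self.next_needed = start_offset  # lowest offset not yet completed
        self._done = []                  # completed offsets waiting on a gap

    def complete(self, offset):
        heapq.heappush(self._done, offset)
        # Advance past any contiguous run of completed offsets.
        while self._done and self._done[0] == self.next_needed:
            heapq.heappop(self._done)
            self.next_needed += 1

    def safe_commit(self):
        # Kafka commits "next offset to read", so this is next_needed.
        return self.next_needed

t = OffsetTracker()
t.complete(1)           # a worker finished offset 1 before offset 0
print(t.safe_commit())  # 0 -- can't commit past the gap at offset 0
t.complete(0)
print(t.safe_commit())  # 2 -- offsets 0 and 1 are both done now
```

Committing `next_needed` means a crash replays at most the in-flight records rather than losing the ones that completed out of order.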
Yeah. But you can choose if it's time to go into that right now or push on?
We'll come back to that one.
But kind of to the earlier point there, when I get too many consumers, I start getting to a point where each consumer's going to require at least one partition to get its parallel share of the work. And now I get to a point where my fault vector grows. I've got a lot more services out there, and if one of them fails, I've got to rebalance. Well, now my consumer group is so big that the rebalance can take a little longer. I've got to shuffle between all of these nodes, hundreds of consumers potentially, and that's going to take some time. So my time to recover is a little more delayed, and that's going to affect my guarantees, latency guarantees, things like that.
It's also one of the things that goes overlooked: now I've got more partitions, and that can actually affect the producer. To some of those earlier things we talked about, batching is a very big key to performance in Kafka. Well, now if I have more partitions, I'm going to have more of this concept of what I call spray. I'm going to have to spread records across more brokers, across more partitions, and now I'm reducing my potential for a bigger batch.
Right. Yeah. Yeah. I've always gotten why you want more partitions, but now I start to see why you want fewer as well.
Yeah. Yeah. It's a balancing act. I mean a lot of what people think about on this is, "Oh, just more partitions I'll scale up." And that's kind of what everybody gets out of the box, but that's not necessarily the case. It's something to think about. Don't just always add more partitions. Think about it a little bit. Is there a better way to go about this that isn't just adding more partitions? While it's the simplest answer, it can be a costly answer.
Right? Yeah. And then it's not easy to change that number down the line.
No. Yeah. Shrinking partition counts is a little harder than growing partition counts.
Do you have any tips for that, if you have to?
Not really. It's a-
It's that bad?
It's a painful experience to begin with.
Because usually you're talking about new topics, different partition counts, possibly the use of a replicator or something to migrate data if you need to, it's not an easy endeavor to go down at the moment. Hopefully we will get to a point where we're able to do that for you, but that's not available at the moment.
Feels like there's a technical reason why it's hard, but there's also, you're changing the semantics of ordering a bit?
Yeah. I mean, you've got-
That's always going to be with us?
Yeah. You've got the ordering faults, like, "Well, do I repartition all this data that was in these partitions I'm trying to effectively delete, if I want to preserve that data?" Or the adverse effect: for those folks that care about ordering and stuff like that, adding partitions is a big event potentially. And a lot of times you need to time it for a time of day to do it, because you need to guarantee that order somehow. When I add that partition, my ordering's going to be out of whack for a little bit while all those partitions and tables realign.
Okay. The best advice is to pick the right number first time as always?
Yeah. And that's kind of one of the things I kind of start out with when people ask about that is, in your development phase, in your staging phase, well, let's figure out what your unit of scale is first. How much is one consumer going to be able to do? See what that looks like. Is that X number of records per second? What's that look like? When I need to add that next consumer, what's that look like? What's that threshold? And go, "Okay, well, so I know to date, I'm going to be dealing with this much data. I know it's going to be like this. That means I'm going to start out with this many consumers. I know in a quarter, two quarters, three quarters, ideally a year out that my growth trajectory's going to look like this. And now that I know what my unit of scale is, I know how many consumers I'm possibly going to have in a year's time or half year's time, depending on how far you can forecast out there."
That's going to give you a good idea of how many partitions should I start out with today? If I know in a year's time I'm going to go from 20 consumers to 100, I should probably start out with 100, let's minimize those events of repartitioning and ordering adjustments and stuff to once a year.
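That sizing exercise can be sketched in a few lines (illustrative Python with made-up numbers, not a tool Confluent ships): measure your unit of scale, apply your growth forecast, and derive a starting partition count.

```python
import math

def partitions_needed(target_records_per_sec: float,
                      per_consumer_records_per_sec: float,
                      growth_multiplier: float = 1.0) -> int:
    """Partitions to provision today, sized for the forecast horizon.

    per_consumer_records_per_sec is the measured "unit of scale": how much
    one consumer handled in your staging baseline.
    """
    return math.ceil(target_records_per_sec * growth_multiplier
                     / per_consumer_records_per_sec)

# Baseline says one consumer handles 5,000 records/sec; today's load is 100,000/sec.
assert partitions_needed(100_000, 5_000) == 20                      # 20 consumers today
assert partitions_needed(100_000, 5_000, growth_multiplier=5.0) == 100  # a year out
```

Provisioning for the year-out number up front (100 here, not 20) is what lets you skip the repartitioning event mid-year.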
Yeah, good. It's the first time anyone's ever given me a concrete plan of how to guess this.
It's kind of the best I've figured out that tends to work: to the earlier point, let's monitor, let's know where we're at, let's evaluate this, and let's really try to send as much data as we can through. Let's baseline our application. Because a lot of people don't even know what the baseline of their application is to begin with. And so if you don't know that, you don't know what the scaling capabilities are.
Yeah. It's often very hard to predict, but if you can simulate it, it gives you a lot more intelligence.
Oh yeah. Yeah. I mean, we have a number of tools that can help you simulate that data too. It doesn't have to be end to end, you can use the CLI tools or, if you really want to get into it, Trogdor, if you haven't played with that, comes with [inaudible 00:41:44]. That kind of-
I only know that from Strong Bad. What's Trogdor?
It is kind of a utility that comes bundled with Kafka that can let you really evaluate performance metrics. It usually goes unnoticed, but it is one of the major tools we use to evaluate the performance of every single release we do here. Make sure we're not regressing on performance when we make changes, things like that. It can be something that can help you test your applications as well, beyond just the typical producer-perf-test CLI tool or even JMeter. JMeter has some capabilities too, for those that have used JMeter and like it, you can do Kafka stuff.
What's Trogdor actually doing to test?
It's able to basically act as a producer, consumer at scale and really replay a number of different scenarios. Well maybe my batching needs to look like this, my request rate needs to look like that and you can really get into it.
And I think we actually have a bunch of stuff on Developer about it.
Okay. We'll link to that in the show notes.
It's on Developer, or a blog post?
That sounds like a huge topic in itself, but at least we know where to go to start simulating these things.
Oh yes. Performance simulation can be a couple hour long conversation.
Well, we'll move on because we don't... We'll spin that out to a separate podcast perhaps. But okay, going back to your list, something that looks like my task board, actually, because you've said it's dangerous to overcommit. Tell me about that.
Yeah. Yeah, this kind of starts out a lot of the time, you start getting into that multi-threading of the consumer, you start managing your own offsets and you start doing a boundary of commit. A lot of people try to just do it at the end of the poll or every record and stuff. And when you think about that, and think back to the producer's aspect, well, every commit is a produced record. Every time I commit something, I've got to produce a record back to Kafka to commit that. It's usually a record going into the consumer offsets topic, for those that haven't seen it, which is a compacted topic, if you didn't know. We've got all these records going back, and if I'm overcommitting, I'm overproducing. And the way the consumer does that is, one commit is one request to the broker, and getting to the other point, well, the broker handles requests.
And so I'm putting work on it in those aspects. I can overload the work queue and take away capabilities from just getting records in and out of the broker to just handling offset commits. The secondary effect that a lot of people don't even think about is the load that it's going to put on the compactor.
Which is this backend thread that goes around compacting these compacted topics, cleaning them up, stuff like that. Well, if I'm overcommitting, I've got now just a ton of data that that compactor has got to crawl through and compact out and get to its final stage, and that's going to put load on it. Maybe it's not going to get to your other compacted topics as fast as you would like it to. Now you're dealing with excess tuning there.
It's also going to put extra load on your disk volumes to deal with all that. It can cause a number of issues on the broker side if you're overcommitting, and some performance loss on the consumer. That single sender thread that those consumers have on the back end that talks to the broker now has to handle all those requests and the return trips and all that, while also trying to facilitate fetch requests. So it can be a performance penalty [inaudible 00:45:55]-
So while you're committing, you're kind of blocked on fetching the next one? Which makes sense. But-
It's able to do both, but it's now juggling both the sends and the response of both of those request types, right? The fetch consumer requests and the commit request types.
I'm pleased with that, because it's a very similar mental model that I have to worry about on committing those offsets as producing. I feel like the same driving principle is keeping the two efficient, so I don't have to think of two different principles at once, which makes me happy.
Yeah. The only downside is, you can't batch up commits into a single request. You're kind of stuck with the one at a time, because if we try to batch it, then that kind of removes the point of trying to-
You're trying to commit like a high watermark?
So your solution is to try and consume several records before you try and commit the offsets?
Yeah. I try to find a balancing act. Typically, we like to do either a count threshold or a time threshold of records. Time usually ends up being where everybody goes, because again, most people are thinking of records per second. They know their volume of records they're producing and managing and stuff like that. It's usually a concept everybody's familiar with. Doing it on a time boundary tends to come a little more second nature.
Okay. But that surprises me because it seems like something that's so domain specific.
It can be. You might go to a count threshold or a transactional threshold. Maybe you need to do a number of things before you can say, "Okay, my watermark moves." Like maybe you're putting data into a database table, maybe you're doing a transaction there, and you want to manage that commit that way. But a lot of the time that's not the use case, though it is out there. Most of the time, a time boundary or a number-of-records boundary tends to be the de facto everybody starts out with.
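A minimal sketch of that commit boundary in Python (the class and its thresholds are hypothetical, not part of any Kafka client library): commit when either a record count or a time interval is reached, whichever comes first, instead of after every record.

```python
import time

class CommitPolicy:
    """Decide when to commit offsets: after max_records records or
    max_interval_s seconds, whichever comes first. Committing per record
    floods the __consumer_offsets topic and the broker request queue."""

    def __init__(self, max_records: int = 500, max_interval_s: float = 5.0,
                 clock=time.monotonic):
        self.max_records = max_records
        self.max_interval_s = max_interval_s
        self._clock = clock
        self._count = 0
        self._last_commit = clock()

    def record_processed(self) -> None:
        self._count += 1

    def should_commit(self) -> bool:
        return (self._count >= self.max_records
                or self._clock() - self._last_commit >= self.max_interval_s)

    def committed(self) -> None:
        """Call after a successful commit to reset both thresholds."""
        self._count = 0
        self._last_commit = self._clock()
```

In a poll loop you'd call `record_processed()` per record and issue a single async commit whenever `should_commit()` turns true, then `committed()` on its callback.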
And does that cover everything we need to think about on performance, tuning the actual fetching of data?
Not completely. One of the setting sets that a lot of people just don't know about is, how do you tune the fetch? A lot of the time it starts out with, maybe I turn my consumers off for a little bit. Maybe my deployment is a red-green, or one of those types where I'm stopping everything and spinning up everything. Or maybe I had a bug, the whole thing didn't work, and I'm relaunching it. Or you're migrating an older batch-driven ETL system to Kafka, so you still have some of that consumption lingering. And what you usually tend to see a lot of the time is, you'll be watching your consumer lags, and you start seeing one or multiple partitions that are assigned to a consumer start falling behind. But you'll see one partition that's just fine for that consumer, while all the others are falling behind, and that's usually the symptom of needing to tune your fetches.
And there's two main ones that come into play. And like all things, again, we have time and size boundaries. In this case, it's usually the size boundary that's coming into play. When we make fetch requests to the brokers, we usually say one of two things: "I'm willing to wait this long to get my data," or "I want this much at the minimum or the maximum." And so we start needing to tune the maximum on that threshold, which is, I need to get more data back. Out of the box, the default is saying 50 megs of data that I want in total. But we do have another setting that goes unnoticed a lot of the time, which is our fetch per partition, or max.partition.fetch.bytes.
Which is one meg out of the box by default. And so if I'm doing ETL stuff, I'm probably going to want to bring that up. And a lot of people's first instinct is, "Well, let's take that to the 50 megs." Well, if I'm saying I'm willing to take 50 megs on a single partition, but my max fetch response size is 50 megs, well, now I'm locked on that partition. I'm going to keep reading from that partition before I go to the next ones, because the way all of this consumption works is, the broker wants to drain a partition first before moving its cursor to the next partition that it might be handling for you. So I can adjust that to where I say, even though that one partition might have 50 megs worth of data available to me, I have three partitions I'm consuming from as a single consumer-
Because I'm load balanced.
Yeah. Yeah. I'm going to want to set my fetch max as a whole to something big enough to account for all three of those partitions getting a chance to send data back to me in that fetch request, so I'm not locking on that one. Because that broker's going to kind of say, "Okay, well, here's your first meg from this partition, I've maxed out that partition's setting you told me, I'm going to give you the same amount from the next partition," and so forth. So this kind of helps balance that issue that pops up.
Right. In that case, am I going to say to myself, "Well, once it's load balanced, my consumers will probably consume from three partitions on average, so divide the big value by three and that will be my smaller value"?
You can do it that way. That's usually a good baseline to start with when you start getting into it. I mean, out of the box, remember, the default is 50 megs total, one meg per partition. Usually where people start out is they start manipulating that partition fetch and they start increasing it, and forget about the other one. Before they know it, they're fetching 50 megs and they're locked on one partition. It is a balancing act as time goes on, as your volume increases and maybe your performance gains increase. Especially if you start doing that multi-threaded consumption model, maybe you start getting more threads on a single instance. Maybe they're able to handle a lot more. Maybe you're not keeping up with that thread queue and you're starving those threads. So you start increasing those fetch capabilities and things like that, increasing that buffer. So just some things to keep in mind as you start tuning: tune both and account for both.
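The two settings being described are the consumer's `fetch.max.bytes` (50 MB default) and `max.partition.fetch.bytes` (1 MB default). A small sketch of the sizing rule (the helper function is illustrative, not part of any client library): scale the total fetch with the per-partition cap so no single partition can monopolize the response.

```python
def fetch_settings(partitions_per_consumer: int,
                   max_partition_fetch_bytes: int = 1_048_576) -> dict:
    """Size fetch.max.bytes so every assigned partition can contribute a
    full max.partition.fetch.bytes chunk to one fetch response, rather
    than one hot partition consuming the entire fetch budget."""
    return {
        "max.partition.fetch.bytes": max_partition_fetch_bytes,
        "fetch.max.bytes": partitions_per_consumer * max_partition_fetch_bytes,
    }

# Three partitions per consumer, per-partition cap raised to 4 MB:
cfg = fetch_settings(partitions_per_consumer=3,
                     max_partition_fetch_bytes=4_194_304)
assert cfg["fetch.max.bytes"] == 12_582_912  # room for all three partitions
```

The trap Nikoleta describes is raising `max.partition.fetch.bytes` to equal `fetch.max.bytes`: one partition then fills the whole response and the others starve.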
What we're going to have to put in the show notes is a cheat sheet for all the parameters Nikoleta says you really should know about.
There's a few of them. There's a lot of metrics too, to keep an eye on, that go overlooked. Like the earlier stuff on the producer: the metrics around records per request, the average compression ratios, things like that. They're usually ones people don't think about. They're wanting to monitor and make sure the system's healthy, but usually performance of the system goes overlooked initially.
Yeah. Yeah. It's true. I've got a lot of sympathy for the people on the other side of your desk when you're talking to them, because, like you say, out of the box it seems to work quite well. And you go and look at the docs for which parameters look interesting, and there are loads of them. How do you navigate?
Well, and the docs alone are ominous. You go through them and it's hundreds of these things and you've got to read every single one of them. And so it's a lot to do. It is one of those things I would highly encourage those that are wanting to get into really understanding those, to check out many of the videos we're doing on these and our education team's work they've done, because they kind of help you understand exactly what that parameter covers and what it's actually doing. Because it's one thing to read that blurb that is in the docs, it's another thing to actually truly understand what you are doing. If you were to read the stuff around the producer batching, you might not get the full picture of what's actually going on.
Yeah. Yeah. And we have a course on the internals of how Kafka works by Jun Rao, which feels like it's definitely your first port of call for understanding this stuff under the hood.
We'll link to that in the show notes. We have a podcast with him too. We'll link to that in the show notes. Richly cross referenced podcast this one.
Oh yes. Anything from Jun is worth watching.
Yeah. Yeah. He's a smart guy and he knows the insides of it like the back of his hand.
Okay. I have another one from your list and you've touched on this briefly, but let's go into it in depth now. It's a mistake to forget to provide a ConsumerRebalanceListener. What's that? And why should I be providing it? And what should it do?
Yeah. Kind of similar to the producer callback, this is more on the consumption side, and it's not really dealing with the per-record. It's dealing with the failure scenario. When you start managing your own offsets, most people forget this even exists, or never noticed it, because typically it's another one of those properties, or, if you're in the librdkafka realm, it's a function call that you've got to make, or some definition of the kind. And I'll speak to the Java side: it's an interface that you implement that gets called when there's a rebalance event. And rebalance events happen for a number of reasons. You're redeploying your consumer applications. You've got consumers going online and offline as they roll through their restarts, or a broker went down and I need to rebalance where the listeners go, whatever. So the consumer group needs to rebalance the partition assignments, right?
This function actually gets called on every single consumer before the rebalance starts. And there's two functions defined in this interface. One is, "I'm about to rebalance, close things up." The other is, "The rebalance is done, here's your new assignments." And this kind of gives you a chance, especially if you're doing the thread model: "Hey, the rebalance event's about to happen. Let me stop the work I'm doing, because I might lose partitions."
And it will tell you, if you're using the new cooperative sticky stuff, you get some hint as to which partitions are going away and which aren't. But it gives you the chance to stop the work a little bit, wrap up where you're at. Because you can hold it for a period of time. There is a timeout, so be cautious of that. But you can hold the rebalance a little bit, wrap up what you're doing, do your final offset commits, clean yourself up.
Oh, so you are allowed to...
Sorry. You are allowed to like do that last bit of work of saying commit my offsets before it goes down?
Yeah. That's in effect what it's designed for: to give you that opportunity to say, "Okay, something's going to happen. The partitions I'm currently latched onto are going to change. Somebody else is going to get this work. If I don't commit my offsets now, there's a potential for duplicate consumption. I might have finished the work, but I haven't committed offsets yet; there's that differential. This is your chance to commit that and not replay the work. Cancel the work in progress, clear any work queues that you might have." It's kind of your chance to clean up, get ready for new records to come from different partitions.
Hit save before the laptop reboots. That kind of thing.
And that gets broadcast to every consumer in the consumer group who's still in the group. So if you left, you issued your leave-group request, you're probably not going to get it. But as long as you're in it, you'll get it and you can handle it. Every consumer gets it, so everybody gets to handle it. It's not just the group leader that's going to get it. It's not like trying to do your own consumer coordination or any of that.
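A rough sketch of that listener in Python, mirroring the shape of Java's `ConsumerRebalanceListener` (the `commit` and `cancel_inflight_work` callables are hypothetical hooks your application would supply, and the offset placeholder is illustrative):

```python
class GracefulRebalanceListener:
    """Sketch of a rebalance listener: wrap up and commit before
    partitions are reassigned, so the next owner doesn't replay work."""

    def __init__(self, commit, cancel_inflight_work):
        self._commit = commit          # e.g. wraps consumer.commitSync(offsets)
        self._cancel = cancel_inflight_work

    def on_partitions_revoked(self, partitions):
        # Called before reassignment: stop in-flight work, then commit the
        # last fully processed offsets for the partitions we're losing.
        self._cancel(partitions)
        self._commit({p: "last-processed-offset" for p in partitions})

    def on_partitions_assigned(self, partitions):
        # Called after the rebalance: (re)initialize state for the new set.
        pass
```

In the Java client you pass such an implementation to `KafkaConsumer.subscribe(topics, listener)`; in librdkafka-based clients it's the rebalance callback instead.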
This feels like a real post-research-and-development concern. This is when you're really into production. You're living with the reality of computers coming up and down all the time.
Yeah. I mean, it's never a factor of, is it going to die? It's always a factor of, it's going to die, when is it going to die? And am I going to handle that death gracefully? Especially when we start thinking Kubernetes and its whole cattle-herd model of deployment management.
Yeah. If you're prepared for it, then you're going to survive it. Okay. Looking at my list again, there's so much you've given us here. Undersized Kafka consumer instances. What does that mean?
This kind of plays into some of it, and it happens on the producer side too, which is, a lot of the time when you first start out, maybe you pick that small AWS instance node type, for whatever reason; you wanted three or four parallelizations. But as time's gone on, you're now at 10, 15, 20 consumers in your pool. They're spreading that work all across each other, but they're really small, right?
And on both sides, well, now I've increased my number of producers or consumers, and the work they're doing, yes, it's taking up maybe the one or two CPUs on that node. It's better if I can actually shrink that pool. Let's go to four, five, six CPUs, because now I'm reducing that spray. That earlier concept: it's not just on the producer going out or the consumer going out to the brokers, it happens on how many producers I have with data hitting those brokers.
And vice versa on the consumption side of that, which is, well, if I have all my data spraying against that front to begin with, well, that data's going to be sprayed coming out the backside, because I've reduced my potential for batching, right?
A record that could have gone to the same partition in the same batch, if it landed on the same producer, well, now it landed on this producer instead of that one. Sometimes it's best to not just go out, but also go up. Let's vertically scale that up a little bit. Let's densify those, for those reasons of increasing those batch efficiencies.
Let's increase those request efficiencies. Now I'm getting that single consumer fetch request able to get a lot more data from a number of partitions. Let's spread that work. Maybe we start really moving into that multi-threaded model a little more and taking advantage of it. Because if I keep scaling out horizontally, I'm going to run into those same earlier problems: I'm adding more partitions, I'm adding more consumers to the consumer group, I'm adding more producers to their side of the pool. And I'm just spreading myself thinner, when I could come back up a little bit, go a little vertical, and get a little more bang for my buck.
Yeah. Again, you've got trade-offs in the different ways you scale, and you've got to be aware of both and decide which is going to fit the underlying model best.
Yeah. Yeah. And I mean, it's a little bit of a rubber band. It's good to start out, let's go horizontal a little bit as time goes on, but at a certain point let's reevaluate and pull that elastic back in a little bit, go a little more vertical, tune different properties, and then go a little more elastic again. Because some of these adjustments, like changing your instance type size or adjusting settings, have a little more effort involved than just adding another node to the pool.
So, it's kind of one of those things of, maybe every quarter, every other quarter think about, "Okay, how can we bring this back in? We've stretched ourselves out a little bit, let's bring in, let's adjust, let's make that part of our quarterly retrospective of, okay, what have we done? How have we grown? Let's adjust, minimize that, get reset for the next quarter."
Yeah. Yeah. We've talked a lot about adjusting lots of different parameters and changing layouts and stuff. I just want to be clear here. Is this something I'm expecting to do a lot or is it in response to changing growth? Is Kafka this constant you're always going to be maintaining it and tuning it thing? Or is it life's going to change around your business and you have to respond to that?
I think it's a little of all of the above, really. It depends on the use case. There are a lot of use cases out there where the work doesn't grow. Maybe it's a simplified workload, it's different in whichever way, your growth is minimal. So a lot of these settings may be something you look into once a year, or once every other year, or once every five years.
And in some other areas, maybe you've got Kafka in that main line of your business. You're handling events, you're growing successfully as a startup, or even as an established enterprise on this new workload, or you're gradually migrating data. Well, all that comes with some effort that needs to be put forth. While we try to make Kafka as scalable and as low-maintenance as possible, sometimes you've got to put a little more effort in. You've got to figure it out, tune some settings, and dive deep.
Yeah. Yeah. Okay. Let me see if I can wrap this up with what's top of my mind going away from this, and you can tell me if I've missed anything. I don't really want to reduce the number of partitions I've got, I need to understand batching and compression, and I should really worry about those callback hooks. And if I do that, I won't be too unhappy in the future. I'm monitoring. And monitoring.
Yes. Don't forget monitoring.
If I go away from this with those four things, will I be happy?
You'll be happy for a good while. That'll get you to some pretty reasonable scales, really tuning the batching and stuff. This is that next tier. This is going from "I'm just starting out with Kafka" to "I'm really using Kafka, I need to scale this up." Right?
And so that's where a lot of these settings and monitoring and metrics really start showing their worth, as that next stage. I'm getting a little more advanced with my Kafka usage.
Yeah. Okay. Well, happy for a long time is probably the best guarantee I've been offered in a while. I think we should probably leave it there, while I could pick your brains for another few hours. We'll probably have to bring you back for more another day. But Nikoleta, thanks very much. This has been really educational and interesting. Thanks for being on the show.
Yeah thanks. Yeah. Thank you.
And that was Nikoleta Verbeck, sharing some common mistakes and a lot of hard-won knowledge. I think there are going to be a few people who get to the end of this episode and start tweaking a few cluster parameters and batch sizes. Good luck if you're one of them. We're going to put as much of that detail as we can in the show notes. But if you want some more, then take a look at developer.confluent.io. That's our education site for Kafka, and it will give you in-depth guides to a lot of what we've covered today. There's also a really useful course by Jun Rao that explains how Kafka works under the hood, through so many different parts of the system. And if you look at that, you'll really understand how Kafka works and exactly how Nikoleta's advice plays out as your data travels through network threads and I/O queues and onto disk.
That course also has exercises to cement your knowledge. And if you want to go through those, you'll probably want a Kafka cluster to play with. The easiest way to get one is with Confluent Cloud. Sign up with the code, PODCAST100 and we'll give you $100 of extra free credit so you can spend longer on it. Meanwhile, if you have thoughts or questions about today's episode, please get in touch. My contact details are in the show notes, or you can leave us a comment or a like, or a thumbs up, or a review just let us know that you've enjoyed it. And if you are out there saying, "I already knew everything in this episode," well, you should be a guest on a future episode, my friend. So get in touch. And with that, it remains for me to thank Nikoleta Verbeck for joining us and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.
What are some of the common mistakes that you have seen with Apache Kafka® record production and consumption? Nikoleta Verbeck (Principal Solutions Architect at Professional Services, Confluent) has a role that specifically tasks her with performance tuning as well as troubleshooting Kafka installations of all kinds. Based on her field experience, she put together a comprehensive list of common issues with recommendations for building, maintaining, and improving Kafka systems that are applicable across use cases.
Kris and Nikoleta begin by discussing the fact that it is common for those migrating to Kafka from other message brokers to implement too many producers, rather than the one per service. Kafka is thread safe and one producer instance can talk to multiple topics, unlike with traditional message brokers, where you may tend to use a client per topic.
Monitoring is an unabashed good in any Kafka system. Nikoleta notes that it is better to monitor from the start of your installation as thoroughly as possible, even if you don't think you ultimately will require so much detail, because it will pay off in the long run. A major advantage of monitoring is that it lets you predict your potential resource growth in a more orderly fashion, as well as helps you to use your current resources more efficiently. Nikoleta mentions the many dashboards that have been built out by her team to accommodate leading monitoring platforms such as Prometheus, Grafana, New Relic, Datadog, and Splunk.
They also discuss a number of useful elements that are optional in Kafka so people tend to be unaware of them. Compression is the first of these, and Nikoleta absolutely recommends that you enable it. Another is producer callbacks, which you can use to catch exceptions. A third is setting a `ConsumerRebalanceListener`, which notifies you about rebalancing events, letting you prepare for any issues that may result from them.
Other topics covered in the episode are batching and the `linger.ms` Kafka producer setting, how to figure out your units of scale, and the metrics tool Trogdor.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.