Mark Twain talked about liars and statistics, but he never once mentioned benchmarks. Coming up with an honest test built on open tools running in an easily documented, replicable environment and doing it for a distributed system like Kafka is not simple. I got to talk to Alok Nikhil about how he did it anyway. Listen in on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined today by my colleague Alok Nikhil. Alok is a, I should say, an engineer on the cloud native team here at Confluent, so he works on things related to Kafka running in the cloud. Alok, welcome to the show.
Hi Tim. It's my pleasure to be on this. Thank you so much.
You got it. What would you add to that? I gave a very short description of what you do, but what would be your longer account of your job?
Yeah, definitely. I actually came into Confluent about six months ago. Prior to that, I worked at AWS and Cisco, so most of my work has been around building cloud native platforms, be it in the networking space or in the database space. At Confluent, most of my work still revolves around getting Kafka on the cloud and how best we can leverage our managed solution to provide our customers with high performance and high scalability. In general, just make it easy to host Kafka. Yeah, that's basically what I look at.
Nice. Now, you and Vinoth Chandar recently wrote a blog post about benchmarking Kafka against Pulsar and RabbitMQ. That's a super nuclear topic for you to write about and then agree to come on a podcast about. I admire you. Because you know how stuff like this goes, right? I mean-
Yeah, benchmarks are a fraught enterprise. As I've always said, I've said this publicly many times, Mark Twain said there were lies, damned lies, and statistics. I put benchmarks even after statistics on that list. I mean, I guess, like being a conspiracy theorist, occasionally there really are conspiracies. Yeah, sometimes vendors probably just outright lie about benchmarks. I don't think that happens much, though. I don't think there are many actual, real, successful conspiracies. I don't think there are a lot of lies in benchmarks. It's just that it's easy to tune benchmarks to make yourself look good and your competition look bad, even if you're talking about open source things. You're competing for mindshare and adoption; it's still a competitive environment.
And they always lack context, and this is why, when I get performance questions, when I'm talking to developers in a community setting, I always say, "Hey, look, there are benchmarks in the world, but your benchmarks are the ones that matter. You have to test the way things perform in the context of your system, and that's the best benchmark." But even that is oversimplified, because as we're going to find out, there's just so much you have to know about systems to measure their performance well. So that's Tim's philosophy and overall skepticism of benchmarks. This is fascinating stuff, and as we dig through it, we're going to learn a lot. Just tell me, I guess, to start with, what did you set out to explore here between these three systems, and why?
Yeah, I think that's a good introduction to benchmarking. You're right. Just as a precursor to these posts, we actually deliberated quite a bit on how exactly we want to measure and what we actually even want to achieve here, right? You're right in the sense that benchmarks are, more often than not, skewed in a lot of ways toward the vendor that's putting out that benchmark. But it's also a way for people who have never seen either of those systems in production to see what they're capable of, irrespective of how the numbers are put out. But yes, definitely everyone has to do their due diligence before they adopt any of these into production.
With respect to just what we wanted to do with Kafka and RabbitMQ and Pulsar, based on the data from the field, based on the community, we were looking at upcoming platforms. We were also looking at Kafka's performance in general, which hadn't been benchmarked for about six years. I think the last time this was done was when Jay wrote a blog post demonstrating, I think, about two million writes per second on three commodity machines. So we wanted to look at all the contributions the community had been making over the past six years and how that actually impacted Kafka's performance. [crosstalk 00:05:23]
Right. It's funny that two million writes on three commodity machines, good point, might be time to update that. We've been talking about that for a little bit now.
What have you done for me lately?
Yeah, so that's the thing that we were looking at. And it was also a data point, not just for Confluent when it talks to its customers, but also for Kafka in general: at the summits, when people talk about performance and scalability, we just needed some kind of a reference point to talk about. So that's what, primarily, we wanted to establish in this benchmark, and definitely we wanted to look at the merits and the demerits of the competing systems, which were Pulsar and RabbitMQ.
And we focused primarily on these two because Pulsar is a more upcoming competitor. We've seen that their architecture is quite different from what Kafka does, and we were also very curious to see how that would play out in an event streaming setup. RabbitMQ has historically been very popular with AMQP applications. Messaging with low latency is kind of where RabbitMQ fits very well, so we wanted to focus on that and see how well Kafka performs against it.
Cool. It seems like durability trade-offs are at the heart of a lot of performance optimizations. As I look at Kafka, because I'll confess, I'm not really your guy when it comes to performance tuning Kafka. I mean, I can talk about the concepts and I know where the knobs are. I definitely don't know that as well when it comes to Kafka and Rabbit, I'm happy to say that, happy to admit that. But it does seem like durability is where the money is. First of all, am I right about that? And if so, where are the durability trade-offs in each one of these? If you could take me through Kafka and Rabbit and Pulsar, how do you see those trade-offs that are going to impact end-to-end performance?
Yeah, definitely. There are many different aspects to measuring performance and how different systems are impacted by the knobs that you have, and each of these systems provides its own durability guarantees. Primarily, we are looking at durability as a north star, in a way, because that has the biggest impact on these systems. For example, if you look at RabbitMQ, the message durability that RabbitMQ offers by default is very loose. It doesn't really offer much message persistence unless you actually turn it on, which means it can optimize messaging, or message brokering, to be almost all in memory, which considerably improves performance.
So we wanted to establish the baseline for what the expected durability guarantees are across these systems, and that's why we decided to look at all the trade-offs that each system offers. So yeah, particularly, we were looking at Kafka's and Pulsar's and Rabbit's default configurations and how each of those fare against each other.
Can you talk to some of the trade-offs in Pulsar, understanding that this is a Kafka podcast, but certainly Pulsar aficionados are listening, so those trade-offs in ways that those aficionados will agree with?
Definitely. As I was saying, Pulsar's architecture is a bit different from Kafka's. In fact, if you look at Kafka's core, it's a set of append-only commit logs distributed across the brokers. Pulsar, on the other hand, offers more of a tiered approach to compute and storage, where the storage is offloaded onto another Apache open source project, BookKeeper, and BookKeeper follows more of a quorum-style storage model, versus Kafka's replication-based durability. That's what we were looking at: what does Pulsar offer by default in terms of recovery, how fast recovery can be done, and the failure scenarios? Because BookKeeper, and therefore Pulsar, uses quorum-style replication and storage, we were especially interested in trying to understand how this persistence actually plays an important role.
For Kafka, in fact, if you look at replication, replication is pretty much the main guarantee that Kafka gives for availability, as well as durability. A lot of the protocol is centered around understanding and building for disk failures, so it doesn't necessarily assume that the messages being persisted to disk will stay there. It relies on the fact that every time, let's say, a broker crashes and restarts, it knows there could be some messages that were never committed to disk, and the replication is built on recovering those from the rest of the brokers. For Pulsar, on the other hand, we were not really sure, primarily because the documentation doesn't really explain it. And, in fact, we even put it out in the original blog post, trying to understand what exactly, rather, why exactly Pulsar needs to persist to disk.
But from our initial pass over the code and trying to read it, we feel like, without persisting to disk, Pulsar cannot offer the same message durability guarantees. That's something we were looking at, because syncing to disk and waiting for the writes to come back does add significant overhead to throughput, as well as latency. So that's basically why Pulsar's durability was kind of up in the air throughout the blog post.
Does anybody's benchmark ever involve synchronous fsyncing? That sounds unthinkable to me.
That's a fair point. In fact, if you think of what Kafka does by default, we actually do not flush every message to disk. Kafka's greatest win is the simplicity of its architecture: it leverages the OS's page cache for most of the reads, and it also leverages the operating system's flushing policies. So we don't necessarily think that's a deal breaker in a lot of ways. And, as you said, it's not very common for people to actually sync every message to disk, especially considering, in the cloud, you probably have some kind of a layer, like Elastic Block Storage in AWS, which doesn't necessarily behave the same as an SSD when you try to sync to disk.
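The page-cache trade-off Alok describes is easy to see from userland. Here is a minimal Python sketch, purely illustrative and unrelated to Kafka's actual implementation, contrasting writes that stop at the OS page cache with writes that fsync each message to the device:

```python
import os
import tempfile
import time

def write_messages(path, messages, fsync_each=False):
    """Append messages to a log file, optionally fsyncing after each write."""
    start = time.perf_counter()
    with open(path, "ab") as log:
        for msg in messages:
            log.write(msg)
            log.flush()                  # hand the bytes to the OS page cache
            if fsync_each:
                os.fsync(log.fileno())   # force the write down to the device
    return time.perf_counter() - start

messages = [b"x" * 1024] * 1000  # a thousand 1 KB messages

with tempfile.TemporaryDirectory() as d:
    fast = write_messages(os.path.join(d, "cache.log"), messages)
    slow = write_messages(os.path.join(d, "fsync.log"), messages, fsync_each=True)
    print(f"page cache only: {fast:.4f}s, fsync per message: {slow:.4f}s")
```

On most machines the fsync-per-message variant is dramatically slower, which is exactly the cost Kafka avoids by default: brokers expose flush settings such as log.flush.interval.messages, but out of the box they leave flushing to the OS and lean on replication for durability.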
No. No, there are layers upon layers of lies in what successful write means when the operating system... some API tells you something was written. You have no idea what that means.
Ever. Even, frankly, if you have a locally attached IDE drive in the server you're looking at, as you know, when you really dig into what a write means and whether things were recorded to physical medium, it's kind of hard to really know. And so-
... like you said, Kafka's approach is not to take any of that for granted but provide guarantees based on replication and the watermark of, you don't call it a watermark, but what messages are readable.
That's correct. That's correct.
Yeah. Okay. You used a framework for this called the OpenMessaging Benchmark. Could you tell me a little bit about that?
Sure. Coming back to the thing that you brought up at the beginning: it's easy to skew most of these tests, and not just the tests, but the construction of the framework itself, in a way that could not just make Kafka look good, but could also make Pulsar and RabbitMQ look bad. We didn't really want to tread those waters, because they're really murky, so we decided to go with a framework that the community, at least, considers vendor neutral, which is the OpenMessaging Benchmark. About two, maybe three years ago, there was a report put out by the Pulsar folks that used the OpenMessaging Benchmark, and they did start the OpenMessaging project as a way to standardize benchmarking for event streaming workloads.
So we wanted to see if we could continue adopting that and, in a lot of ways, help improve that framework further and help evaluate, maybe, more event streaming platforms that might come up, or improvements in our existing platforms. We decided to go with that, and that was actually a tricky situation for us, because on the initial pass, it did look like OpenMessaging Benchmark had not had significant commits for about a year. Yeah, so there were a lot of gaps in how the configuration was laid out. Particularly, we noticed that Kafka's configuration was kind of unfair, primarily because we have the concept, or rather, the option to use multiple producers when you actually [inaudible 00:14:55] the workload. But Kafka's driver configuration was forced to use a single TCP connection to drive all of those producers.
Yeah, so this is a big issue, especially in the cloud. When you think of a single TCP connection in the cloud, you are limited by packet rates, because of what the cloud architecture can do. So even if you use big, jumbo frames and try to push as much data as possible in each packet, you're limited by the maximum throughput that you can theoretically achieve, which is way lower than what, let's say, two or three producers can do on their own dedicated TCP connections.
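To make the packet-rate limit concrete, here is some back-of-the-envelope arithmetic in Python. The numbers are purely illustrative assumptions, not AWS's published limits; the point is only that a per-flow packet-rate cap bounds a single connection's throughput regardless of the instance's aggregate bandwidth:

```python
# Illustrative per-flow figures (assumed for this sketch, not real AWS limits):
packets_per_second = 100_000   # hypothetical packet-rate cap on one TCP flow
jumbo_frame_bytes = 9_000      # jumbo-frame MTU, the best case per packet

# Even at the jumbo-frame MTU, one flow tops out at the packet cap times
# the payload per packet.
per_connection_mb_s = packets_per_second * jumbo_frame_bytes / 1e6
print(f"one connection: ~{per_connection_mb_s:.0f} MB/s")

# Three producers on their own dedicated connections can, in principle,
# triple that before hitting any shared instance-level limit.
print(f"three connections: ~{3 * per_connection_mb_s:.0f} MB/s")
```

This is why forcing all of the benchmark's producers through one TCP connection understates what Kafka can actually drive.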
Yeah. I think it was Larry Ellison who said the cloud is just somebody else's computer. It turns out it's also somebody else's network infrastructure, which is very carefully configured and you normally just don't think about the fact that there are switches and there are things, and they have policies, and you don't get to violate those policies.
That seems like that might not work out so well.
And the funny thing, because you say, "Ooh, there hasn't been a lot of commits in a year," and I had this visceral reaction of, "Oh no, it's dead."
Which is what we assume, and that's usually not wrong, but it's kind of funny that we just assume that software can't be complete. It can't be like, "Well, no, this is the thing that we set out to write and we don't have any more work to do on it, because it works now." The world around it is changing and, like you said, there was Kafka configuration that wasn't right, so probably that wasn't a good thing. But it's just funny that we assume, in a tool like this, there must be significant ongoing development for it to be healthy, and I don't know that that's always true. But you were able to get past that and made the decision to adopt OpenMessaging Benchmark.
Yeah, that's correct. I think, overall, we did notice that there were a lot of merits to it, right? The way the workloads were laid out did seem sane to us; they were quite representative of real world use cases. Primarily, if you're looking at, for example, an event streaming platform, your focus is on how good sequential message processing is, or rather, how well each platform can leverage the underlying hardware to offer the best sequential performance. And if you compare and contrast with a typical database, we don't really do point-in-time selects with streaming solutions.
You just don't.
Yeah, exactly. So it's almost like you never actually find any event streaming solution that even implements an API for that, right? We kind of considered that as one of the constants that we can measure against, so that's why we just decided that, "Hey, this makes sense to an extent," which is, we'll just leverage the existing workloads that the framework offers. And, in a lot of ways, one of the things that we really liked about the framework was the reporting at the end, where it summarizes the results of the run. It did include all the relevant metrics that we wanted, which was basically the produce and consume rates, the backlogs, if any were growing, and then the latency numbers themselves, which is the end-to-end and the publish latency. It kind of checked all the boxes that we were looking for in monitoring. And yeah, so that's why we just decided to go with this.
Cool, cool. And so, that's interesting that you say measuring how you utilize the underlying hardware to provide the best sequential message processing performance. Could you walk me through the actual tests that it does that you used? And, I guess, buried in that question is, this is a series of tests and configurations, what is the benchmark? It's a repo. What's in the repo?
Physically, what is it?
Sorry. Absolutely, yeah, that's right. Yeah, by the way, we did actually open source it, as you said, just so that the community, or anyone who's interested in reproducing these numbers, could reproduce them, so we tried to be as transparent as possible. With the test setup itself, the repo actually contains two things: one is the definitions for the workloads, and I'll just get to that in a bit, and also our deployment configuration that allows anyone to stand up a cluster in AWS with the relevant machines that we used in the blog post. For example, with the hardware that we are using here, we configured three brokers, one monitoring node, and four clients to run these benchmarks. And with the brokers themselves, we kind of tried to use the original yardstick that they came up with, which was the instance types.
We tried to focus on what could actually be a bottleneck for these event streaming workloads. I mean, the ideal situation, or the ideal event streaming platform, would probably max out the disk while still having some headroom on the CPU, and not really a lot of memory, considering that the consumers were keeping up with the producers. So we tried to focus on that to see if we could use that as a reference point, and then come up with an instance type that gives you the best bang for the buck, where you actually get the maximum streaming performance out of it. So that's where we looked at what the existing instance type was. And, as I said, because it was [inaudible 00:20:44] and it didn't really receive much activity for about a year, the instance types themselves were kind of outdated, and AWS, in a lot of ways, it's funny, because AWS releases new instances very, very frequently.
We actually noticed that the instance type that they used was the I3 class of instances, which is great for storage. But the funny thing is, AWS had already come up with another one, the I3en, which is storage optimized and also offers great networking, which I found to be almost table stakes when you actually do streaming platforms, right? I mean, if you're bottlenecked by your network, it doesn't really matter how fast a disk you have. So that's where we decided on the I3en class of instances, and we went with the [2xlarge 00:21:35], which basically has about eight cores, 64 gigs of RAM, and a couple of 2.5 terabyte NVMe SSDs. So yeah, that was the hardware setup.
And going past that, we also looked at establishing the baselines of this hardware, which is how fast the disk could go and how fast the network could go, and we noticed the disk capped out at about 330 megabytes per second, which is not bad, to be fair, primarily because this is a 2xlarge, the cheaper variant of the I3en class of instances. And AWS does this thing where, depending on the size of the instance, which is just the number of cores and the memory that you get, it caps the throughput of the disks, so that's another interesting thing we needed to consider. And, as I said, when we look at this benchmark, it's not just anyone's machine running on-prem; it's in the cloud, and we need to find the best fit in the cloud for these platforms. So that's why we decided to focus on these aspects, and we decided the I3en 2xlarge satisfied that yardstick.
Got it. And so, when I'm looking for instances to run Kafka on the cloud, what are my takeaways? I think I heard you say consider network to be your limiting factor first and then disk and then CPU and memory, is that basically the priority?
Yeah, that's pretty much it.
Is it that simple?
Yeah, to be fair, that's actually pretty much it. As I said, because Kafka has a simple architecture, it doesn't really complicate things in terms of configuration. So the thing that you brought up, which is memory being the lowest priority here, is actually very true, especially for Kafka. One of the things we noticed is that when we switched instance types from I3 to I3en, for the same size of instance, the 2xlarge, they offer a significantly different amount of memory from the previous generation to the new generation. In fact, they actually go lower in the new generation, because they brought the price down and a bunch of things like that, right? For us, it was actually a lot easier to get Kafka and, in fact, even RabbitMQ up and running without much change to the configuration than Pulsar, which takes the approach of managing its own cache and its own off-heap memory.
We found it a bit more difficult to start tuning it, because we would see a lot of these out-of-memory crashes on Pulsar. So from that perspective, yes, I would say for someone looking to use Kafka in the cloud, I would consider networking to be their biggest metric that they have to measure first, primarily because, as you will see, as I walk through the results, you'll see Kafka can really, really, really make use of the disk.
Got it. Let's talk about some of those results.
What did you find?
Yep. Basically, before I get into that, let me just go into the workload a bit. The workload itself was a typical streaming benchmark, I would say. It was not too complicated. In fact, it was very similar to the one that the Pulsar community had done about three years ago: we had a single topic with about 100 partitions, configured for three-times replication for fault tolerance, with three brokers. And then we had a single message size coming in at about a kilobyte, we configured the batch sizes to be about one megabyte, and we gave a linger of about 10 milliseconds, which is basically how long the producer waits before sending the message out. And we also made sure that we kept all of those durability guarantees that we spoke about earlier, so we decided to keep synchronous replication in Kafka, rather, configure Kafka to do synchronous replication by setting acks to all. So you wait for every single produce to be replicated before you actually ack back to the producer.
And then we decided to test across a range of rates, which was about 10,000 messages per second, then 50,000, 100,000, 200,000, 400,000, and about a million. But these were target rates, not necessarily the rates we actually achieved; these were the target rates we were going for. We, of course, focused on the main two things, which were the throughput that the system could push and the latency the system achieved in trying to push a single message end-to-end. From the throughput perspective, I would say Kafka actually did the best, primarily because of the OS page cache, as well as the fact that it kept things simple.
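As a sketch, the workload just described looks roughly like this. The field names are illustrative Python stand-ins, not the exact keys in the OpenMessaging Benchmark's YAML files, though batch.size, linger.ms, and acks are real Kafka producer settings:

```python
# The benchmark workload, restated as plain data.
workload = {
    "topics": 1,
    "partitions_per_topic": 100,
    "replication_factor": 3,             # three brokers, synchronous replication
    "message_size_bytes": 1024,          # ~1 KB per message
    "target_rates_msgs_per_sec": [10_000, 50_000, 100_000,
                                  200_000, 400_000, 1_000_000],
}

# Kafka producer settings matching the description above.
kafka_producer = {
    "batch.size": 1_048_576,   # ~1 MB batches
    "linger.ms": 10,           # wait up to 10 ms to fill a batch
    "acks": "all",             # wait for full replication before acking
}

# Raw payload implied by the top target rate, before batching and
# replication overhead:
peak_mb_s = (workload["target_rates_msgs_per_sec"][-1]
             * workload["message_size_bytes"] / 1e6)
print(f"1M msgs/s at 1 KB each is ~{peak_mb_s:.0f} MB/s of payload")
```

That top target rate alone already implies more payload than the brokers' disks can absorb, which is why the achieved rates fall short of the targets at the high end.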
We had two disks, both capable of doing about 330 megabytes per second. And the theoretical, rather, the practical maximum that we could achieve without using, let's say, Kafka or anything, maybe using a tool like dd to measure it, was about 655, 660 megabytes per second. And Kafka came really close, at about 650 megabytes per second. We noticed that there was no CPU bottleneck, and we had taken the [inaudible 00:26:49] bottleneck out of the equation, too. We basically saw that Kafka pushed the disk to the limit at that point.
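Putting the figures from the discussion together, a quick sanity check of how close Kafka got to the measured ceiling:

```python
# Aggregate disk ceiling vs. what Kafka actually pushed, using the
# numbers quoted above.
disks = 2
per_disk_mb_s = 330                     # measured with a raw tool like dd
practical_max = disks * per_disk_mb_s   # ~660 MB/s across both disks
kafka_mb_s = 650                        # Kafka's observed throughput

utilization = kafka_mb_s / practical_max
print(f"Kafka reached {utilization:.0%} of the measured disk ceiling")
```

At roughly 98% of what dd could push through the raw devices, the disks, not Kafka, were the limit.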
Ah, so closest to the theoretical maximum for the disks. And by the way, that theoretical maximum of 660 megabytes per second is the same on the hardware for the Pulsar and the Rabbit deployments, as well?
That's correct. That's correct. We decided to keep the same instance types for the brokers across the board. Same number, all of that was configured to be the same. Yeah, so the hardware was pretty much the same.
Gotcha. And what kind of latency did you see at that peak?
That's a good point that you brought up, which was at the peak, because, for us, we were trying to measure the stable throughput at which the systems weren't internally overloaded. Kafka actually, at about 600, maybe 650 megabytes, the latency is about, I would say, 60 to 70 milliseconds, but that is only because we're pushing the disk to its limit. Let's say [crosstalk 00:27:52]-
Ah, yeah. And that's p99 we're talking, right?
Exactly. That's p99. You're right. The only issue is, basically, if you start pushing the system... So, for example, when we compare this to Pulsar, it's a little more interesting when you look at Pulsar's performance, which was about 320, 330 megabytes per second, which was exactly the single-disk throughput limit. Pulsar's architecture is a bit different: they use a journal disk and a ledger disk. They use the journal disk to actually write data from different partitions. It's almost like a single long append log containing writes from all kinds of partitions [inaudible 00:28:28] and committed to it synchronously. So because of that, there's head-of-line blocking there, where you're waiting for every write to come back before you move on to the next.
And the ledger disk, on the other hand, was more asynchronous, but as you would see, asynchronous means you have an opportunity to batch a bit more and extract more from the disk. But yeah, we noticed that Pulsar's limit was completely based on the single-disk throughput. And actually, just before that, RabbitMQ was CPU bottlenecked, which means we didn't even see RabbitMQ come close to the disk limits; it capped out at about 34 megabytes per second. So given all of these constraints, where we started seeing bottlenecks of one kind or the other, we didn't really feel it fair to measure latencies at these bottlenecks. As I was saying, it's basically the worst time to measure latency when the system is overworked.
It didn't really make sense for us to focus on that. So then we decided to take a few steps back, reduce the throughput a bit more, and look at the throughput at which none of the systems looked like they were being overworked. And that's kind of where Kafka was at about, I would say, 300 megabytes per second, Pulsar was at about 200 megabytes per second, and RabbitMQ was at about 30 megabytes per second.
Yeah, exactly. 30. The thing with RabbitMQ, to be fair, is it was never built to push a lot of throughput, the primary focus RabbitMQ has is around latency, low latency.
Yeah, exactly. And, in fact, in the latency tests that we've done so far, we noticed that RabbitMQ does really, really, really well, almost achieving p99s of one millisecond at that load.
Okay, so it really dominates in latency and the trade-off there is it's going to cap out in order of magnitude sooner, in terms of throughput.
That's correct. That's correct. And yeah, so that's where we noticed that, yeah, maybe Kafka was kind of middle-of-the-road on latency there: about five milliseconds, I would say, at p99, and 15 milliseconds at p99.9. Yeah, so it was a good result for Kafka, but yeah, it could definitely do better with all kinds of configuration changes that we can do in the future or, in general, maybe stuff that we can focus on for improvements in replication. But yeah, hands down, RabbitMQ still does significantly better than the other two platforms.
In latency. And, by the way, dear listener, if you're new to discussions of benchmarks or if you've seen this p99, p95 sort of thing, what that means is, 99% of the events, whatever the thing is that you're measuring, in this case messages, had a latency of, say, five milliseconds or less. If your p99 is five milliseconds, that means 99% of the things were that number or less. And so, you can get a p95, because just the probability distribution of latency is such that you can get a p95 of an even smaller number. You can get a p50 or what you'd call the median of a smaller number than that. And so, as you increase the percentile, the number of messages that you want to be that or faster, you're capturing more and more of your outlying slow events. And I think, you just said, Alok, if you go to 99.9, then it goes up to 15 milliseconds. Well, because there are outliers in there and you're going to start to capture those.
But p99 is a typical one. It's pretty aggressive. You were saying, 99% of our things are this or faster.
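For listeners who want to see that definition in code, here is a small, self-contained sketch. It uses a simple nearest-rank percentile on synthetic data, not the exact estimator any of these benchmarks use:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value at or above
    which p% of the samples fall."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

random.seed(42)
# Synthetic latencies: mostly fast, with a long tail of slow outliers,
# which is the typical shape of latency distributions.
latencies_ms = [random.expovariate(1 / 5) for _ in range(10_000)]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):.1f} ms")
```

Running this shows exactly the pattern Tim describes: the median is small, p99 captures most of the tail, and p99.9 jumps higher still because it includes the rare outliers.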
Exactly, exactly. That was the key part that we wanted to measure. It's basically easy to lie by using average as your statistic, right? Which is, it's never representative of your true experience. So yeah, that's why we decided to focus on these percentiles and make sure we had some kind of a guarantee or at least some kind of an agreement on, "Okay, where do we want to focus on, in terms of what the experience will be for latency?"
Right, right. And, usually, these are going to be normally distributed, there's going to be a classical Gaussian bell curve to these latencies. But if you just communicate an average, there's another number and that's the fatness of that bell, the standard deviation, how far out is one standard deviation. With just an average, that's nice, but if you don't know the standard deviation you don't have that. And so, rather than communicate those two things and say, "Well, now you have to remember in statistics when you studied the normal distribution. Go do some calculus and you can figure this out." The p99 is just a good kind of standard, and, frankly, perfectly honest. Any vendor, any open source project, anyone doing latency benchmarks is likely to communicate in terms of that metric and it's a pretty fair way to compare.
That's right. Yep.
Here's a devil's advocate question, getting back to Tim, the skeptic of benchmarks.
You can only do this by picking the same hardware and running well-configured software on the same hardware with the same workload; that's the only fair way to do this. But say you weren't an impartial scientist here, and you wanted to make Pulsar look good. Is there a different instance type you could've picked? What would you do there?
Yeah. I think it's actually easy to look at what that bottleneck was, right? Which is, the single journal disk. We definitely could've picked an instance which offers one faster disk and one slow disk. Funnily, though, no cloud provider does that. They either offer the same performance for all disks or-
You're saying that and I'm thinking in the back of my mind, "Can you get that? I didn't think you could get that."
[crosstalk 00:34:54] And I always assume it's something I don't know. There's lots of things I don't know, but right, okay.
Yeah, there is, of course, a way to do that, right? Which is basically to use a service like EBS. But, unfortunately, with EBS you don't really get the same kind of guarantees that an SSD can give you, primarily because it's a service, it's not a disk, right? So you have issues with latency then. It's almost like you're trying to make a trade-off: you go with EBS, which, by the way, would also be more expensive, considering you would have to provision more IOPS on EBS to get the same performance. So you make the trade-off where you pay more money, you're okay with spikes in latency, which means your p99 starts to look bad, and then you say, "Okay, now I can push as much as I want on those disks." That's the challenge here: for Pulsar, you could make it look good, but you would need something really, really specific to optimize that journal write path. So yeah, that's what I would say.
Gotcha. And that journal and ledger split thing makes all the sense in the world when you think about BookKeeper as a storage system. You have to do that. It's like a database at that point. There's a commit log and then you take from there and build the stuff that is your long-term storage and do the quorum writes, which, if you're going to be a quorum system, then some of those might be higher latency writes than others, because they might be in remote regions or whatever, that's how that kind of a system works.
So having that journal on the front end of it is sensible. Just talking about it, it might sound like a really bad thing, but it's a perfectly sensible thing, and it's a negative consequence of the positive utilities that you're getting out of the back end of that storage system. It's designed to be distributed in this particular way and, therefore, the trade-off is we've got this journal on the front end of it. I mean, again, this is a Kafka podcast and I'm a guy who works for Confluent, so we don't need to pretend about what my incentives are, but I don't want that to sound like some obvious criticism of Pulsar. It's a different set of trade-offs and you take advantage of other utilities in the system in exchange for that trade-off. But in terms of peak throughput, well yeah, there you are. That's going to be a limiting factor.
And, at that point, man, this stuff gets subtle. You have to know a lot about these systems to understand the benchmarks and to make the trade-offs and you have to make the decision. Are the actual advantages of that quorum write, distributed storage backend, are those utilities things that I need to maximize at the expense of minimizing other utilities in the system? And, of course, if you're making a decision about what event streaming system to adopt, based on performance benchmarks, when the pandemic's over and we can go out for coffee, I want to go out for coffee with you and just talk about the way you make decisions in your life, because there's more to the story than that. However-
Yeah, that's fair.
... we're talking about benchmarks today, so we can limit our discussion to that.
Yeah. No, that's fair. Sorry, the point that you brought up, which is fair, right? Which is, with a lot of these systems, there are a lot of nuances. I mean, as I was saying with RabbitMQ, the number of configurations, or rather, the ways you can configure RabbitMQ, it's just crazy. It's got so many different knobs that you can play with. And, of course, each of those has its own use case; it's not fair to say that they're not really relevant. But you have to be aware of what each of those options, configurations, can do. Primarily, if you look at the architecture for RabbitMQ, you're looking at exchanges and queues. So they're kind of a different concept from what a typical event streaming platform like Kafka or Pulsar has. Kafka and Pulsar keep it simple: they have a topic, they have a producer, they have a consumer that's interested in that topic, right?
Now, with RabbitMQ, on the other hand, you have exchanges that determine how you route messages, and then you have queues that are bound to these exchanges that determine who consumes from which exchange. Now, depending on what you choose for these parameters, the performance could differ by as much as 50x, so it's huge. The range is just insane.
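To make the exchange/queue idea concrete, here is a toy model of RabbitMQ-style routing in plain Python. This is purely an illustration of the concept Alok describes, not the RabbitMQ client API, and the real broker's semantics (topic exchanges, headers, acknowledgments) are much richer.

```python
# Toy model of RabbitMQ-style routing: an exchange decides which
# bound queues receive each published message.
class Exchange:
    def __init__(self, ex_type):
        self.ex_type = ex_type     # "direct" or "fanout"
        self.bindings = []         # (binding_key, queue) pairs

    def bind(self, queue, binding_key=""):
        self.bindings.append((binding_key, queue))

    def publish(self, routing_key, message):
        for binding_key, queue in self.bindings:
            if self.ex_type == "fanout":
                queue.append(message)   # every bound queue gets a copy
            elif self.ex_type == "direct" and binding_key == routing_key:
                queue.append(message)   # only exact key matches

orders, audit = [], []

direct = Exchange("direct")
direct.bind(orders, "order.created")
direct.publish("order.created", "o-1")  # routed to the orders queue
direct.publish("order.deleted", "o-2")  # dropped: no matching binding

fanout = Exchange("fanout")
fanout.bind(orders)
fanout.bind(audit)
fanout.publish("ignored", "broadcast")  # copied to both queues

print(orders)  # ['o-1', 'broadcast']
print(audit)   # ['broadcast']
```

Even in this tiny sketch you can see how the choice of exchange type changes how many copies of each message the broker has to deliver, which is one reason different configurations can have wildly different performance.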
Got it. Got it. What that means is there are a lot of ways to run the benchmark wrong, and even the most right, most fair, most objective benchmark you can think of is probably then open to criticism-
... after the fact of well, "You set that up, so that product X would look like it sucks."
Which, by the way, after you write a blog post like this, I mean, that's your life for a little while. [crosstalk 00:40:14]
I started with saying, I think you're courageous.
Yeah, it is. I'm in a lot of Twitter flame wars and stuff like that [inaudible 00:40:21].
Oh yeah. That's like, "Hey, let me go talk about what I think about Fiat currency. I'm just going to go get on Twitter and talk about that. That's fine. I'm sure nobody's going to mind."
"It'll be great and we'll all be friends." Yeah, it's not how I play it. But anyway, this actually is, as it were, scientifically necessary for us to try to understand this and, honestly, that is, to some degree, a broader community function, because when you do this, you do have to take that input and feedback and criticism on the methodology, and you have to really interact with it. You can't just say, "No, I know Pulsar better than you, Mr. Pulsar specialist, and this is how it works." You have to talk through that stuff.
That's right, that's right.
So the result of this benchmark, I think, if we've got time, I've got two more big questions.
The results don't make me feel like, "Oh wow, I'm embarrassed, we have some catching up to do." It seemed like it worked out pretty well, and again, there are many more factors to consider than just performance, but you have to consider performance. If you're running in the cloud, it's going to drive your operational costs long-term.
So that matters. Even so, what's going on in the world of important Kafka KIPs? Obviously, KIP-500 comes to the top of the list, but what else is there that will help move Kafka forward even more? What do you see coming up?
Yeah. I think it's a very relevant point, which is there are a lot of broader use cases where Kafka's performance might actually not be ideal. Particularly if, for example, you just increase the number of partitions in a topic and basically cause messages to be spread randomly across those partitions, I feel like that kind of distribution starts to hurt Kafka's performance a lot more. And we also have started to see [crosstalk 00:42:32]-
Because this was one topic, 100 partitions, you said, right?
Exactly, exactly. So we could definitely do better with larger numbers of partitions and that, to be fair, is something that Pulsar can handle very well. We have seen that in the past, we've seen that in their own benchmarks, and we do acknowledge that fact. And, particularly, the whole exercise of measuring the performance was also a way for us to understand what the bottlenecks would be for Kafka, in general. And we do see that, as you were saying, KIP-500 is one of those things which, in the control plane, not particularly in the data plane, starts to improve scalability for Kafka. Replacing the ZooKeeper layer, which is what KIP-500 does, with a Raft consensus protocol allows a broker to support a lot more partitions, versus limiting it to a smaller number just because of ZooKeeper's scalability limits.
That's the primary focus on the control plane side, but once we deliver that, we now have to start looking at the data plane side, how we can start improving things. And this is where it's actually a lot more exciting... it's like if you drew a decision tree, there are just way too many paths to go through, primarily because there are a lot of ways in which you can actually improve and focus on performance. But if you think of what the future for computing is, what the future for storage is, you start to see a lot more cores that are single [inaudible 00:44:13]. And then a lot more, or rather, a lot bigger cores that are single [inaudible 00:44:17]. And you also start to see these storage trends that are starting to use SSDs, not necessarily NVMe, but SSDs, and we start looking at the advantages an SSD can offer, which is that it does a lot better for random writes than a typical hard drive does, right?
Keeping all of this in mind, we actually think that our bottlenecks start to show up in replication, especially because Kafka's replication is pull-based, not push-based, so it adds a significant amount of overhead, especially when you have a large number of partitions that you're replicating between brokers. And because it's fully replicated across all machines, it's kind of a system where everyone's pulling data from the partition leaders and, yeah, you start to see these replication bottlenecks. That's the other area that we're looking at and trying to focus on, maybe in 2021, and hopefully deliver some solid improvements there.
So, improvements in replication scale. It sounds like, to your point, KIP-500 is a control plane improvement, but that control plane improvement exposes us to different scale characteristics in the kinds of clusters that will exist in the world. Once having an ultra-high partition cluster isn't a bad thing, there will be more high-partition clusters, and that's going to expose some of these other growth edges that the team can attack and optimize. Because of those, I don't want to call them use cases, but sure, use cases that have profoundly large partition counts, you'll start to feel the need to do that optimizing that you don't feel right now.
That's right, that's right. As you said, it's not necessarily something that's out there in the wild. We haven't really seen customers come to us and complain that, "Hey, you don't really support this many partitions and Kafka's performance is horrible," or whatever, right? We haven't really seen a lot of data points, but what we do notice is, because of improvements in supporting a larger number of partitions per broker, your cost tends to come down to run the same kind of workload. From a cost and efficiency perspective, it is still a big win, and that's kind of why this is something that we still concentrate on.
Final question, what do you think has been the most honest critique that you've received, and how did you respond to it?
Let me think about it. I think, in a lot of ways, the Pulsar community got back to us and they said, "Hey..." we kind of went back and forth, because they suggested using a faster disk for the journal. As I said, that's something that's not standard. But I think the honest critique was the fact that we should've tried a larger number of topics, maybe started mixing workloads a lot more. Primarily, we were looking at a workload where the producer produces messages and the consumer is constantly just consuming them. So there is no opportunity for these messages to build up, where you then lag behind in the consumer and pull messages from way back in time, right? Then you start hitting the disk and a lot more of those effects come into the picture, which is basically what they call the catch-up consumer test. So that, I think, was the biggest critique that I found very relevant and that, in fact, could help us introspect into Kafka's performance.
Well, I like the fact that there's been interaction there, and I hope that can continue to go.
And, as a broader community, we can help refine the way that we study this stuff and improve our tools.
Yeah, definitely. I think that's very key here. You're right. I think event streaming and event-driven architecture is, obviously right now, a very key component in a lot of companies and businesses. I think the findings that the community has, in general, are going to be very relevant from one product to another. And yeah, you're right, it's just there is good cooperation, and then the fact that there are enough checks and balances that one doesn't run away and claim things that are not true. And yeah, I think that's been a pretty good interaction, so far.
My guest today has been Alok Nikhil. Alok, thanks for being a part of Streaming Audio.
Thanks for having me, Tim. It was a nice experience to discuss the blog post; it's not just the words out there now, there's also some voice to it.
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST. That's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value will be forfeited on the expiration date, and there are a limited number of codes available, so don't miss out.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes, if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there, that helps other people discover us, which we think is a good thing. Thanks for your support and we'll see you next time.
Coming up with an honest test built on open source tools in an easily documented, replicable environment for a distributed system like Apache Kafka® is not simple. Alok Nikhil (Cloud Native Engineer, Confluent) shares about getting Kafka in the cloud and how best to leverage Confluent Cloud for high performance and scalability.
His blog post “Benchmarking Apache Kafka, Apache Pulsar, and RabbitMQ: Which is the Fastest?” discusses how Confluent tested Kafka’s performance on the latest cloud hardware using research-based methods to answer this question.
Alok and Tim talk through the vendor-neutral framework OpenMessaging Benchmark used for the tests, which is Pulsar’s standardized benchmarking framework for event streaming workloads. Alok and his co-author Vinoth Chandar helped improve that framework, evaluated messaging systems in the event streaming space like RabbitMQ, and talked about improvements to those existing platforms.
Later in this episode, Alok shares what he believes would help move Kafka forward and what he predicts to come soon, like KIP-500, the removal of ZooKeeper dependency in Kafka.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.