March 15, 2022 | Episode 204

Handling 2 Million Apache Kafka Messages Per Second at Honeycomb


Kris Jenkins: (00:00)

Over a million events a second for a hundred enterprise clients. How do you get to that size? Today we'll be hearing a scaling war story from Liz Fong-Jones of Honeycomb. But before we get there, Streaming Audio is brought to you by developer.confluent.io, the one place to learn everything you need to know about Kafka, whether you are getting started or, yes, trying to scale through those performance issues. We'll give you all the information you need to know, and we'll also teach you how to use Kafka and take you through some step-by-step courses. If you want to use those and you sign up to Confluent Cloud, do use the code PODCAST100. It'll get you $100 of extra free credit so you can take that course a little slower or go a little larger. But before then, please enjoy today's episode of Streaming Audio. Welcome to Streaming Audio. My guest today is Liz Fong-Jones, who is a developer advocate at Honeycomb.io, and has been for the past three years. Is that right, Liz?

Liz Fong-Jones: (01:04)

Just about coming up on three years in February.

Kris Jenkins: (01:07)

Coming on three years. So what is Honeycomb? What does being a developer advocate at Honeycomb actually mean?

Liz Fong-Jones: (01:16)

So Honeycomb is a company that helps our clients achieve observability. And we do that by ingesting telemetry data from their systems, and we help process that and surface it so that people can get real time insights into why is my site slow, which users are experiencing downtime, and help people correct that as quickly as possible. And as a developer advocate, my mission is to help make Honeycomb the best it can be for our developer audience. And that means everything from working on feature development, to backend infrastructure, to helping educate the world about what we're doing.

Kris Jenkins: (01:55)

Right. Okay. That seems to cover a lot of bases. So if I'm using Honeycomb, I'm sending telemetry from my web server, my database and from my blockchain or whatever and letting you worry about showing it to me in a useful way, right?

Liz Fong-Jones: (02:12)

Yes. That's precisely correct. We have SDKs like OpenTelemetry that allow people to generate that telemetry data in the form of traces and trace spans and metrics, and they can send it all over to us, and then we handle reliably ingesting it and storing it and querying it.

Kris Jenkins: (02:30)

Ah, well, that's the question, isn't it? Reliably. Tell me a little about-

Liz Fong-Jones: (02:34)

Very important.

Kris Jenkins: (02:35)

... About your architecture. Because I know it's more than I would want to put together myself.

Liz Fong-Jones: (02:41)

Yes. So there certainly are ways that you could put together a first pass of an observability platform. That first pass might look like having a Jaeger instance that is collecting the data and sending it on to a Cassandra instance or something like that. That's one possible option, but at Honeycomb we operate at a much larger scale than that would be feasible for. So we have a dedicated fleet of ingest workers that handle making sure you're authorized to send us data. And then we slice and dice that data into individual events. And then we take each individual event and we send it off to Kafka for durable, reliable serialization and ordering. And then on the other end of Kafka, we've got a fleet of consumers that are basically reading off the queue in order to break those events down into their constituent columns.

Liz Fong-Jones: (03:36)

So if you have a field that's called user agent, or a field that's called IP address, or a field that's called trace ID or span ID, each of those things from each event coming in order off the stream is appended to a file per attribute or per field. And that enables us to group things together that might share common properties. You might see common compression properties that you can use to make the user agent values compact down a lot. Or you might see that almost every single value of "cookies turned on" is set to yes. Those are things that you can do much more effectively when you are dealing with all the data from one column at one time.
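
To make the per-column layout described here concrete, below is a minimal sketch in Go. The event shape, field names, and file layout are illustrative assumptions, not Honeycomb's actual retriever code; a real column store would batch, encode, and index rather than append plain text.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Event is an illustrative wide event: field name -> value.
type Event map[string]string

// appendToColumnFiles appends every field value to a per-field file, so that
// all values of one attribute (user agent, trace ID, ...) sit together and
// compress well. This only demonstrates the grouping idea.
func appendToColumnFiles(dir string, events []Event) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	for _, ev := range events {
		for field, value := range ev {
			f, err := os.OpenFile(filepath.Join(dir, field+".col"),
				os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
			if err != nil {
				return err
			}
			_, werr := fmt.Fprintln(f, value)
			cerr := f.Close()
			if werr != nil {
				return werr
			}
			if cerr != nil {
				return cerr
			}
		}
	}
	return nil
}

func main() {
	events := []Event{
		{"user_agent": "curl/8.0", "trace_id": "abc123", "cookies_enabled": "yes"},
		{"user_agent": "curl/8.0", "trace_id": "def456", "cookies_enabled": "yes"},
	}
	if err := appendToColumnFiles("columns", events); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```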

Kris Jenkins: (04:25)

What's the database there?

Liz Fong-Jones: (04:28)

So our database here is a homegrown solution. It is a column store that is similar to Facebook's Scuba or the Google Dremel/ColumnIO backend. So basically we are slicing and dicing the data, we're storing it and shipping it into flat files, and those flat files make their way over to S3 at some point. But we invented this back in 2016, 2017, when a lot of these technologies were not necessarily available open source, so we wound up building our own. Not necessarily the decision we'd make today, but that's the decision that we went with at the time in 2017.

Kris Jenkins: (05:04)

Yeah. It's funny, sometimes you make decisions that don't seem like they were that long ago, but five years can be a very long time in internet time. Right?

Liz Fong-Jones: (05:14)

Oh, very much so. And especially in COVID time.

Kris Jenkins: (05:18)

Yeah, absolutely. I've been writing something myself, reflecting back on a business that we built I think in 2012, it was, and in hindsight we would've used Kafka, but I'm not sure Kafka was usable in 2012 in the way we would've needed it today. Right?

Liz Fong-Jones: (05:35)

Right. Or people ask me today, "Well, doesn't Pulsar handle what you do?" And I'm like, "Yes. I mean, maybe, but also we are in the world of having started this complex system in 2017 and maintaining the continuity of it at 99.99% reliability ever since."

Kris Jenkins: (05:53)

Ah, there's the rub. That reliability question, and you've grown up. Right?

Liz Fong-Jones: (06:00)

Yeah. We've grown a lot and our customer demands have increased. It doesn't suffice to drop customer data on the floor. Our customers are relying upon us to get that telemetry data in. That's why they're paying us. So they don't have to worry about where that telemetry gets sent. And for the most part, we only get one chance to receive it. If we drop that batch of events it's gone forever. There's a crater in that client's graph forever.

Kris Jenkins: (06:23)

Yeah. Absolutely. So as data is coming in, does it go straight into Kafka or is there a step before Kafka?

Liz Fong-Jones: (06:32)

We obviously have these ingest workers that are handling validation. Is your API key valid? Are you over quota? Repacking things from whatever wire format we get, whether it be Zstandard-compressed JSON, gzip-compressed JSON, or a gRPC protocol buffer. So we are validating these payloads, and then we are passing them on via a Kafka producer, using Zstandard compression to get the best bang for the buck and serializing everything in a consistent format. The other interesting thing is there is one Kafka topic at Honeycomb. There's one producer and two consumers. One producer binary, two potential consumers. That's it. We are using Kafka at very high throughput, especially very high throughput per partition, but we're not actually using Kafka as a general purpose data bus. And I know that a lot of Confluent's other customers love using Kafka as a generic data bus, but that's not our particular use case, which I think is a definite case where we are different than a lot of the mainstream.
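
As a rough sketch of that ingest-to-Kafka hop, here is what the producer side could look like with the Sarama Go client (github.com/IBM/sarama). The topic name, dataset key, and validation stub are placeholders; this shows the shape of a single-topic, Zstandard-compressed producer path under those assumptions, not Honeycomb's actual code.

```go
package main

import (
	"errors"
	"log"

	"github.com/IBM/sarama"
)

// validate is a stand-in for the real checks (API key, quota, payload shape).
func validate(apiKey string, payload []byte) error {
	if apiKey == "" || len(payload) == 0 {
		return errors.New("rejecting event: missing API key or empty payload")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0                     // zstd needs Kafka protocol 2.1+
	cfg.Producer.Compression = sarama.CompressionZSTD // compress batches with Zstandard
	cfg.Producer.RequiredAcks = sarama.WaitForAll     // don't lose the one copy we get
	cfg.Producer.Return.Successes = true              // required by SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	apiKey, payload := "example-key", []byte(`{"trace_id":"abc123","duration_ms":42}`)
	if err := validate(apiKey, payload); err != nil {
		log.Fatal(err)
	}

	// One wide topic; keying by dataset keeps one customer's events ordered
	// within a partition. The keying scheme is an assumption for the sketch.
	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "telemetry-events",
		Key:   sarama.StringEncoder("dataset-1"),
		Value: sarama.ByteEncoder(payload),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```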

Kris Jenkins: (07:43)

So it's really just one really wide topic?

Liz Fong-Jones: (07:46)

Mm-hmm (affirmative).

Kris Jenkins: (07:48)

How many partitions, do you know, out of interest?

Liz Fong-Jones: (07:50)

Kris Jenkins: (07:52)

That seems like a lot to me.

Liz Fong-Jones: (07:53)

Kris Jenkins: (07:54)

Okay.

Liz Fong-Jones: (07:55)

It's less than people recommend. Because people recommend that you have many partitions to let you load balance between them pretty fluidly, but we've got only 70 partitions for our volume.

Kris Jenkins: (08:08)

Okay. And what size of customer base is that serving?

Liz Fong-Jones: (08:16)

Yes. So we are serving hundreds of enterprise customers. Essentially, we think in the observability world in terms of number of events per second. We think about the number of trace spans that people are sending us per second. And this is actually something where I can use Honeycomb to [inaudible 00:08:40] itself. I can go look to see right now [crosstalk 00:08:43]-

Kris Jenkins: (08:42)

I was going to ask if you ate your own dog food on this.

Liz Fong-Jones: (08:45)

Oh yes, we do. So I can actually go in and look right now and see how many messages per second our Kafka is processing. And the answer is our Kafka production cluster is processing, right now as we speak, about 1.4 million messages per second.

Kris Jenkins: (09:05)

1.4 million a second. That's real-time data in action. Love it. Okay, but it was a journey to get there. It took you a while to reach that scale. So take me through some of the steps that actually got you to 1.4 million messages a second.

Liz Fong-Jones: (09:21)

Yeah. So what happened was, in 2019 we were a lot smaller of a company. We had only 25 employees. You could count the number of enterprise customers that we had on two hands. And that was a very different time, because it meant that a lot of our cost was paid in terms of ongoing overhead of infrastructure. It was relatively easy to scale out, but once you scaled out, you couldn't scale back in. So we were trying to ride at the edge of what we could, knowing that if we ratcheted it up, we weren't going to be able to ratchet it back down. You can't ratchet down the number of customer partitions. In our case, once you start sending data to a partition, once you have a consumer that's dedicated to reading from that partition, scaling that back in doesn't quite work as well for us.

Kris Jenkins: (10:18)

Yeah. And you don't want the number of customers to decrease.

Liz Fong-Jones: (10:24)

So [crosstalk 00:10:24] was-

Kris Jenkins: (10:24)

Sorry, you don't want the number of customers to decrease, but it could happen. Right?

Liz Fong-Jones: (10:26)

Yeah, exactly. So fixed costs and quickened rescaling were the way that we were approaching things. We just scaled in these increments as we needed them. But also, this was a painful world, in that something like one fifth or one tenth of our company's burn rate was just our Amazon bill. So that 10 to 20% mark was a little terrifying, especially as the cost kept on going up and up the more data volume we went through.

Kris Jenkins: (11:00)

Was it linear? Was the price going up roughly with the number of customers or was it worse than that?

Liz Fong-Jones: (11:06)

Yeah. It was going up with the number of customers, but also there was this tension of, we know that even if we are getting more revenue per customer, having that burn rate go up linearly at the same time, that's a little terrifying when you're a startup and your runway is measured in months.

Kris Jenkins: (11:25)

Yeah. That's true. So give me some technical details. What did you do to start solving that? What's the first step?

Liz Fong-Jones: (11:34)

So the first thing that we knew that we needed to do was stop paying Amazon for redundancy that we already had inside of our Kafka cluster. The idea was that we'd stay on EBS volumes and C5 instances, and just go to a replication factor of two rather than three. That was the first cost-cutting measure we talked about. But we had some chats with Amazon, and we were like, "Hey, Amazon, we're considering going from three to two. There is a risk that if this blows up in our faces, we're going to get mad at you." And they were like, "No, don't do that. Stay at replication factor three. We'd rather work with you to figure out what's needed to make that possible."

Liz Fong-Jones: (12:23)

So with that, basically we evaluated everything, from how much data we were transmitting across availability zone boundaries. Ideally you should only be transmitting across once per replica; you shouldn't be reading from multiple different availability zones. That was one recommendation they gave us. But then the other interesting thing was we realized that with Elastic Block Store, there is no commitment discount for EBS. If you have a customer volume that is one terabyte, you are going to be paying, that's what, 10 cents per gigabyte, so $100 per month per terabyte. Right?

Kris Jenkins: (13:03)

Right. So there's only a discount for purchasing instances, not for storage?

Liz Fong-Jones: (13:11)

Yes. There is only a discount on buying whole instances. And I don't even think AWS S3 has a volume discount, at least not listed on the website. So it was this challenge of, okay, if we know that we already have a replication factor of three and the replicas are scattered across different availability zones, do we really need to be paying Amazon for the ability to attach and detach volumes from instances, or to survive the loss of an individual instance? EBS has its own redundancy mechanisms.

Kris Jenkins: (13:43)

Yeah.

Liz Fong-Jones: (13:44)

So what we realized was we could get cheaper storage, and get the compute essentially for free, if we switched from running C5 instances and EBS to doing a reserved instance purchase of i3en instances. Those are the high-storage, local-SSD instances where, if the instance is terminated, you lose the local storage. But you can make a commitment. There is a volume pricing, a commitment pricing to it, where if you say, "I'm going to need six i3en instances for the next 12 months," they'll give you a discounted rate, which is actually, I think, 20% lower than just the cost of an EBS volume with the equivalent amount of SSD.

Kris Jenkins: (14:29)

Really?

Liz Fong-Jones: (14:31)

Yeah.

Kris Jenkins: (14:31)

If you can do your own recovery situation, then you've got-

Liz Fong-Jones: (14:35)

Yeah. Which Kafka provides for.

Kris Jenkins: (14:37)

It's cheaper processing and temporary storage than just the storage?

Liz Fong-Jones: (14:43)

Yes.

Kris Jenkins: (14:44)

Okay. That must have been a happy realization.

Liz Fong-Jones: (14:48)

That was a very happy realization that we had, for sure: that there was a way to limit and constrain the cost. The downside is that you lose the flexibility of EBS, where you can say, "Hey, I need more storage, increase the storage. Let's ratchet up the storage meter," and it'll just let you extend the file system and everything's fine. In this case, if you need more storage, you need more brokers. And then the flip side to this was actually this interesting waste: we were buying 20, 30 of these instances and we were barely touching the CPU on them. We were just using them for the local disk.

Kris Jenkins: (15:27)

Yeah. So did you find a way to use it or is it just sitting there idle now?

Liz Fong-Jones: (15:34)

Nope. We just let it sit there idle. We were very paranoid that if you just loaded up your Kafka cluster with unrelated workloads, it would hurt your latency and thereby impact reliability for users. If someone sends us a payload of data, they expect that data to get ingested within five milliseconds per event that they send us. Otherwise it starts buffering and buffering in RAM on their hosts rather than getting shipped off to us. So we knew we wanted to keep that latency very low. Yeah, that was the first step we took: "Hey, let's pay less for the storage."

Kris Jenkins: (16:18)

But do you have a problem if they suddenly go down that you lose data? Or is there a recovery story there?

Liz Fong-Jones: (16:23)

Yes. So this is part of what we continuously test at Honeycomb. We continuously test our assumptions about recovery. So for instance, we talked about the consumers, the retriever processes that read off the tail of Kafka and serialize it into the column store. We kill one of those processes every week. We completely delete the instance and test our automation to make sure it provisions a new one that's able to bootstrap, that's able to replicate data and start catching up based off of where it was in Kafka. So we started doing the same thing to our Kafka instances. We would kill one per week and verify our mechanisms for re-replicating the under-replicated partitions and replacing the broker. We started getting into a cadence of regular broker replacements, just to validate that if Amazon makes one of our instances go away, then it's not going to be an emergency.
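
The weekly broker kill is mostly process, but a minimal version of that automation could look like the sketch below, using the AWS SDK for Go v2. The tag names and the "terminate the oldest running broker" policy are assumptions for illustration, not Honeycomb's actual tooling.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	awsconfig "github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := awsconfig.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Find running instances tagged as Kafka brokers (the tag is hypothetical).
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		Filters: []types.Filter{
			{Name: aws.String("tag:role"), Values: []string{"kafka-broker"}},
			{Name: aws.String("instance-state-name"), Values: []string{"running"}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Pick the oldest broker so the whole fleet is eventually exercised.
	var victim *types.Instance
	for _, r := range out.Reservations {
		for i := range r.Instances {
			inst := r.Instances[i]
			if victim == nil || inst.LaunchTime.Before(*victim.LaunchTime) {
				victim = &inst
			}
		}
	}
	if victim == nil {
		log.Fatal("no broker instances found")
	}

	// Terminate it and let provisioning automation replace it; Kafka
	// re-replicates the partitions that the dead broker hosted.
	_, err = client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{*victim.InstanceId},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("terminated broker %s", *victim.InstanceId)
}
```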

Kris Jenkins: (17:19)

Okay. So you've got recoverability sorted, you've got this cheaper workaround for storage. Can I call it a workaround?

Liz Fong-Jones: (17:30)

Oh yeah. You can call it a workaround.

Kris Jenkins: (17:33)

What's the next step to reaching your present day scale?

Liz Fong-Jones: (17:39)

Yeah. So I think the next step that we took here was, at this point, I mentioned Zstandard earlier in this conversation. So we were starting to use Zstandard to transmit from our clients to us in some cases, but we were still using Snap-

Kris Jenkins: (17:56)

That is the compression protocol, right?

Liz Fong-Jones: (17:56)

Yeah. That's the compression protocol. So we were using Snappy from the Kafka producer to the broker to the consumer. But that was primarily because, while we knew we wanted to use Zstandard, for our client library, because we are a Go shop, we are not quite as happy to use a C++ Kafka client. I know Confluent has put a lot of effort into the C++ client, the high-performance client, but we use the [inaudible 00:18:29] native Go client. So it took a while for the [SRO 00:18:31] native Go client to have Zstandard support. Once they had it, we adopted Zstandard, and basically that instantly cut 25% of the bytes that we were storing on disk, 25% of bandwidth. It was this massive win to be able to send the same amount of incoming data through, but have it compress down to 25% less than it was before.

Kris Jenkins: (18:54)

Was that just because it was a native form? I mean, why do you think it made such a difference? Is it a better suited algorithm?

Liz Fong-Jones: (19:04)

Yeah. For our particular workload, it just delivered much better compression. We had many repeated strings and other things that the compression algorithm was able to deal with more effectively.

Kris Jenkins: (19:16)

Right. Okay. Yeah. That makes a lot of sense. Repeated strings are ripe for the right algorithm to chew through. Right?

Liz Fong-Jones: (19:25)

Yeah, exactly. Definitely makes it a lot easier. If you get the batching right. If you can't compress 10 messages at a time, you might have to compress a hundred messages at a time, but eventually you find those redundancies that are coming through.

Kris Jenkins: (19:39)

Right. Yeah. And they're worth finding at your scale.

Liz Fong-Jones: (19:43)

Yeah.

Kris Jenkins: (19:44)

Okay.

Liz Fong-Jones: (19:45)

But the challenge that eventually we hit was that even when you compress, you are still, in this circumstance, running on a fixed number of brokers. You have a relatively finite number of brokers, each disk on those brokers has a finite amount of space, and we were watching that margin of what we could keep without filling the disks on our brokers, because we are doing byte-based retention policies. We found that margin started to erode. We previously were able to keep 36 hours, 48 hours of data around, but it suddenly got to be less than 24 hours, and we were like, "Uh oh, this is a problem. This is not good."
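
Byte-based retention in Kafka is the per-partition topic config retention.bytes. Here is a sketch of setting it with Sarama's admin client; the topic name, partition count, and sizes are made up for illustration, not Honeycomb's real numbers.

```go
package main

import (
	"log"
	"strconv"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0

	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// retention.bytes is enforced per partition, so divide the disk budget
	// across the partition count and leave headroom so brokers never fill up.
	const partitions = 70
	const totalBudgetBytes int64 = 40 << 40 // pretend ~40 TiB across the topic
	perPartition := strconv.FormatInt(totalBudgetBytes/partitions, 10)

	err = admin.AlterConfig(sarama.TopicResource, "telemetry-events",
		map[string]*string{"retention.bytes": &perPartition}, false)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("set retention.bytes=%s per partition", perPartition)
}
```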

Kris Jenkins: (20:26)

Yeah. You can easily have a problem that lasts longer than 24 hours if you're having a bad day. That's all it takes.

Liz Fong-Jones: (20:32)

Right. It's not that it impacts our day-to-day operations. Under normal circumstances we're only replaying the last hour of data, two hours of data at most. But if something goes really wrong, like let's suppose we have a data-eating bug in our storage engine, we need to be able to go back and replay that older data to be able to reconstruct it. And if there's not enough of that window left around, because we discover a bug on Saturday morning and we realize that this has been a problem since Thursday or Friday, and let's say it takes us until Sunday to find the bug and fix the bug, you have to have that 48 to 72 hours of refill buffer, or else you have the possibility that you don't have a backup plan. You don't have a recovery plan.

Kris Jenkins: (21:16)

Yeah.

Liz Fong-Jones: (21:19)

So for us, Kafka was our recovery plan, but as we were talking about, we only read the most recent hour or two of data under normal circumstances. So it was like, "Huh, this is interesting." This is almost the worst of all worlds, where we are paying to store, on very hot local storage, 20 hours of data. Which is the wrong amount of data for long-term recovery, and it's the wrong amount of data for our daily operational needs. It's terrible. Right?

Kris Jenkins: (21:48)

Right. Yeah. I see. So, why are you not just reconstructing entirely from Kafka? That feels like the first strategy I'd go with every time you need to recover.

Liz Fong-Jones: (22:00)

Sorry. I'm not sure that I follow that.

Kris Jenkins: (22:03)

Sorry. So when-

Liz Fong-Jones: (22:08)

The goal is to be able to recover from Kafka. To recover data from Kafka up to 48 hours old. Again, we recognize that Kafka is a streaming solution, but at the same time, we don't necessarily want to use something like [T-SQL 00:22:23], because we're hoping to look at 60 days of data. That's why we digest it into the column store and why we export it off to S3. The question is, if you have questions about the quality of that data that's on your local disk or on your S3, then you have to go back in time and you have to replay Kafka forwards. But if the data's not there in Kafka, you can't reconstruct the data forward. So that was the central problem.

Kris Jenkins: (22:49)

Right. So, I can't quite see how you then squared that circle. What did you do? You've got this exactly wrong amount of recovery window. Where did you go [crosstalk 00:23:02] to increase-

Liz Fong-Jones: (23:03)

Yeah. We had the wrong amount of recovery buffer, and we were overpaying for it. So we looked at what it would've taken to increase the capacity. And that turns out to be, as we were talking about at the start of this, the linear scaling cost. You are paying for more instances, for more storage. And that just didn't add up. That didn't make sense for us. Especially given that we knew that, under normal circumstances, we wouldn't be reading from this. It was an insurance policy, and that is an awfully expensive insurance policy that's costing, you know, hundreds of thousands or millions of dollars per year. That's not worth it.

Liz Fong-Jones: (23:37)

So what we settled upon was, we actually had seen blog posts from Confluent saying that you all were working on a solution to this problem: what happens if you have data that you need to retain for a longer period of time for archival purposes, but you don't necessarily need fast access to it? This was the Confluent Tiered Storage feature that appeared in Confluent Platform 6.0. So we started inquiring about it, because it seemed like a thing where, we had been long-time Confluent Community edition users, but this was the first thing that really made us think about whether we would see value out of paying for Confluent Platform.
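
For context, Tiered Storage is enabled on the brokers (server.properties settings such as confluent.tier.feature, confluent.tier.backend, and an S3 bucket) and can then be toggled per topic. Below is a hedged sketch of flipping the topic-level switches with the same Go admin client; the property names are taken from Confluent Platform's tiered storage documentation as best recalled here, so treat them as assumptions and verify them against your Confluent Platform version.

```go
package main

import (
	"log"
	"strconv"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_6_0_0

	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// Assumed Confluent Platform topic-level configs:
	//   confluent.tier.enable          - tier this topic's closed segments to object storage
	//   confluent.tier.local.hotset.ms - how long segments stay on local disk after tiering
	enable := "true"
	hotset := strconv.Itoa(2 * 60 * 60 * 1000) // keep roughly two hours hot locally

	err = admin.AlterConfig(sarama.TopicResource, "telemetry-events", map[string]*string{
		"confluent.tier.enable":          &enable,
		"confluent.tier.local.hotset.ms": &hotset,
	}, false)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("tiering enabled for telemetry-events")
}
```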

Kris Jenkins: (24:20)

Right. Yeah. The idea [crosstalk 00:24:22]-

Liz Fong-Jones: (24:22)

Yeah. We'd done these experiments before. We'd done these experiments of, "Hey, what happens if you store a portion of your data on EBS cold hard drive storage platters?" EBS offers you the option of getting cold hard drive platters, but we tried it and it didn't really work for us, because we were essentially rolling our own tiered storage. Rolling our own tiered storage, basically saying, "Hey, LVM, you are the storage system, go sort out which blocks are hot and which blocks are cold." [inaudible 00:24:58] has no knowledge of the file system, no knowledge of which files are hot, no knowledge of "this file is useless unless you have all of the blocks." It didn't know or understand any of that. So it did a really terrible job.

Kris Jenkins: (25:11)

It's got to understand your data model and your file system structure to actually do it properly. Right?

Liz Fong-Jones: (25:17)

Yeah, exactly. So we were seeing produce times that were in the hundreds of milliseconds at the [inaudible 00:25:24] percentile, and that was not acceptable to us. So we had to back out of that. We had tried it before, so we knew it was a challenging problem. And we knew that the right people to solve this were going to be people who are experts in Kafka, who understood how to tell the Kafka broker, this is how you request cold data, this is what to do with the data once it's sitting there.

Kris Jenkins: (25:42)

Yeah. You can absolutely say that's only going to work if it's a Kafka level concern. So how did it go?

Liz Fong-Jones: (25:54)

So we started doing a proof of concept of this in, trying to think about this, that would've been the end of 2020. We started doing the proof of concept of trying out tiered storage, seeing whether it would work for us, initially kicking the tires with, as we mentioned earlier, the dogfooding. So we have a dogfood environment, and it's roughly a one-third or one-quarter scale model of our production deployment. And that allowed us to gain some confidence in it: that it would tolerate broker death, and what the latencies were like if you fetch that historical data. And then we went into production in January of this year, of 2021, and started deploying it out at scale.

Kris Jenkins: (26:40)

Right. It occurs to me that your scale model if it's a quarter, must be a mere 350,000 messages a second, right?

Liz Fong-Jones: (26:50)

Yep. Something like that.

Kris Jenkins: (26:50)

Just a tiny little model.

Liz Fong-Jones: (26:51)

Nearly. Yeah.

Kris Jenkins: (26:51)

Oh, crikey.

Liz Fong-Jones: (26:58)

But it has this really cool property of it's not a fake staging environment. It's something that Honeycomb employees, we use it every day. And [crosstalk 00:27:08]-

Kris Jenkins: (27:07)

You've proved that already, right?

Liz Fong-Jones: (27:09)

... Many of the aspects of our system. Yeah.

Kris Jenkins: (27:12)

Yeah. Okay. I'm convinced. You've given me a demo already. Okay. So you roll out Confluent Platform tiered storage. How did it go?

Liz Fong-Jones: (27:23)

Yeah, it wasn't Confluent's fault, but it exploded in our face. And what happened was we tried to change too many variables at the same time. We were trying to do both tiered storage and a new class of Amazon instances at the same time, instances that had the brand new Nitro hardware and firmware hypervisor. In the course of this, we also were trying to go back onto EBS, because we were like, "You know what, actually we value the ability to twist that knob. We're only going to size the EBS appropriately for two or three hours of data." But like, "Let's try new instances, new tiered storage, new firmware. This is the newest generation of everything. Let's do it all at once."

Kris Jenkins: (28:13)

So an ambitious start to the year and-

Liz Fong-Jones: (28:18)

An ambitious start to the year that ended in way too much pain. So yeah. [crosstalk 00:28:22].

Kris Jenkins: (28:22)

That feels like a timely warning.

Liz Fong-Jones: (28:26)

Yeah. It was a timely warning. We also had made some suboptimal decisions around, I think we had an earlier conversation about, is it safe to turn on unclean leader elections? So we turned on unclean leader elections, and it had accidentally gotten set on the tiering topic, because, again dogfooding, all of Confluent's features use Kafka as the data store. So it turns out that when you have unclean leader elections on your tiering topic, that will cause the tiering system to go, "Oh God, something bad has happened. I'm going to fail safe by not tiering." So all of our disks started filling up, on top of all this chaos. It was-

Kris Jenkins: (29:15)

Oh God.

Liz Fong-Jones: (29:16)

... Yeah. It was a little painful. So we had to reset and be like, "Okay, let's set back to a bunch of conservative settings." We went back to the regular instance storage instances, just slightly larger instance storage instances. And it still has the cost savings, because we're running fewer of them. We're only running six and not 30 of these things. And saying, "Hey, okay. Unclean leader elections: bad idea. Don't do it." Like in The Emperor's New Groove, why do we even have that lever?
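
For reference, the setting in question is the topic and broker config unclean.leader.election.enable. Here is a small sketch of auditing it and forcing it off on a topic with the Sarama admin client; the topic name is a placeholder.

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0

	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// Audit the current value on the topic.
	entries, err := admin.DescribeConfig(sarama.ConfigResource{
		Type:        sarama.TopicResource,
		Name:        "telemetry-events",
		ConfigNames: []string{"unclean.leader.election.enable"},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		log.Printf("%s = %s (default: %v)", e.Name, e.Value, e.Default)
	}

	// Force it off: an unclean election can silently discard committed records.
	off := "false"
	err = admin.AlterConfig(sarama.TopicResource, "telemetry-events",
		map[string]*string{"unclean.leader.election.enable": &off}, false)
	if err != nil {
		log.Fatal(err)
	}
}
```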

Kris Jenkins: (29:56)

Yeah.

Liz Fong-Jones: (29:57)

So all these things where we're like, "Okay, back to basics. Let's just deploy tiered storage by itself." And it worked phenomenally. It was fine.

Kris Jenkins: (30:06)

Oh, cool.

Liz Fong-Jones: (30:08)

Basically, it did exactly what it says on the tin when it was the only variable you changed at a time. And then after that, we were able to iterate from there and make smaller changes.

Kris Jenkins: (30:22)

One step at a time. Yeah.

Liz Fong-Jones: (30:25)

Yeah. One step at a time.

Kris Jenkins: (30:26)

Much more smooth.

Liz Fong-Jones: (30:28)

Yeah. So then I think that brings us to the last improvement that we made, which was about a month and a half ago, really. About a month and a half ago, Amazon contacted us and said, "Hey, we hear that you like our instance store bulk storage i3en series. We hear you like Graviton2, which is the Arm-based instances that are lower total cost of ownership. Are you interested in trying this combination of storage instances that are based on Arm?" And we said yes. That seemed like a useful thing at the time.

Kris Jenkins: (31:06)

[crosstalk 00:31:06] Steel handle to the bleeding edge.

Liz Fong-Jones: (31:06)

We went in with plans to potentially be able to fall back, and also changing only one thing at a time. That's what we had gotten wrong the first time around: we changed everything at once. Now we had tiered storage running stably. Now we could actually go and say, "Okay, now we're going to change the underlying instance type on a stable, non-changing set of software."

Kris Jenkins: (31:30)

Right. Yep. Okay. That makes sense.

Liz Fong-Jones: (31:34)

So we did some initial validation, and Amazon was very clear with us: "Hey, you're not allowed to use this for production workloads until we give the all clear," which is fine. But also, having been burned last year, having been burned by "oh, this scale model is working fine in dogfood at one-third scale," we deployed this new architecture to two out of our six production Kafka brokers, just to make sure it would hold up under the load. Because we knew that the load properties were subtly different between dogfood and production. This wasn't us cheekily utilizing it for production load. This was us doing this as part of our validation process, to be able to confidently say, we think once this goes [inaudible 00:32:20] we're actually going to be able to switch over. We don't want to repeat a January where we try a thing and it doesn't work.

Kris Jenkins: (32:25)

Yeah, absolutely. So how long did it take for that to roll out? For you to bed it in and decide to go to production?

Liz Fong-Jones: (32:37)

It basically took us two and a half weeks to be confident enough. So we rolled most of our dogfood cluster, we rolled a third of our production cluster, and then we had enough confidence, based on the performance data that we were seeing, that the claims that Amazon had made held up: that it would be just as much storage, it would give us twice the CPU, and it would only cost 14% more than the previous instances we were using. And double the network, too. In all these properties, it would let us scale up without increasing the cost any further than just paying 14% more for the instances, and get double the performance out of it. We were like, "Okay. Yeah. Sold. That seems good."

Kris Jenkins: (33:22)

Yeah, absolutely.

Liz Fong-Jones: (33:23)

So they announced it at re:Invent on, I think, November 30th or December 1st. And by the end of that week we were a hundred percent migrated over. It helped that we'd been testing our recovery processes by killing brokers every week anyway, so this just became a broker replacement workload. We rotated through all the instances: kill one, kill the next one, kill the next one. Wait for them to replicate. And then just keep proceeding. So it took us less than a week to fully cut over our entire workload to the Im4gn instances.

Kris Jenkins: (33:55)

Okay. And how's it been going since then?

Liz Fong-Jones: (34:02)

Buttery smooth. Completely buttery smooth. Now the last things that we're trying to work on are just further decreasing that cross-availability-zone read traffic. So trying to read from local Kafka brokers when we can, and otherwise keeping an eye on the system, using the Honeycomb metrics product to look at the Kafka broker metrics. And yeah, Kafka is no longer a scaling pain point for us anymore. Basically, right now I'm looking at this dashboard, it's the peak of the day on Monday, and we're at near 30% CPU saturation on our brokers. This feels like a system that can comfortably scale by another factor of two and be all right.
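
Reading from a broker in the consumer's own availability zone is what KIP-392 "fetch from follower" provides in Apache Kafka 2.4+. Below is a sketch of the client side with Sarama, with the broker-side settings noted in comments. The environment variable and group name are assumptions, and this is the stock Kafka mechanism rather than a description of Honeycomb's exact setup.

```go
package main

import (
	"log"
	"os"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_4_0_0 // fetch-from-follower (KIP-392) needs Kafka 2.4+

	// Tell the brokers which "rack" (here: AWS availability zone) this consumer
	// is in, so reads can be served by the in-zone replica instead of crossing AZs.
	// Broker side (server.properties):
	//   broker.rack=us-east-1a   (each broker's own AZ)
	//   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
	cfg.RackID = os.Getenv("AVAILABILITY_ZONE") // e.g. "us-east-1a"; variable name is an assumption

	group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "retriever", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()
	// ... call group.Consume(...) with a ConsumerGroupHandler, as in any Sarama consumer.
}
```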

Liz Fong-Jones: (34:52)

So it's one of those things where you think about what this system was two years ago and you're like, "Oh gosh, this would've been scary." We'd be juggling Kafka brokers. We'd be paying for all of these C5 brokers with all their EBS drives, and we had very little automation around broker replacements. And now it's basically running itself. Healing itself. It's not automatically scaling, but we know what the capacity dimensions are, having hit many of them over the past year. We know what we need to do to keep this thing running.

Kris Jenkins: (35:28)

Yeah. And it makes you think, doesn't it, that things like metrics at first glance seem like such an easy thing, just an afterthought in a system. But when you really start to get up to serious scale, there are so many small levers to pull and bits to check and test to get the actually scalable tools you need. Right?

Liz Fong-Jones: (35:54)

Right. Exactly. The headline number on our case study is something like an 87% reduction in cost per megabyte. And that didn't come from any one specific change. It's just a series of anywhere between 25 to 40% savings with each adjustment that we made.

Kris Jenkins: (36:15)

Yeah. Crikey.

Liz Fong-Jones: (36:17)

So you have to chain these improvements together and that's how you keep that cost low and the overhead low.

Kris Jenkins: (36:25)

Yeah. Well, it's one of those things that's fascinating, but I'm glad I didn't have to do it.

Liz Fong-Jones: (36:33)

Yeah. And this is a thing that we repeatedly try to emphasize to people: if you want any reliability of your observability pipeline, you're going to want to have Kafka in there, or Pulsar, or something like it. Except that we know that Pulsar is less mature because it's less battle-tested. So, okay, you know you're going to need a streaming data solution to ingest your streaming observability data.

Liz Fong-Jones: (36:56)

Now, do you really think that you can do a great job of running that versus someone who is paid to work mostly full-time on it? And even in our case, we are decent, we think, these days at operating Kafka, but we still very much rely on and appreciate being able to turn to Confluent for any help with the actual Kafka code itself. So it's one of these situations where, unless you're very good at operating this particular kind of Kafka that is very high throughput per topic, you're probably not going to do a better job of running your own observability pipeline, as opposed to trusting someone who does it as their actual job. Oh, and by the way, we also have a design team that thinks about how you design a good user experience when you're interacting with this data.

Liz Fong-Jones: (37:44)

Or how do you handle the fast querying of the data? Because it's not just the stream reliability. It's also how you surface it quickly for query. There are so many dimensions that go into potentially building your own solution like that. We think that a lot of what we're sharing here as lessons, it's not so much that you should build your own. It's, if you are using Kafka, here's how to optimize the cost. But think about whether you really want to be running your own Kafka. And this is why we're glad that Confluent Cloud exists: if people need a slightly smaller scale Kafka and they want metered access, that's available to them. They don't have to scale up their own brokers.

Kris Jenkins: (38:28)

Yeah. Plus, on top of all that, it makes me think: the very day you need these metrics to be working well is your bad day. If there's a disaster in your business, a disaster in your system, that's the day you need this other thing to be completely reliable.

Liz Fong-Jones: (38:45)

Yeah. Of course, there are always correlated failures. Yeah, there are always correlated failures. Like if [inaudible 00:38:53] one goes down, everyone's having a bad time. We didn't go out, but there was a lot of scrambling during that week in which everyone else was having significantly worse pain. But yeah, it can happen. Definitely, if you are having an outage in your own infrastructure only, you don't want a correlated failure of your observability.

Kris Jenkins: (39:17)

Mm. Yep. Gods. Well, thank you. Thank you, Liz and thank you Honeycomb for putting in all that work so that I never have to.

Liz Fong-Jones: (39:27)

Yeah, of course. It's what keeps me coming to work every day is making developers' lives easier with better observability.

Kris Jenkins: (39:35)

Cool. Well, it's been a pleasure observing your story. Can I get away with that pun? I'm going to try to. A terrible pun. Yes. Thank you for joining us on the show. I know you have a blog post if anyone wants more gory details on this story, and we'll put that in the show notes. In the meantime, Liz Fong-Jones of Honeycomb, thank you very much for joining us.

Liz Fong-Jones: (40:02)

Thanks for having me on the show.

Kris Jenkins: (40:03)

And that wraps it up for another episode of Streaming Audio. Before we go, let me remind you that we're brought to you by developer.confluent.io, the website that will teach you everything we know about Kafka, how to scale it, how to make it perform and how to get started because that's important too. You can't scale it until you've got started, can you? So why don't you join us on one of our courses that will teach you everything you need to know. And if you do use one of our courses and sign up to Confluent Cloud, use the code PODCAST100 and we'll give you $100 of extra free credit.

Kris Jenkins: (40:39)

In the meantime, let me say thank you once again to our guest this week, Liz Fong-Jones, thank you to Honeycomb for sending her to us, and thank you for listening. If you have any comments or questions or you just want to get in touch, you can find us on all the usual internet channels. If you're watching this on YouTube, there's probably a box just down there. And if you're listening to the podcast on audio only, have a look in the show notes, because we include all our contact details there. Thank you very much for joining us. Until next time.

How many messages can Apache Kafka® process per second? At Honeycomb, it's easily over one million messages. 
 
In this episode, get a taste of how Honeycomb uses Kafka at massive scale. Liz Fong-Jones (Principal Developer Advocate, Honeycomb) explains how Honeycomb manages Kafka-based telemetry ingestion pipelines and scales Kafka clusters.

And what is Honeycomb? Honeycomb is an observability platform that helps you visualize, analyze, and improve cloud application quality and performance. Their data volume has grown by a factor of 10 throughout the pandemic, while the total cost of ownership has only gone up by 20%. 

But how, you ask? As a developer advocate for site reliability engineering (SRE) and observability, Liz works alongside the platform engineering team on optimizing infrastructure for reliability and cost. Two years ago, the team was facing the prospect of growing from 20 Kafka brokers to 200 Kafka brokers as data volume increased. The challenge was to scale and shuffle data across that growing number of brokers while maintaining cost efficiency.

The Honeycomb engineering team experimented with using sc1 or st1 EBS hard disks to store the majority of longer-term archives and keeping only the latest hours of data on NVMe instance storage. However, this approach to cost reduction was not ideal, and they ended up needing to keep even data older than 24 hours on SSD. The team then began to explore and adopt Zstandard compression to decrease bandwidth and disk usage; however, the clusters were still struggling to keep up.

When Confluent Platform 6.0 rolled out Tiered Storage, the team saw it as a feature that could help them break away from being storage bound. Before bringing the feature into production, the team did a proof of concept, which helped them gain confidence as they watched Kafka tolerate broker death and measured the latencies of fetching historical data. Tiered Storage now shrinks their clusters significantly: only the hottest data is held on local NVMe SSD, and tiered data is stored just once on Amazon S3 rather than consuming SSD on all replicas. In combination with the AWS Im4gn instances, Tiered Storage allows the team to scale for long-term growth.

Honeycomb also saved 87% on the cost per megabyte of Kafka throughput by optimizing their Kafka clusters.

