October 27, 2022 | Episode 240

Running Apache Kafka in Production


Kris Jenkins: (00:00)

Hello. You're listening to Streaming Audio. And in this episode we are going to dig into the realities of living with Kafka, and just a random assortment of facts and decisions you should know about before you go into production. And we have the perfect guest to guide us through. This week we're talking to Jun Rao, who's been deep in the engineering of Kafka ever since it began. And he's probably going to end the episode and get back to reviewing pull requests.

Kris Jenkins: (00:27)

Before we talk to Jun, let me tell you that Streaming Audio is brought to you by Confluent Developer. Our site teaches you everything you need to know about Kafka. Check it out at developer.confluent.io. And while you're learning, you can easily get a Kafka cluster started using Confluent Cloud. Sign up with the code PODCAST100 and we'll give you an extra $100 of free credit to get you started. And with that, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. My guest today is Jun Rao, who is a co-founder of Confluent and a longtime engineer and committer on Kafka itself. Hi, Jun, thanks for joining us.

Jun Rao: (01:16)

Hey, Kris. Yeah, nice meeting you here.

Kris Jenkins: (01:19)

Good to have you. So you've been dealing with Kafka since its inception and you've kept your hands and eyes on the code the whole way, right?

Jun Rao: (01:29)

Yeah, all the way since 2010.

Kris Jenkins: (01:37)

God, that's a long time in internet years. So you must have seen a lot of use cases in production along your way, right?

Jun Rao: (01:48)

Yeah. We've definitely seen a lot of use cases from various industries, from various geographic locations. Yeah, it's a pretty exciting time to see how the technology has been picked up and adopted pretty broadly.

Kris Jenkins: (02:02)

Yeah, it really has. It seems to be in just about every industry going these days.

Jun Rao: (02:07)

Mm-hmm, yeah.

Kris Jenkins: (02:09)

But the topic I really wanted to talk to you about while we're here is having seen all those projects and those design things come up along the way, I was hoping you could teach me all the things that could go wrong so that I can avoid them.

Jun Rao: (02:28)

Yeah. So that's a pretty broad topic. I think our time is limited, so I would probably just share maybe a few things. To start with, the first thing I've seen, especially with a lot of people who are doing early adoption of Kafka, is you have to nail down the operational part a little bit. Today, people have choices about whether to use managed services or not. If you are using a managed service, the provider takes care of the operations for you, so it's actually much easier.

Jun Rao: (03:03)

There are still places where people are self-managing. For that part, a lot of people, when they start, will just try Apache Kafka, and it actually works pretty well and it's pretty stable, which is what they want, so they'll just put the production load on that. But I think what people sometimes forget is that managing a distributed system like Kafka can be tricky, because you have to make sure all the parts are healthy together.

Jun Rao: (03:35)

So that's why, once you put this in production, you do need to invest a little bit in the operational part: how you monitor things, how you set up alerts, how you have the logging in place in a way that's easily accessible. Some of the mistakes I've seen are that people are very excited to try this new technology, they put a workload on it, and it works. But since they don't have this observability or visibility, multiple things go wrong over time.

Jun Rao: (04:16)

Maybe the first thing doesn't really affect you, but since you don't have the visibility, you didn't really take care of it. Then over time, as things accumulate, multiple things could go wrong, and that brings you into a state where your Kafka cluster is not healthy, and at that point it's harder to react and fix it. So that's one of the first things I learned: especially in the early adoption phase, it's actually pretty useful to think about not only the use cases but also how you would operationalize it.

Kris Jenkins: (04:52)

Yeah. Yeah. I've worked in places where DevOps stopped at actually putting it into production, but once it's in production, you've got to live with it, and that's sometimes when the real work starts, right?

Jun Rao: (05:03)

That's right, that's right. Yeah. Especially as you are using this more broadly and then for more mission critical applications.

Kris Jenkins: (05:14)

Yeah. Let's stay on that for a bit. So if you have good monitoring, do you think there's a pattern of things that tend to come up for first-time users? Things that you should be looking out for?

Jun Rao: (05:28)

Yeah. I think when you add monitoring, there are different levels of monitoring that would be helpful. One, of course, is that Kafka provides a bunch of Kafka-level metrics. You monitor things like your load distribution, how many things are coming in, how many things are going out, and whether the data is replicated properly, and then maybe some of the latency metrics. These are all useful, but there are also metrics that are independent of Kafka: general health metrics for your JVM, GC activity, and other things. And then for your host, it can be network, it can be storage, it can be memory, right?

Jun Rao: (06:23)

So I think you need a little bit of visibility into all of those. Typically, you will start with some of the high-level metrics we have in Kafka. The typical things we have are the percentage of the threads that are busy — you want to make sure that's under control — and then whether there are any offline partitions, whether there are any under-replicated partitions. These are some of the early indicators that problems may be arising, if that's not expected. If you see any anomaly there, often the next-level thing is that you probably need to dig into some of the lower-level, more detailed metrics in Kafka, and maybe some of the machine- or process-level metrics, to piece them together.

Kris Jenkins: (07:19)

Yeah. How do you even begin to do that? Is it... I haven't spent time monitoring a large Java service in years, but is it all JMX or is there one solution? What's going on?

Jun Rao: (07:34)

Yeah, the Kafka broker runs in a JVM, so a lot of the metrics in the process and in Kafka land are available as JMX metrics. But we also provide a reporter so that you can essentially export the Kafka metrics, probably in a more efficient way, into any system that can be used for monitoring, whether it's Datadog or Elastic or whatever. So you do have a little bit of flexibility there.
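As a rough illustration of the metrics Jun mentions, here is a minimal Java sketch that reads two broker health gauges over JMX. It assumes JMX is enabled on the broker (for example by setting JMX_PORT=9999 before starting it), and the hostname broker1 is a placeholder; in practice you would more likely export these through a metrics reporter or monitoring agent rather than polling by hand.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker was started with JMX enabled.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Broker-level gauge: partitions with fewer in-sync replicas than configured.
            Object underReplicated = conn.getAttribute(
                new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                "Value");

            // Controller-level gauge: partitions that currently have no leader at all.
            Object offline = conn.getAttribute(
                new ObjectName("kafka.controller:type=KafkaController,name=OfflinePartitionsCount"),
                "Value");

            System.out.println("Under-replicated partitions: " + underReplicated);
            System.out.println("Offline partitions: " + offline);
        }
    }
}
```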

Kris Jenkins: (08:11)

Okay. So it sounds possible, but painful. Let's go up a tier. Let's talk about design issues, because it's not just running the thing, right? There are choices you make when designing your topic scheme. Topic layout, should we say? Data schema. So give me some ideas on what I should avoid with that.

Jun Rao: (08:39)

Yeah, I think there are probably a couple of things worth talking about. One is probably the metadata design. In Kafka, at the Kafka API level, all the events are just bytes. You have a key and a value, but those are just byte arrays. This gives the application the flexibility to use any serialization mechanism. So as part of serializing, I think it's useful to reason about things like the schema, what data you're putting in there.

Jun Rao: (09:20)

Some of the early adopters tend to be a little bit loose on that, because they just want to get data in quickly. Sometimes it could just be unschematized JSON. That can work if there is just one use case, because you can coordinate between who publishes the data and who reads the data. But as your system evolves, once you have started adding a second reader or the team is getting bigger, that kind of direct communication doesn't work, because you could easily add something but forget to tell everybody, and then things start to break.

Jun Rao: (10:09)

So that's why I think it's worth reasoning about the schema a little bit upfront, and then having some story on who owns that schema and what the path is for evolving it over time as your business evolves down the road. If you don't, then what we saw in other places is that they're very excited to get things started quickly, but six months later they regret it, because they have to redo and rethink the metadata part.

Kris Jenkins: (10:43)

Yeah, yeah. Schemas can be so useful for just discoverability, understanding what your data is, right? That's where [inaudible 00:10:54] shine.

Jun Rao: (10:55)

It's also a well spelled out contract between the publisher and the subscriber. You sort of need that contract for this end-to-end business application to flow through.

Kris Jenkins: (11:08)

Yeah, there's a definite relationship between schemas for your data and static typing where your compiler checks things, right?

Jun Rao: (11:15)

Yeah.

Kris Jenkins: (11:16)

So that raises two questions. The first is, well, I always just go with Avro. Am I making the right choice? Is it always the right choice? Which schema format do I choose?

Jun Rao: (11:33)

Yeah, I think giving people a little bit of flexibility can be helpful. Avro is popular in some of the data domains, right? People in some of the data lake areas are familiar with this mechanism. It has the benefit that it completely decouples the schema from the serialization of the data — the schema is stored outside of the data — which tends to make the serialization more concise. It also has a bit of a story around schema evolution.

Jun Rao: (12:21)

So how to do that in a more compatible way. These are good things about using Avro; I think it can be a pretty good choice. In some other places, if people are coming from the remote procedure call world, a lot of people are probably used to things like Protocol Buffers, right? It has a bit of a different design from Avro — the schema is not completely decoupled from the serialization, so you tend to pay a little bit [inaudible 00:12:59], but maybe not a big deal.

Jun Rao: (13:00)

It also has a bit of a story in terms of evolution, with a slightly different twist from Avro. So if that technology is appealing to you, that can also be possible. And then at Confluent, we have Schema Registry, which supports both of those kinds of serialization as well as JSON Schema, if that's your thing.
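To make the serialization discussion concrete, here is a hedged sketch of a Java producer using Confluent's Avro serializer together with Schema Registry. The topic name, the Order schema, and the localhost addresses are illustrative placeholders, and it assumes the kafka-avro-serializer dependency is on the classpath.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder cluster
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up the schema in Schema Registry.
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");  // placeholder registry

        // A hypothetical record schema; the registry becomes the contract between teams.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "o-1001");
        order.put("amount", 42.0);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-1001", order));
        }
    }
}
```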

Kris Jenkins: (13:32)

JSON Schema I don't know much about. So when's that the right choice?

Jun Rao: (13:37)

I think in terms of JSON Schema, it is a thing. I'm not sure how popular it is, but it does describe the structure of the data, just for JSON data. It does give you some flexibility, and a lot of people are familiar with the concept of JSON elsewhere. One of the downsides of JSON used to be that the encoding is kind of verbose, because the data and the schema [inaudible 00:14:13] when you serialize it.

Kris Jenkins: (14:14)

Yeah, that's right.

Jun Rao: (14:17)

But with compression enabled, right? Because a big part of compression, especially if you're doing it in a batch way, is really just stripping out the overhead from that redundant schema. So you can still get some efficiency there. And then the benefit is, okay, maybe you can work with more familiar tools on this data.

Kris Jenkins: (14:40)

Yeah. Okay. So tell me if I'm wrong. Is it fair to say use Google Protocol Buffers or JSON Schema if that's the domain you're already familiar with? And if you don't have one, choose Avro. Would that be a fair summary?

Jun Rao: (14:59)

Yeah, I would say if you want probably the most efficiency in [inaudible 00:15:07] sending the data, Avro and Protobuf would probably be the better choice. In some cases, for ease of use, maybe JSON can be appealing to a broader audience if they're familiar with it already.

Kris Jenkins: (15:25)

The great thing about JSON compared to the other two is you can just read it in a text editor, right? There's no binary there.

Jun Rao: (15:32)

Yeah, there's a lot of tooling available, so that's probably the convenient part of it.

Kris Jenkins: (15:38)

Out of curiosity, I bet you'll know this. When you enable compression, is it stored compressed on disc? Is it compressed per event or per batch of events? Or how does the compression work?

Jun Rao: (15:51)

Yeah. This is controllable on the producer. Typically it compresses a batch of events together, because compressing them as a batch typically gives you a much better compression ratio than compressing them individually — there are just more things you can share as you are encoding the data. And then the compressed data will be sent from the producer to the broker, it will be stored in the compressed format, and it will actually be delivered to the consumer in the compressed format. It's only when the consumer application wants to read it that we actually decompress it. So we actually get a lot of efficiency end-to-end.
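For reference, batch compression is a producer-side setting. A minimal sketch, with an illustrative topic name and batch sizes, might look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // Whole batches are compressed; bigger batches generally compress better.
        props.put("compression.type", "lz4");
        props.put("linger.ms", "20");       // wait up to 20 ms to fill a batch
        props.put("batch.size", "65536");   // target batch size in bytes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>(
                    "clickstream", "user-" + (i % 10), "{\"page\":\"/home\"}"));
            }
        }
    }
}
```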

Kris Jenkins: (16:42)

Yeah, but you kind of lose that granularity I assume? I couldn't send you one event anymore?

Jun Rao: (16:49)

Yeah, but that's the thing with the Kafka API: it's not really designed for randomly picking individual events. It's really designed for streaming events continuously from a particular starting point. So that API actually fits this pretty well, because typically you'll be streaming new data continuously, and we'll be delivering those compressed batches to you as they come.

Kris Jenkins: (17:16)

Okay, that sounds fair. And the other question you raised when you mentioned metadata is, is there any other metadata? Because I know schema is the big one, but what else gets attached? What else should I think about?

Jun Rao: (17:32)

Yeah, the other part of the design is a bit Kafka-specific. In Kafka, we have this concept called a topic, and a topic is partitionable. A topic, in some sense, is like a database table, right? So when you design event streaming applications on Kafka, you also have to be a little bit careful about how you spread your data among those topics. On the one hand, you need to understand that a topic is sort of the way of grouping events of similar types together. That's sort of our container, and it's also the unit you can subscribe to — that's sort of the filtering you get in Kafka. On the other hand, a topic is metadata that takes some resources, right? It's not something that's completely free. So you have to use it a little bit more judiciously.

Kris Jenkins: (18:46)

Okay.

Jun Rao: (18:47)

In the early days, I was doing some consulting, and I had this interesting use case when I was talking to a user. This company is an enterprise company, right? It's collecting some data for its customers, and then it needs a way to send some of that information back to each of its customers, potentially in real time. So they were thinking, okay, Kafka is a good fit, right? We can just collect those events for all those customers in Kafka, and then we'll just read them back and send them to those customers through some REST interface.

Jun Rao: (19:35)

Then the question is, how many topics should you use? I think the naive way is, okay, maybe just put each customer's data in its own topic, right? Because then I can just read them individually and send them out. That would be fine; the only thing is, how many customers can you have? For them, they were already at probably hundreds of thousands of customers.

Jun Rao: (20:07)

They imagine this can grow into many millions and beyond, right? So that's a challenge. Having millions of those topics is a little bit of overkill, because each of them does take a little bit of resources in terms of the Kafka metadata you have to manage. And if you have to do this for all your applications, then that overhead sort of compounds.

Kris Jenkins: (20:36)

Yeah. That sounds to me like a place where they've confused topics and partitions. Is that fair to say?

Jun Rao: (20:45)

Yeah. You can sort of map those to partitions too, but each partition also has its own metadata, so it boils down to a similar issue: can you have multi-millions of partitions, or even beyond that? Over time, we have been increasing the number of partitions we support — with KRaft, we can actually get to the millions level, right? But you still have to think about it a little bit, especially overall. That's just for one application, which may be okay, but if every application does that, then the amount of metadata could still be more than you truly need.

Kris Jenkins: (21:32)

Right. So what's the solution for that customer?

Jun Rao: (21:36)

Yeah. So in the end, we looked at it and said, okay, what are the other ways of doing this? Another way is to just put that into a shared, single topic, or maybe a few of them. Maybe the [inaudible 00:21:57], just a single topic. Then you can read it off and send it to the different customers. There, the only potential issue is, okay, you have lots of those customers, and their service may not always be available, right? Maybe one of the customers' REST API is down — what do you do? You certainly don't want to lose those events for that customer. You want to keep retrying, sending them to see if they are available. But while you are doing that, what about the other customers' data that you are currently reading, because now they are in the shared topic or partition. If you-

Kris Jenkins: (22:45)

You're saying you've got this consumer and it's trying to send out all the data on that topic to different customers. If one customer goes down, it's kind of got to wait for them to come back up before it can get to the next one down on the queue?

Jun Rao: (22:57)

Yeah, because in Kafka, it's like a queue, right? All the events are delivered to you in that queue order. So in general, you have to consume all the events before you read the next set of events. In that case, if you can't finish processing some of the events, the typical thing is you have to wait until you finish processing them before you get the next ones. But that would delay the processing of other customers, which is also not what we want. So that also has its own issues, although it does simplify a lot of the metadata design.

Jun Rao: (23:38)

So in the end, we looked at that and said, okay, we can actually find a better middle ground. Just like in a lot of queuing systems, you have this concept of a dead letter queue, which you can use for staging things. And it doesn't have to be a lot. For example, you can have all your customers' data in one topic and then you have this process. But if you encounter an issue where you can't send data for a particular customer, what you can do is just stash those events into another retry topic, right?

Kris Jenkins: (24:21)

Right, yeah.

Jun Rao: (24:22)

Then, okay, first of all, you haven't lost that data, because it's buffered somewhere else. The second benefit is that now you can actually unblock the consumption for the other customers — they can still be real time — and then you can have a separate process that goes back, with the retry logic, maybe with backoff, and retries those. You can even extend that, because if the retry fails, maybe you want to gradually back off a little bit more. So maybe instead of having one retry topic, you can have a few of them, depending on the granularity of the backoff time. Put together, you can still solve this problem in a pretty efficient way, but now you have much better management of the amount of metadata you have to cope with.
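Here is a rough Java sketch of that retry-topic pattern. The topic names are illustrative, and deliverToCustomer is a hypothetical stand-in for the per-customer REST call; a real implementation would also manage offsets and backoff more carefully.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CustomerDispatcher {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "customer-dispatcher");
        consumerProps.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        deliverToCustomer(record.key(), record.value());
                    } catch (Exception e) {
                        // Don't block everyone else: park the event in a retry topic,
                        // which a separate process replays later with backoff.
                        producer.send(new ProducerRecord<>(
                            "customer-events-retry", record.key(), record.value()));
                    }
                }
            }
        }
    }

    // Hypothetical stand-in for the per-customer REST delivery described above.
    private static void deliverToCustomer(String customerId, String payload) throws Exception {
    }
}
```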

Kris Jenkins: (25:23)

Yeah. Okay. I fear I'm going to ask a dumb question here, but I want to know the answer, so here it goes. Why in that situation where I've got 100,000 customers and one big topic, could I just have one consumer per customer? And then it doesn't matter if the... One consumer group per customer... And then it doesn't matter if my consumer gets blocked. Would that work?

Jun Rao: (25:50)

Yeah, that could also work. The only downside is that the topic is sort of the only filtering we have. Now, essentially, every customer will be reading the data for all the customers and then throwing away probably 99.9999% of the data, only to keep the 0.01% for themselves. So it won't be efficient, because your read amplification is the number of customers — everybody's reading all the data. So that's the other problem.

Kris Jenkins: (26:28)

Okay. So I see how that works, but it doesn't sound like there's one silver bullet answer. You've got to understand the domain a bit.

Jun Rao: (26:40)

Yeah, I think you have to be aware of the concept of a topic and its implications, and then make a judicious choice about how you design your application around that. Just like how you would design database tables and schemas — you typically wouldn't have a million tables unless you had to. But sometimes there are middle grounds you can bridge to do things better.

Kris Jenkins: (27:08)

Okay. So where does partitioning come into that picture? Let's say I've partitioned my topic by the customer ID. Does that get me into any better place?

Jun Rao: (27:20)

You could. If you have partitioning, you can typically use more consuming processes or threads to do things in parallel, and that often speeds things up. And if you partition by customer, maybe it also gives you a better chance of grouping the data that you want to send to a particular customer. So sometimes that could help, right?

Kris Jenkins: (27:55)

Yeah. Okay. So I've got some work to do thinking about my use case. Fair enough. Let's move on from that. So this is a big, wide question, but what advice do you have for anyone getting started? What advice have you had over the years for getting started with Kafka?

Jun Rao: (28:20)

Yeah. Well, one of the things to get started — a lot of places do ask this question — is, okay, how many Kafka clusters do I need? In some cases, it's convenient for people to say, oh, I have a particular application I want to use Kafka for, I will just have my own cluster to manage. That probably makes it easier to adopt, easier to get started — there's no one I need to coordinate with.

Jun Rao: (29:02)

But over time, I think a big benefit of Kafka is that it allows you to have this central nervous system where you can integrate data from all the places in your organization in real time. The more you can put things together in a shared environment, the more you can share their usage across applications. Maybe now multiple applications can share the same topic's data, and you can start joining different topics' data together for building more sophisticated applications.

Jun Rao: (29:47)

So there, I think a lot of the concern people have about this shared environment is mostly on two fronts. One is probably around security and access control: who has access to what? If I have my own thing, of course, I can control who can access it. But in a shared environment, how can we make sure the data is only accessed by the team or the application that really needs to access it?

Jun Rao: (30:20)

The second issue is probably a little bit on the isolation side, right? If it is a shared environment, how do we make sure a runaway application doesn't affect me? Both are valid concerns, but over time in Kafka, we have made quite a lot of advancements to make this kind of sharing more possible. On the security side, we have added pretty sophisticated authentication mechanisms. Now you can authenticate using various industry-standard mechanisms, whether it's OAuth, TLS, or Kerberos — we support all of those.

Jun Rao: (31:08)

And then we have pretty fine-grained access control. Once you have those authenticated users, we can say, okay, for this authenticated user, you can only read from this topic, or maybe write to this topic. Maybe you can even specify that you have to use this particular consumer group, right? So you have this pretty fine-grained control over who can access what. That probably mostly addresses the first concern. The second thing, for isolation, is that over time we have added various quota support, which makes it pretty convenient for an administrator to provide this good isolation.

Jun Rao: (31:54)

You typically can say, okay, what's the amount of resources a particular application can have in terms of CPU, maybe network — these are probably the two big things. And once you set that up, if a particular application is abusing the system, its load will be throttled automatically by the server. So then it's only affecting that particular application; everyone else will be protected.
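As an illustration of those two controls, here is a hedged Java sketch using the Admin client to grant a principal read access to one topic and to cap its bandwidth with quotas. It assumes the cluster already has authentication and an authorizer configured; the principal, topic, and limits are made-up examples.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class SharedClusterSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 1. Allow the "orders-app" principal to read only the "orders" topic.
            AclBinding readOrders = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "orders", PatternType.LITERAL),
                new AccessControlEntry("User:orders-app", "*",
                    AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(readOrders)).all().get();

            // 2. Throttle that user to roughly 1 MB/s of produce and consume bandwidth.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.USER, "orders-app"));
            ClientQuotaAlteration quota = new ClientQuotaAlteration(entity, List.of(
                new ClientQuotaAlteration.Op("producer_byte_rate", 1_000_000.0),
                new ClientQuotaAlteration.Op("consumer_byte_rate", 1_000_000.0)));
            admin.alterClientQuotas(List.of(quota)).all().get();
        }
    }
}
```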

Kris Jenkins: (32:33)

Which sounds fair to me.

Jun Rao: (32:34)

Yeah, because you are the one who is violating the contract. So with both of those, I think it now makes it much easier for applications to share the data in a much bigger, probably [inaudible 00:32:55]. It's not like you have to put everything in a single Kafka cluster, but being able to think through that and then share as much as possible gives you this bigger advantage, which is that you can allow more applications to leverage this integrated data in real time.

Kris Jenkins: (33:16)

Yeah. It kind of makes me think of relational databases where there are legitimate reasons to have two or three or however many different Postgres databases. But I wouldn't by default, put every table or every few tables in a different database. I'd try and centralize it into a single thing I have to manage and then use access controls to manage who can access what. Is that a fair comparison?

Jun Rao: (33:46)

Yeah, I think it is a fair comparison. The database has the same issue — ideally, you want people to share fewer of those databases. Although in reality, most transactional databases weren't designed to be particularly scalable, so at some point you can't put everything in there. But Kafka is a little bit different. It is actually designed as a distributed system from the ground up, so it can scale out pretty easily. So from the scalability perspective, there's actually a much stronger reason for you to share, because it's possible to do that, and it's probably also more economical from an operational perspective to manage fewer clusters than many small individual clusters.

Kris Jenkins: (34:41)

Yeah. And you shouldn't get the sort of downsides you get of overloading one database, right?

Jun Rao: (34:47)

That's right. And then you get this isolation, right, protection for sharing applications.

Kris Jenkins: (34:54)

Okay, makes sense. But give me the counter argument. I'm asking you to argue with yourself here. When is it a good time to split off into a different cluster?

Jun Rao: (35:07)

Yeah. It depends — it also depends a little bit on the business domain, right? Sometimes, for business data that is never accessed together, there's no strong reason you have to put it together. Especially if it has different characteristics in terms of durability guarantees or the criticality of the data. For example, maybe you have some internal operational data that is a bit different from your business metrics. The business metrics are probably more mission critical, and you probably never want to lose them.

Jun Rao: (35:48)

Whereas the internal operational metrics tend to be higher volume, but maybe you just want some best effort to preserve as much as possible. So there, maybe having different clusters gives you even better isolation, and then you can configure them and evolve them differently. And chances are there aren't that many use cases where you need to access them together in a single application.

Kris Jenkins: (36:26)

Right. So you might actually start splitting them by SLA almost?

Jun Rao: (36:33)

It could be. It could be based a bit on SLA. It can also be a bit based on business needs. Because in this case, they are from different business units. I think one is more the core production business, the other is probably more internal.

Kris Jenkins: (36:56)

Right. Yeah, that makes sense. You touched on this in that answer — storing data according to how permanent that data is — which leads me on to wondering about... I've always found it weird, coming to Kafka, that the default for storing data is like seven days. Is that right?

Jun Rao: (37:22)

Yeah.

Kris Jenkins: (37:24)

As opposed to forever, until I say otherwise.

Jun Rao: (37:28)

Yes.

Kris Jenkins: (37:29)

So first, why? Why is it seven days and when should I change it and what do I do about managing the life cycle of all that data?

Jun Rao: (37:38)

Yeah. So I think what we realized is that applications are definitely moving to this sort of event-driven way, because everyone wants to be able to leverage the data in more real time as new information comes in. That's sort of the value of Kafka — it gives you this feed of data in real time. Now, why do you need to keep the data for seven days? Well, I think it depends on the application. Some applications have to deal a little bit with external systems.

Jun Rao: (38:29)

Most of the time you're real time. But your external system might be under some maintenance and not available — it could be down over the weekend, like a lot of traditional databases. Of course, you can't consume those events if you want to write to those databases, but you definitely don't want to lose them, right?

Jun Rao: (38:51)

Having this longer buffer allows us to resume the consumption when those external systems are back and available again. A second use case is really the bootstrap use case. Think of an application you are building which reads some clickstream events and builds some state so that you can do things like recommendations better. Most of that is pretty real time, but sometimes maybe you say, oh, my application load is increasing — I really want to have a separate instance of that system.

Jun Rao: (39:40)

So you want to start another instance doing exactly the same thing, but initially, that instance needs to build its initial state. Where does it get that initial state? That's where having a little bit longer buffer also helps, because now you can just go back, based on whatever data you need, to build up this initial state. And after that, you can just switch to the incremental, real-time consumption. So that's also pretty convenient.
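If you do want a longer buffer for a particular topic, retention is a per-topic setting. Here is a minimal sketch, with illustrative names and sizes, creating a topic that keeps thirty days of data instead of the seven-day default:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBufferedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 12 partitions, 3 replicas, and 30 days of retention (retention.ms).
            NewTopic topic = new NewTopic("clickstream", 12, (short) 3)
                .configs(Map.of("retention.ms",
                    String.valueOf(30L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```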

Kris Jenkins: (40:14)

But from those cases, you're kind of saying this isn't just a queue where events get used up and then dropped. You want to keep the data around a bit longer. But I'm also wondering when do you say, actually, I'm keeping the data around forever? When is that a good choice?

Jun Rao: (40:36)

Yeah, that's another interesting capability in Kafka. By default in Kafka, we keep the data by time, but we also have another flavor of topic called a compacted topic, which doesn't keep the data by time — it keeps the data by key. In this case, when you publish each event, you'll have a key and a value, and we guarantee that we retain the latest value associated with each key as you are sending those events. That's actually one of the things I find super useful, as I was talking to some of the customers. It's especially useful for capturing the types of data that are updateable — for example, from a change data capture log from a database.

Kris Jenkins: (41:37)

Right.

Jun Rao: (41:39)

For example, I was at a retailer once, and they had this issue: they have a pretty decent catalog of their products — this could be millions of items they are selling. They have lots of different types of information related to this catalog that they need to update and keep refreshing. This could be something as simple as the description of the product, but it could be other things like the pricing, the promotions, and various other aspects. So what they want is, as this information about a product is updated, they of course want to distribute that information to all the other places where it needs to flow through.

Jun Rao: (42:38)

This could be all the local retail stores, could be the online data stores, could be the distribution centers, could be other geographic locations. There are quite a few places this information needs to flow through. So they want to be able to capture that, and Kafka's a great fit for propagating that information. But they also have the same bootstrap issue. Occasionally, you have a new store, or you have a new instance of your online search index. So where do those systems get the initial data? Because initially, they have nothing.

Jun Rao: (43:27)

A traditional way of doing this is, okay, if you want to do the bootstrap, you go somewhere else. Maybe you do a scan of the database to get everything, then you switch to Kafka to do a more incremental thing to capture the subsequent changes. But that just means each of those applications needs to deal with two systems — they need to deal with two configs, they have to manage those. And a lot of databases actually don't want to do a big scan for them, because that impacts their other performance. So you always have this complexity you have to deal with. But with a compacted topic in Kafka, you actually solve this bootstrap problem and the real-time consumption problem nicely in a single system.

Jun Rao: (44:20)

Because in this case, what you can do is publish all the attributes you have associated with each product in a compacted topic. The key would just be the product ID. The value would be whatever the latest value is — the description, pricing, promotion associated with that product. So it naturally gives you a snapshot of the latest value of each product, but it also keeps track of the newer changes. Now, if you want to do the bootstrapping for an application, it's pretty easy, because you can just use the Kafka-based application you have been building for the real-time consumption.

Jun Rao: (45:07)

You only need to make one change, which is where you start the consumption from. You just set that to offset zero, and now you'll be consuming that topic from the very beginning. You'll naturally just be getting the latest value for each of the products you have, and that can be used to build your initial state. After that, you switch to the incremental consumption as new changes come in. So now you have simplified the architecture: there's a single system you have to deal with, whether you are doing the bootstrapping or the real-time incremental consumption.
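Here is a hedged Java sketch of that bootstrapping pattern: a consumer that replays a compacted topic from the beginning to build a local cache of the latest value per product, then keeps consuming new changes. The topic and group names are illustrative, and it assumes the topic was created with cleanup.policy=compact and keyed by product ID.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ProductCatalogBootstrap {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "catalog-cache");
        // With no committed offsets yet, start from offset zero.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> latestByProduct = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "product-catalog" is assumed to be a compacted topic keyed by product ID,
            // so replaying it yields roughly one latest value per product.
            consumer.subscribe(List.of("product-catalog"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        latestByProduct.remove(record.key()); // tombstone deletes the key
                    } else {
                        latestByProduct.put(record.key(), record.value());
                    }
                }
            }
        }
    }
}
```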

Kris Jenkins: (45:50)

Got you. So let me check I've got that, I think. You've got this system where the prices of things are changing all the time and you want to spread that information around the company. You stick every single change in a topic and then they can get real-time information. And by being a compacted topic, it will throw away the older versions of each product price for you as you go, right?

Jun Rao: (46:17)

Yeah.

Kris Jenkins: (46:18)

So it's just going to hold onto the latest versions and gets rid of that whole problem?

Jun Rao: (46:24)

That's right, because for a lot of applications, they don't necessarily care about the older value, they mostly care about the latest value associated with each key.

Kris Jenkins: (46:34)

Yeah, yeah. Okay. I follow that. Good.

Jun Rao: (46:37)

Good.

Kris Jenkins: (46:40)

Okay. So I have a couple more general questions before I let you go. One, I'll save the fun one for the end. So let me grill you just on one more.

Jun Rao: (46:52)

Yeah, yeah.

Kris Jenkins: (46:53)

I mean, you've seen Kafka from the start, from the early days at LinkedIn to today. How has it changed recently for people using it? What's your general feeling about the evolution of the project?

Jun Rao: (47:07)

Yeah. I think in the last few years, the biggest change is probably the cloud. Early on, we had the software we built in Kafka, and we have some additional software from Confluent which helps make adoption a bit easier, but people still had to operate it themselves. And this can be a pretty big challenge for a lot of enterprises, because their core competence and core business is not managing infrastructure.

Jun Rao: (47:47)

And so sometimes, it's actually not wise to use your engineers for managing things like Kafka. But with a SaaS service like Confluent — we provide Confluent Cloud, and the service is available on all three major public clouds — I think that's a pretty big game changer. Now Confluent is the operator of the system, so if you are adopting this event streaming technology, you don't have to manage the infrastructure.

Jun Rao: (48:24)

At Confluent, because we do this as a focused area, we have a lot of experience in doing things like managing the load, doing expansion, upgrading, balancing the data for you, so you don't have to worry about this. You can just use it and focus on your business needs, which is probably a better use of your engineering resources and time. So I think that's a pretty big change. The second thing I realized is that by moving Kafka into this cloud world, especially building it as a more cloud-native system, we are actually creating more opportunities for using Kafka.

Jun Rao: (49:28)

One of the significant changes we have is this infinite storage in Confluent Cloud. By integrating Kafka better with the cloud storage resources, we can actually provide storage capability that's truly infinitely extendable at reasonable cost, because we can leverage cheaper object stores like S3 in AWS from the cloud vendor. This actually provides quite a few benefits. It's beneficial in terms of cost and in terms of elasticity, because we separate the storage from the compute.

Jun Rao: (50:22)

But another thing it really opened up is — earlier, we said, okay, you can keep the data in Kafka maybe for seven days, but now you can actually truly keep it forever if you want. So why does this matter? I think it matters because in a lot of places, doing data integration is kind of hard. You want to clean it, you want to make sure you get enough coverage, and you also want to do this in real time. A lot of that is already happening in Kafka.

Jun Rao: (51:00)

So that often is your source of truth. And as a source of truth, in the most common cases you want to consume just the new things that are coming in, because it's pretty real time. But occasionally, for various reasons, it's actually convenient to be able to access some of the historical data. This can be for some of the bootstrap use cases I mentioned a little bit earlier. It can be used for doing some look-back if you have some application mistake and you want to redo some of your business logic to fix those mistakes. It can also be for some auditing use cases.

Jun Rao: (51:47)

In those cases, having the ability to keep the data in Kafka for a long time allows you to have this capability in a single system, within Kafka. Historically, in a lot of places people have this perception that Kafka is designed for real time, but you typically don't keep the data there for long. Now, for a lot of applications, what if I have some occasional historical need? In that case, maybe I have to build another system, maybe built on some object store, to hold this historical data, right?

Jun Rao: (52:32)

But then it just means you have to bridge the two separate systems and reason about that. Now, with the infinite storage capability in Kafka, potentially — because all the data is already integrated and flowing through Kafka in a lot of places — you can just say, oh, for some of the data, I want to have a little bit of historical access. I can just keep it longer: for months, years, based on your need. So it makes this sort of combined usage for the application, whether it's streaming in real time or streaming from some of the historical data, much easier to handle.

Kris Jenkins: (53:21)

Yeah, yeah. I think over the years, my two biggest problems with data storage I think are bad schema design and running out of disc space.

Jun Rao: (53:34)

Yeah.

Kris Jenkins: (53:34)

One of them is definitely on the developer, but to get rid of the other one would be nice.

Jun Rao: (53:40)

Yeah.

Kris Jenkins: (53:42)

Okay. Perhaps I should stop on the big picture, but there's one last question I really want to ask you. Wild card question.

Jun Rao: (53:49)

Yeah.

Kris Jenkins: (53:51)

Not counting anything you've already mentioned, what's your favorite feature of Kafka?

Jun Rao: (54:00)

Yeah. Yeah, I think there are quite a few interesting ones, but I would say probably the most interesting feature to me, especially considering all those use cases, would be the compacted topic — that would be one. Because in terms of usage, I think it's a pretty innovative one in how it's constructed. And when people realize its capability, you definitely solve a pretty large chunk of the use cases that otherwise would be harder to do. A second-

Kris Jenkins: (54:46)

Okay.

Jun Rao: (54:46)

A second thing I would say, if I can pick a second one, would probably be the replication feature in Kafka. This is pretty fundamental in terms of adding higher durability and availability. And with that, now if you use a Kafka cluster, pretty much everybody will be turning that on, because it does give you this better safety net. It allows you to have more trust in the data you're keeping in Kafka, so that you can build those mission-critical applications on top of it.

Jun Rao: (55:35)

It also facilitated the building of some of the higher-level data processing layers, like Kafka Streams and KSQL. Because in those cases, you may want to publish data partitioned based on a key. So you have to send data to a particular partition, and as a result, you want that partition to always be available, even when there are individual server failures.
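For completeness, replication is configured per topic. Here is a minimal sketch, with illustrative names, creating a topic with three replicas and a minimum of two in-sync replicas — which, combined with acks=all on the producer, gives the kind of durability described here:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCriticalTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 6 partitions, replication factor 3; a write is committed only once
            // at least 2 replicas have it (when the producer uses acks=all).
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```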

Kris Jenkins: (56:08)

Yeah. Yeah, you've got to have replication if you're going to shard things for scale, you've also got to have them replicated for availability, right?

Jun Rao: (56:17)

That's right. Yeah. For both availability and, of course, durability as well.

Kris Jenkins: (56:22)

Okay. I like that as a favorite feature because it's a big chunky one.

Jun Rao: (56:27)

Yeah. And especially doing this in real time, right?

Kris Jenkins: (56:30)

Yeah.

Jun Rao: (56:31)

Because that also... The biggest promise is from Kafka.

Kris Jenkins: (56:35)

Yeah. Yeah. And that's sort of baked into the fundamental way it's built rather than an afterthought.

Jun Rao: (56:43)

That's right. Yeah.

Kris Jenkins: (56:45)

Well, on that, I should probably let you get back to engineering.

Jun Rao: (56:50)

Yeah.

Kris Jenkins: (56:51)

Jun, thank you very much for talking to us today. It's been a pleasure.

Jun Rao: (56:54)

Yeah, thanks, Kris. Yeah, thanks for taking time with me.

Kris Jenkins: (56:58)

We'll see you again. Bye.

Jun Rao: (56:59)

Yeah, yeah. Bye-bye.

Kris Jenkins: (57:01)

And there we leave it with my brain and hopefully your brain filled with bits of Jun's brain. We could have taken more bits of brain, but that direction only leads to the zombie apocalypse. So let's stop there. If that whole conversation left you wanting to know more about Kafka in the real world, then take a look at Confluent Developer. It's full of useful information. There's going to be a deep dive course led by Jun himself. You'll find it all at developer.confluent.io. And meanwhile, if you do spin up a cluster on Confluent Cloud, use the code PODCAST100 and we'll give you $100 of extra free credit. When you use that code, they know we sent you. And that sets up a lovely positive feedback loop.

Kris Jenkins: (57:44)

That's not the only way you can be part of our feedback loop. Of course, you could always just get in touch. My Twitter handle's in the show notes. You can leave a comment, or a review, or a thumbs up and we'll use your feedback to help guide future episodes. And lastly, if you want to be on a future episode, if you've got knowledge to share with us or a story to tell, let us know about that too. We'd love to hear from you. And with that, it just remains for me to thank Jun Rao for joining us and you for listening. I've been your host, Kris Jenkins. We'll catch you next time.

What are some recommendations to consider when running Apache Kafka® in production? Jun Rao, one of the original Kafka creators, as well as an ongoing committer and PMC member, shares the essential wisdom he's gained from developing Kafka and dealing with a large number of Kafka use cases.

Here are 6 recommendations for maximizing Kafka in production:

1. Nail Down the Operational Part
When setting up your cluster, in addition to dealing with the usual architectural issues, make sure to also invest time into alerting, monitoring, logging, and other operational concerns. Managing a distributed system can be tricky and you have to make sure that all of its parts are healthy together.  This will give you a chance at catching cluster problems early, rather than after they have become full-blown crises.

2. Reason Properly About Serialization and Schemas Up Front
At the Kafka API level, events are just bytes, which gives your application the flexibility to use various serialization mechanisms. Avro has the benefit of decoupling schemas from data serialization, whereas Protobuf is often preferable for those experienced with remote procedure calls; JSON Schema is user friendly but verbose. When you are choosing your serialization, it's a good time to reason about schemas, which should be well-thought-out contracts between your publishers and subscribers. You should know who owns a schema as well as the path for evolving that schema over time.

3. Use Kafka As a Central Nervous System Rather Than As a Single Cluster
Teams typically start out with a single, independent Kafka cluster, but they could benefit, even from the outset, by thinking of Kafka more as a central nervous system that they can use to connect disparate data sources. This enables data to be shared among more applications.

4. Utilize Dead Letter Queues (DLQs)
DLQs can keep service delays from blocking the processing of your messages. For example, instead of using a unique topic for each customer to which you need to send data (potentially millions of topics),  you may prefer to use a shared topic, or a series of shared topics that contain all of your customers. But if you are sending to multiple customers from a shared topic and one customer's REST API is down—instead of delaying the process entirely—you can have that customer's events divert into a dead letter queue. You can then process them later from that queue.

5. Understand Compacted Topics
By default in Kafka topics, data is kept by time. But there is also another type of topic, a compacted topic, which stores data by key and replaces old data with new data as it comes in. This is particularly useful for working with data that is updateable, for example, data that may be coming in through a change-data-capture log. A practical example of this would be a retailer that needs to update prices and product descriptions to send out to all of its locations.

6. Imagine New Use Cases Enabled by Kafka's Recent Evolution
The biggest recent change in Kafka's history is its migration to the cloud. By using Kafka there, you can reserve your engineering talent for business logic. The unlimited storage enabled by the cloud also means that you can truly keep data forever at reasonable cost, and thus you don't have to build a separate system for your historical data needs.

