Do you have questions? My co-host, Gwen Shapira, has answers. We collect questions from across the internet, from Confluent Community Slack, Twitter, and YouTube, wherever we can find them, on topics like Apache Kafka, Confluent Platform, and Confluent Cloud, and on this show Gwen takes them on, one at a time. Let's join her now for another episode of Ask Confluent.
Hey everyone. Welcome to Ask Confluent, where we answer questions from the internet. I am your host, Gwen Shapira, and this is the first work-from-home edition of Ask Confluent. We have Anna here with us, in full social-distancing mode.
As you may have noticed, it's been a while. I think it's been eight months since I did my last Ask Confluent. The last one was filmed before the pandemic but aired after, so I got a gazillion questions: "Hey, Gwen, why are you and the team not socially distancing? Hey, Gwen, why are you and the team not wearing masks?" We filmed that one before the pandemic. This time it's during the pandemic, and therefore Anna here is being a good citizen and wearing a mask. We are all socially distancing, and for the first time ever on Ask Confluent, you're seeing my actual office and Anna's actual office.
Anna is here. I'm super excited to have you, and now you can see her face.
She's a technical account manager at Confluent. She spoke at Kafka Summit. She's an all-around amazing Kafka expert, very active on Twitter and at I don't even know how many meetups. It's just so exciting to have you on the show, Anna.
Thank you very much, Gwen. I'm stoked to be here.
Let's get started with questions from the internet. First question. Okay, I need to preface this by saying that I slightly cheated. I went on Twitter and said Anna is going to be on the show, and she's one of the smartest people I know, so I want really hard questions, please. So if you see hard questions, it may have been intentional.
Dominique Evans, who is, I think, an old Twitter friend (we've been chatting pretty much forever), asked: "If you could pick any one KIP from the backlog that hasn't yet been implemented, and have it immediately available, which one would you pick?"
Yeah. So for this one, I'm going to actually call it out: I would pick KIP-629, which is the one that starts to use racially neutral terms in our code base. That one, to me, is incredibly important. One of the...
[crosstalk 00:02:45] implementing it?
No, it's open right now. It's getting implemented though.
What are we looking for?
Yeah, I know, that's what I'm saying, exactly. It's the perfect answer. My mom actually is a linguist, and so language has always been very important to me: not only what you say, but how you say it. And there's absolutely no reason why someone should have to read those types of terms when they're just trying to do their job. So I am so stoked that this is happening in the Kafka community and in our code base. That would be the KIP that I would pick.
Wow, that's very socially responsible. Let's keep talking big. I would still pick KIP-500, anytime.
KIP-500 is actually made up of 500 KIPs. So is that a fair ask? I don't know.
Yes. That is a very fair thing to say. There are a lot. What about KIP-405? The tiered storage one.
Yeah. There's one thing about that that excites the heck out of me, and I think you brought this up on Twitter. One piece of advice we always give people is: do not colocate, and don't multi-tenant, when you have a batch consumer alongside a real-time one, because the batch consumer blows out your page cache and then tanks your SLA.
I think you had brought up that it's going to be implemented as just a network read. And so what that will end up doing, I think, is open up some of those use cases where you can support multi-tenant real-time without blowing out your page cache, which is really cool.
[inaudible 00:04:18] elasticity. It's like, this is what [inaudible 00:04:23] is all about. So yeah, I'm super stoked.
Next one, from Gosam. Are we able to arrive at any formula for identifying consumer and producer throughput in Kafka with a given hardware specification? So: I have this much CPU, I have this much RAM, here's my network, here's my disk. How much producer and consumer throughput will I be able to get?
Yeah. So for this, it comes down to partitions, right? We get questions like this all the time, and people want an easy answer, an easy formula, and they get very discouraged when our answer is hand-wavy. Right?
[inaudible 00:05:00] formula, it's just so complicated. There are so many different factors that feed into it, and many of them are nonlinear. So it can be a pretty complicated formula.
It would be. But I do think maybe what this question is really asking, and this is how I try to reframe it when I talk to people, is: how do I tell what my throughput can be, given this configuration? So, producer perf tests. The question this really comes down to is: someone comes to you and says, this is the throughput that I need. Right?
That's what ends up happening, right? "This is my throughput." That's where you start. And once you have that, if you know what the throughput is for a single partition, then you can extrapolate out to get whatever type of SLA or whatever type of throughput you desire. So I think you start with benchmarking a single partition.
So first of all, you start with benchmarking, yes, no matter what. But I'm thinking that in addition to the partition scale-out, other factors would be how many connections you have, because you might have a lot of clients pushing the same throughput. You'll have those tiny itty-bitty messages, which take up a lot of CPU. Do you use compression? That would be a big one; obviously that's a way of cheating on the network.
In general, given just those, I can't really tell you your throughput. I can tell you the best case, and I can tell you the worst case, I guess, but it would be a huge range: the worst case would be, like, one megabyte per second, and the best case maybe a hundred times that, right?
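The approach they describe, benchmark one partition and then extrapolate to the target throughput, is simple arithmetic once the benchmark number is in hand (for instance from Kafka's bundled kafka-producer-perf-test tool). A minimal sketch in Python; the numbers and the headroom factor here are made up for illustration:

```python
import math

def partitions_needed(target_mb_s: float, per_partition_mb_s: float,
                      headroom: float = 0.25) -> int:
    """Extrapolate from a single-partition benchmark to a partition count.

    headroom pads the target so the cluster is not sized at 100% of the
    benchmarked best case (benchmarks tend to flatter real workloads).
    """
    return math.ceil(target_mb_s * (1 + headroom) / per_partition_mb_s)

# Hypothetical numbers: a perf test measured ~30 MB/s on one partition,
# and the application needs 500 MB/s aggregate.
print(partitions_needed(500, 30))  # → 21
```

This only sizes the partition count; as noted above, connection counts, message sizes, and compression all bend the real number, so treat the result as a starting point, not a guarantee.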
Kaylee Shine said: may I ask if incremental cooperative rebalancing also works for the general Kafka consumer, in addition to Kafka Connect rebalancing?
This is so fun. But can we first just give a shout-out to Kafka Streams, which used the sticky assigner for a very long time before everyone else? Shout out.
And nobody knows about it. Also [inaudible 00:07:08] topic assignment, another feature nobody knows about. It makes a lot of things way better. Yep.
A hundred percent. And, you know, KIPs are amazing, wonderful sources of information. They're super fun to read. The consumer incremental rebalance protocol is KIP-429, which goes over this. And it even has, I love this, it even has pictures. I'm like, this is great, it's very visual. And so yes, the answer to the question is yes, first of all, and the details you can read about to your delight in KIP-429, which goes over what was needed, the metadata updates, how we do it, what the edge cases are. It's a fantastic source.
Confluent actually has two blog posts: one specific to Connect, and another that expands into both Streams and the plain consumer. I believe Sophie, who is a huge contributor to Kafka Streams, wrote the blog posts, so [crosstalk 00:08:16]
She's wonderful. I love Sophie.
A hundred percent.
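For the plain consumer, the answer above comes down to one setting, available since Kafka 2.4, when KIP-429 landed. A sketch of the relevant consumer property:

```properties
# Opt the consumer group into incremental cooperative rebalancing (KIP-429).
# The documented upgrade path uses two rolling bounces, listing the old
# assignor alongside the new one first, so mixed groups stay compatible.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor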
Okay, next one. It's one of those topics that just keeps on giving. We had Jason Gustafson explain exactly-once, and ever since, we keep getting more and more questions about exactly-once. We are going to try really hard; exactly-once is a bit difficult. And so the question from Mazin Esadeen: "Exactly-once processing is guaranteed under fault tolerance. For instance, say one of the servers, in the producers, the Kafka cluster, the Kafka Streams cluster, or the consumers, is down. Can exactly-once guarantee recovery? Is exactly-once still guaranteed?" And I love it, because he actually thinks about the entire chain as servers, including the producers and the consumers, which absolutely no one ever does, but it is absolutely the right way to think about this. So it's really nice. So, Anna, what do you think? If I crash a random server from the list, will I still get exactly-once?
I think, yeah, it comes down to the idea, and the Bay Area meetup talk that Jason did, I think it was maybe last summer, is amazing. That is such a good talk, it's fantastic. And he talks in there about the idea of these operations being atomic, right?
When you're talking about transactions, what ends up happening is that the whole operation is atomic: both the write, and also the commit of "yes, I wrote this, here's the point I'm up to, here's the point where this is safe." So you have an atomic commit: if both didn't go through, then it won't say "yes, this is accurate." So if you crash in the middle, it knows to roll it back.
Not to say that there aren't edge cases and poorly implemented applications; it's at their mercy. For example, in Kafka Streams, I spend my life saying this: do not introduce side effects. Because what people will end up doing is something like calling out to a REST service in the middle of their topology and doing an insert or something, and then wondering: it's exactly-once, why have I just reinserted the same thing 80 million times, right?
Also, why is it taking so long? Why is it blocking? There are a lot of failure modes.
So it's important that you read and understand what the guarantee is, right? The exactly-once guarantee is that a message is read exactly once when you use a read-committed consumer. If you throw a side effect into the middle of that processing, there's no exactly-once guarantee for it.
And that's the point. I think this is actually a tricky question, because the Kafka Streams cluster has the guarantee from the point it consumes to the point it produces. But the producers and consumers that you have really depend on your implementation. You have to define the transactional ID. You have to make sure that if the same producer goes away and comes back, it comes back with the same transactional ID. You have to do quite a bit of work to do that. Yeah.
Well, here's the fun part. This is something I've seen happen too: if you have a standalone vanilla consumer that's read-committed, it's up to you to guarantee that no messages that aren't transactional are written into the topic it's consuming from. Because if a message comes in and the metadata doesn't say it's transactional, that read-committed consumer will read it. And so.
I had no idea, really?
Yeah, well, the way that it works, at a very, very high level (it's fascinating when you go down this rabbit hole, and it's also why you can't replicate transactions), is that a message comes in, and it has a flag in the metadata that says, "I'm a transactional message." So the consumer buffers those up until it gets a control message, which can be either an abort or a commit. If it's a commit message, it delivers them. Now, if a message comes in and the metadata says "I'm not transactional," the consumer will just read it.
I thought that the consumer would just read it immediately and not buffer it.
Well, it will.
Now, I have to go try it out.
Well, it will read it immediately and not buffer it. And that's kind of the point: if you mix transactional and non-transactional messages in one topic and you have a read-committed consumer, people sometimes think, oh, it will only read committed transactions. It'll read both: it'll read anything that's not transactional, plus committed transactions. So I always tell people: don't mix. Don't ever mix.
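The buffering behavior described here can be sketched as a toy simulation. This is plain Python with made-up record shapes, not the real client protocol; it only models the delivery rule: transactional records wait for their producer's control marker, while non-transactional records pass straight through.

```python
from typing import List

def read_committed(log: List[tuple]) -> List[str]:
    """Toy model of a read_committed consumer's delivery behavior.

    Data entries look like (producer_id, is_transactional, payload);
    control markers look like ("CTRL", producer_id, "COMMIT"|"ABORT").
    """
    delivered: List[str] = []
    pending = {}  # producer_id -> buffered transactional payloads
    for entry in log:
        if entry[0] == "CTRL":
            _, producer, verdict = entry
            buffered = pending.pop(producer, [])
            if verdict == "COMMIT":
                delivered.extend(buffered)  # release the buffered batch
            # on ABORT the buffered payloads are simply dropped
        else:
            producer, is_txn, payload = entry
            if is_txn:
                pending.setdefault(producer, []).append(payload)
            else:
                delivered.append(payload)  # non-transactional: read immediately

    return delivered

log = [
    ("p1", True, "t1"),
    ("px", False, "plain"),    # sneaks past read_committed right away
    ("CTRL", "p1", "COMMIT"),
    ("p2", True, "t2"),
    ("CTRL", "p2", "ABORT"),   # t2 is never delivered
]
print(read_committed(log))  # → ['plain', 't1']
```

Note how the non-transactional "plain" record is delivered before "t1", exactly the mixing hazard Anna warns about.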
Yeah, I agree. Mixing is a bad idea; it becomes really hard to reason about. If you commit something, is it part of the transaction or not part of the transaction? It gets very messy. So I'm with you: just don't go there.
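On the configuration side, the "don't mix" rule maps to a pair of settings on the two ends of the topic. A minimal sketch (the transactional.id value here is made up):

```properties
# Producer: give every write a transactional identity, so all messages
# into the topic carry the transactional flag discussed above.
transactional.id=orders-writer-1
enable.idempotence=true

# Consumer: only deliver messages from committed transactions. Remember
# that non-transactional messages, if any slip in, are still delivered.
isolation.level=read_committed
```

The properties alone don't make writes transactional: the producer code still has to call initTransactions, then beginTransaction and commitTransaction around its sends.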
Okay. And another question from Mazin Esadeen, on the exact same topic. Given a stream of events E1, E2 that traverses the whole pipeline, as we just described, at times T1, T2 under an at-least-once guarantee, will the stream of events traverse the whole pipeline at T1 plus a hundred milliseconds, T2 plus a hundred milliseconds? So I think what he's asking is: will that delay of read_committed basically delay the entire stream in a uniform fashion? Because you were waiting for the first transaction to get committed, and then everything else just streams after that. I haven't tried it, and I'm kind of tempted to, but just reasoning about it, I would say yes: you'd get high latency for the first transaction, but everything else will already be in flight. By the time you get the first transaction, you will not get more latency for anything else.
Yeah, I would agree with that. Transactions come out one after another, so yes, I would agree with that. Though with this question, I'm having trouble figuring out exactly what they're asking.
But it allows me to plug my newest favorite [inaudible 00:14:23] in Kafka. I don't know if you've seen it, but the rough, rough draft implementation was merged this week, and one of the things it includes is a test framework which allows you to move things in a very predictable way. You can inject events, and it runs on one thread with one queue of events, so you can say: do a leader election, and now do a network partition, and now another leader election, and now check that the high watermark is all in place.
I'm thinking that this kind of test would allow you to check exactly that scenario: you have one transaction, and you put in delays to make sure it takes exactly so many seconds to commit, and then you see what happens to the next transaction in line.
Yeah, and I love that. That new testing framework is my favorite thing about it. Being able to deterministically trigger certain cases is huge. So I'm very excited about that.
I just became an evangelist for that test framework. You know that it came from FoundationDB [inaudible 00:15:23]?
No, I did not know that.
They did a talk about it at Strange Loop at some point, and apparently Jason got really inspired.
That is so cool. I'm excited about that.
Yeah, so now I'm trying to tell everyone about it, because I feel like it will just change the world of distributed-systems testing.
Yes, I agree. I think it's life-changing, at least for me personally, under quarantine.
So as you may have noticed, Confluent Platform 6.0 shipped today, with a preview, I believe, of cluster linking. And as one does, Tim did a video, and it looks like people are very excited about it. Sykrishna Boyina says, "Fantastic, glad to see this new feature. It will save huge time and complexity for replicating cluster topics." And I say: yes, I am also glad to see this feature; it will save huge time and complexity for replicating cluster topics. And I think you have the most experience ever in these kinds of scenarios, so tell us how cluster linking will change everything.
I think one of the things is the possibilities cluster linking opens up. Currently there is no way, with Replicator, MirrorMaker, whatever you want to use, to replicate and keep offset parity. And that causes a lot of issues when you try to reason between two topics when you're doing asynchronous replication. So, let's say, for example, and I'm field personnel, which is such fun, a good example is somebody who is providing data to another person, selling a data feed. If someone calls them and says, "look, I didn't get this message," the first thing they're going to ask is: "well, what's your lag? Let me go look." And the beautiful thing is that now you can deterministically reason: okay, this offset in my source is that offset over there.
Yeah, it's fantastic. That is really, really cool. To call out some of the things I've seen as Kafka grows, this just fits in so well. One of those is with something like ksqlDB, right? We get data scientists now who want to play around, and I always say: just like you wouldn't let marketing go nuts in your production Oracle cluster, it's probably not a good idea to let people go wild and crazy running anything they want, any queries or streams, in your production [inaudible 00:17:56] cluster. Not a good look.
And so the idea of having that type of almost real-time data science cluster sitting over here, which you could feed using cluster linking, and just kind of go nuts and try to find insights: this is perfect for that.
And I mean, I let everyone run whatever they want on any one of my 600 production clusters, sure. [crosstalk 00:18:20]
Yeah, some of the industries I work in would frown on that. But then again, we are like the best multi-tenant people in the entire world, right? So if you have protections, then maybe, but you still can't protect against everything. Even with the new KIPs that are going in, you can still have a consumer that'll tank a node. Even if you quota requests, bytes in, bytes out, everything under the sun, you could still have that happen. So I always say: better safe than sorry.
Okay, next question. This is from Viktor Gamov's Confluent live streams, from Anna Kovaleva, and I just want to say that I absolutely did not pick this question just because we have two Annas on the show, although who knows. Anna Kovaleva asked for an exact example of business metrics based on events. And I said, oh, I have someone who works with customers; I can probably get some business metrics based on events.
I think like if somebody asked me this question, I would say, what do you mean by business metrics? Because I think, that can mean a lot of different things.
So, let's interpret it a certain way. What do you think of when you hear "business metrics"?
I was also thinking exactly about that, because I was having a tough time finding any business metrics that are not based on events. I was thinking of, for example, the Uber S-1. They had a lot of metrics that describe their business, and one of them was the proportion of Uber rides out of all hired car rides. So you accumulate everything that's in taxis, all the Uber competitors, all the car rides in New York City, and ask: what proportion is Uber? That's a good business metric for Uber.
Yeah, and there are a ton of those types. So if that's how we're interpreting this question, I see them all the time, all day long. One of the things I see a lot is new sign-ups: we're tracking an event every time a customer signs up for this or that or the other thing, and then we can get a rate of new sign-ups. Campaigns: when people roll out new campaigns, what's the engagement rate? Using an event like that: how long did they stay on this web page? All of those kinds of business metrics come into play every day at the customer sites I go to.
Yeah, absolutely. I'm actually having a hard time thinking of any business metric that I couldn't translate into an event. Even the most financial ones, like revenue based on sales: a sale is an event. Like, how...
Not only that, but the absence of events. Those are the best. I always say that the absence of something can be just as valuable as its presence. So think about maybe somebody who signs up for courses: let's say there are four courses, and they sign up for three and never sign up for the fourth. Why? So tracking the absence of an event can be very valuable too, and [inaudible 00:21:23] Kafka Streams lets you do that.
Basically what all the [inaudible 00:21:26]
Right. So events, the absence of them, all of it you can do in Kafka and Kafka Streams.
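The absence-detection pattern Anna describes is, at its core, a periodic scan over last-seen timestamps; in Kafka Streams this would typically live in a state store updated by the processor and checked by a scheduled punctuator. A standalone sketch of that core check in Python, with made-up keys and timestamps:

```python
from typing import Dict, List

def find_silent_keys(last_seen: Dict[str, int], now_ms: int,
                     threshold_ms: int) -> List[str]:
    """Return keys whose most recent event is older than threshold_ms.

    In Kafka Streams, last_seen would be a state store written on every
    incoming event, and this scan would run from a punctuator scheduled
    on wall-clock or stream time.
    """
    return [k for k, ts in sorted(last_seen.items())
            if now_ms - ts > threshold_ms]

# Hypothetical course sign-ups: user-42 has gone quiet.
last_seen = {"user-7": 9_500, "user-42": 1_000}
print(find_silent_keys(last_seen, now_ms=10_000, threshold_ms=5_000))
# → ['user-42']
```

The interesting design choice is the clock: a wall-clock punctuator fires even when no events arrive at all, which is exactly what you want when the signal is silence.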
I love it. I mean, we should do an episode of just you explaining how to do it in Kafka Streams.
I can, yeah. If we ever did a "topology with Anna" episode, that'd be so fun.
I love topologies.
Okay. So Tim Berglund had an incredibly popular [inaudible 00:21:53], one episode just on that: the "What is Apache Kafka?" video. Apparently people got jobs based on watching this video; it ended up being quite popular. Andrews said, "You've copyrighted Kafka, what is the point anymore?" And I think it's a misunderstanding that we just need to clarify and move on, but I just couldn't let it go.
"We copyrighted Apache Kafka": the Apache Software Foundation absolutely has a trademark, and it has a trademark on the name Apache Kafka, which is what you're seeing here. It's their registered trademark. It also has copyright, the little C thingy, over the entire Apache Kafka code base. So the place is the Apache Software Foundation. They always had a trademark on Apache Kafka, and they've always had copyright over every single line of code in Apache Kafka. The code is under the Apache software license, but it belongs to the foundation, which is a nice place for the code of an open-source project to be. It gives us our beloved governance structure, with the PMC and committers and all that.
So Confluent absolutely never could, and never did, copyright anything with the word Kafka in it. This is like the most important rule of being a commercial player in the open-source space. The trademark belongs to the foundation, and the registered-trademark mark means that it's a registered trademark of the Apache Software Foundation. Always has been, always will be.
We had a Kafka summit recently, Anna, what was your favorite talk?
Yes, I loved that.
I adored it. He even gave me a preview of it. Time semantics is something that's wildly misunderstood, like, everywhere. I spend a lot of time explaining stream time versus wall-clock time, and how we tick windows, all of those things. And I want to call out the visuals, especially, in his presentation: they are fantastic. If you're a visual person, it isn't the usual table kind of thing that we've all seen so many times. It's beautiful, and it really does add a layer of understanding. I think if people had that, they would be so much more successful.
I'm with you. I always got into trouble by being unable to [inaudible 00:24:23] about time semantics in Streams, especially around joins. It can get incredibly tricky, and Matthias clarifying it was a fairly big deal.
I liked the talk from Bloomberg, because they built this entire platform around Kafka, and I really love seeing that. They went into a lot of detail on how [inaudible 00:24:45] Kafka across [inaudible 00:24:47] platform. I like it when people do that, when engineers take it as a mission to enable good things across an entire organization, and I liked her approach. And then from Confluent, I really liked Anna Povzner's talk, but I'm very biased; we were working closely together.
It was awesome. And she's the queen of Kafka multitenancy and I'm just lucky enough to know her. Absolutely.
And [inaudible 00:25:17] apparently was very inspired by Jason's morning keynote, and he's like, "Let's remove ZooKeeper." I think we can all go and say, yay.
I agree. Gwen, can I ask you something about the ZooKeeper removal?
Do you think this will help drive... one thing I've been hearing more and more about from customers is footprint size. Removing ZooKeeper and being able to colocate, to me, opens doors for a smaller footprint, and that's not something we talk about a lot. We talk about the ease of configuration, getting more partitions for your buck, being able to remove some of the partition limits, and things of that nature. But I think it would be neat to see what happens in the future with being able to decrease the size, like Kafka in a backpack.
Right. So on one hand, enabling IoT use cases and edge use cases is something I'm super passionate about. And I agree with you: in those edge use cases, where you never really need more than maybe two or three brokers, having ZooKeeper can be a bit painful, especially if it's something that's fairly low-throughput, lower utilization; there's just no need. But I'm also thinking that I wouldn't want people to think of small machines as the first thing, because, to be honest, none of us knows yet how many resources the quorum will take. What if it takes more resources than ZooKeeper?
Yeah. I mean, no, that's a good point. That's a good point. I tend to get over-excited.
I think I am excited about it too. But if embedding ZooKeeper with Kafka was a no-brainer in every case, we would have done it ages ago.
There are a lot of cases where it's quite risky. ZooKeeper basically dies if it doesn't get timely access to the CPU, and the next thing you know, everything that depends on it dies. Same thing for timely access to the disk. So colocating ZooKeeper with anything is very risky.
Now, the quorum may end up being a lot less risky. It has this log of events; maybe it will not turn on this very timely CPU access. But I haven't run it in production yet, and I would probably do it slowly and carefully. Okay. I think we are at the section of the show called "patting the team and each other on the back," where we just pick up some compliments from the internet to make us all feel better about doing our jobs. So this is about the "Apache Kafka goes global with cluster linking" video, which we discussed earlier.
George Leonard says, "Just about every time I watch a video of an announcement, I'm blown away. Awesome, guys." I want to think that he's blown away in two ways: the awesomeness of the feature we delivered, and the awesomeness of the video talking about it. Which do you think is more awesome?
I have to say delivery is everything. As I said, I think you could poorly deliver like the best thing in the world and vice versa to a point. But I think we always seem to hit the sweet spot and Tim is fantastic for that.
Fantastic. And cluster linking is world-changing, so definitely a very good combination. Tim Berglund gave the closing keynote at Kafka Summit, and Asking said, "Tim never fails to disappoint with analogies and passion." And I think it's meant to be a compliment; it read like a compliment when I first saw it.
I believe it. That's a high compliment. That's like the best compliment.
Never fails to disappoint is a compliment?
Yeah. Because he's never disappointed.
Oh, okay. So never disappoints is a compliment.
Oh, I see what you mean. I read that.
I read it a few times and now it seems a bit weird in my head.
So, like, does he mean he's always disappointed? Or she, or they, or them.
That's the point, right? But how can you disappoint if you have good analogies and passion?
It doesn't make any sense.
And the exclamation point is also another, yeah.
I think that guy was just too excited, so excited that he had trouble phrasing it.
Or a lady. We don't know; it's Asking, or they, or them. Whoever it was.
I picked up a compliment for myself. I also did a keynote on Kafka's architecture, where I talked a lot about my favorite KIPs. Abishek Gupta said it was a nice presentation and that tiered storage is a game-changer. And I think we just spent a good portion of the episode agreeing with Abishek Gupta.
On the same "What is Apache Kafka?" video, some people are actually paying attention to the content and not misunderstanding the trademarks; to be fair, trademarks can be complicated. Wilma said, "Very well explained, thank you for sharing it." And yeah, I basically have to agree with every word. And if you happen to be listening to this and you've gotten this far, and you feel like we're basically talking Greek, because this is a very advanced episode with a lot of content, and if you're not super familiar with Kafka you may feel a bit lost, then this is the video you really want to watch.
Thank you for surviving with us and sticking with it this far. But go watch the video; you will understand a lot more about what Kafka is about, and then you can come back and understand why cluster linking is game-changing.
And that's all we have for today. It was tons of fun. Thank you for all the answers, and for the amazing questions from the community. I'm really glad I called for extra, extra hard questions; this was literally one of our most advanced episodes. I really appreciate you, Anna, stepping up to the challenge and never failing to disappoint.
I got it.
Not disappointing us, answering all those questions, and sharing those really great customer stories. I really loved it.
It's great to see everyone who wrote in, who comments on our YouTube, all the people in the community just learning more and more about Kafka, [inaudible 00:31:51] to watch the new "What is Kafka?" video, or people with a lot of experience who want advanced topics like global Kafka. Keep up with the learning. We're all in this together.
I agree and stay safe everybody. We'll get out of this soon. One day at a time.
One day at a time, that's the new motto.
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value on the expiration date will be forfeited, and there are a limited number of codes available, so don't miss out.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribed through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So, thanks for your support, and we'll see you next time.
It’s the first work-from-home episode of Ask Confluent, where Gwen Shapira (Core Kafka Engineering Leader, Confluent) virtually sits down with Apache Kafka® expert Anna McDonald (Staff Technical Account Manager, Confluent) to answer questions from Twitter.
Find out Anna’s favorite Kafka Improvement Proposal (KIP), which will start to use racially neutral terms in the Kafka community and in our code base, as well as answers to the following questions:
They also answer how to determine throughput and achieve your desired SLA by using partitions.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question, and we'll hope to answer it on the next episode of Ask Confluent. Email us.