Third time, the triumph of hope over experience on your part I suppose, but I'm glad that you're willing to return. Anna, last year's Halloween episode was your idea. It was like your first day on the job or something like that when it aired. It was all pretty cool and you had come up with some scary JIRAs to frighten us into having better knowledge of how Kafka works I think was the plan, and it seems you've done that again.
Yes, and this year, since I've spent a year working with real customers who are using Kafka, these scary JIRAs are, many of them, things customers have actually hit, so they're terrifying in practice.
It is actually your job to help people who are having trouble with something get past it, so you know literally where the bodies are buried. You know where the skeletons are, you know where the zombies are, the other sorts of undead creatures, werewolves, any of these things, you encounter them-
You fight them. You are kind of like our own Van Helsing, I think [crosstalk 00:02:14]-
That's true [crosstalk 00:02:15]-
True story, I have Von Willebrand's disease, which is like a bleeding disorder, like a really light case, and for years-
Literally didn't know that.
I thought... it was named after the vampire hunter. I don't know why. When I was much younger-
You thought the name of the disorder was the same as Van Helsing?
I thought it was after the vampire hunter because it had to do with blood.
Yeah. Wouldn't you kind of [crosstalk 00:02:37]-
Prefer that? Blood, right? Wouldn't you sort of prefer-
That they be the same?
I would because then it sounds much cooler than being like I have problems making the Von Willebrand Factor, which is kind of dull.
That, like nobody cares but, "I have Van Helsing's disorder," like, "Oh, well, you might have like some vaguely Steampunk-y crossbow that you're going to kill me with if-
I don't watch myself.
Exactly. Yes, it's all about the Steampunk.
I know I'm kind of like mixing literary traditions there, but I don't know, one imagines Van Helsing having some sort of weaponry that is period-
Like a crossbow.
Kind of high-tech in some surprising way, which actually is kind of a lot like you now that I'm really thinking about it. Anyway, before we get into the JIRAs, this is the second time you've done the very special Halloween episode of Streaming Audio. Is Halloween a big deal at the McDonald household?
Well, yes, for the kids. I have a standard costume I wear. I'm a burglar every year because you can wear like a black sweater and a knit hat, so I feel like [crosstalk 00:03:43]-
That scales. It's great, but yes, my kids tend to pick costumes that you can't buy in a store, so we do spend quite a lot of time on Halloween preparing.
Excellent. I'm kind of wondering how trick-or-treating is going to be this year with the pandemic. You figure this is a relatively easy day on which to cover your face, like your face might be covered anyway. You'll probably see a lot of surgeons, sort of the convenient costume for wearing a mask, and I'll figure out some sort of mask I can [crosstalk 00:04:14] wear, but I'm not sure what.
Well, I'm going to be honest, I think the mask compliance in some areas will be higher on Halloween than during a normal day.
I think you're right, I think you're right, and I've got like a pretty good Mal Reynolds costume. I've got a pretty decent Jack Burton, which is not seasonally appropriate. It's typically cold on Halloween in Denver, and I'm thinking of adding an Obi-Wan-
Sort of Attack of the Clones/Revenge of the Sith-era Obi-Wan. I've got the beard going on now, so I can kind of do that, but anyway, some Halloween thoughts.
Me, too. Me, too. It is your job at Confluent to help actual customers who are having trouble with something not have those problems. You guys on your team, you have all of these cool acronyms and, Roger, public apology, I just can't remember them. I need to make a cheat sheet or something. I don't do well with this. I still think of you as a TAM, or technical account manager, Anna, but your actual title is-
A CSTA, and I like to say we're CSTA because we help our customers be able to take naps.
That might help me remember, it might not, it might not to be fair, but CSTA, C-S-T-A, which is Customer Solutions Technical Architect?
That is awesome. Yeah, I don't think I get that right as much as you have.
Yeah, yeah. Imagine if we were habitual cohosts and 50% of the time I had to introduce you as Customer Solutions Technical Architect. I wonder how often I'd get it right? Probably not much, but anyway. You want to start scaring us?
I do. Okay, so I've picked out some really good ones this year, and the one I'd like to start with is KAFKA-9144. I'm subtitling this Early Death Causes Epoch Time Travel.
That's my subtitle. The actual title is Early Expiration of Producer State Can Cause Coordinator Epoch to Regress.
Which is nothing if not prosaic. I think I like your version better.
Yeah, right? This is very interesting, so we have seen a lot of people adopt the upper-level Kafka APIs and just kind of more complex use cases with Kafka. Transactions help enable those use cases, so I'm kind of a little bit partial to anything that kind of bothers transactions. I think they're fascinating. True story, one of the only times that we diverge from the truth of the log in Kafka is in producer state: we take snapshots, and we do that because in Kafka Streams, you expire things from the log very quickly, like in a changelog, when you're doing a repartition.
Lo and behold, we were very surprised to find out how quickly producers were expired, so instead, they take a snapshot. That snapshot at any point in time doesn't have to match what's in the log, so if anyone ever tells you, like, the truth is the log all the time in Kafka, you can be like, "Not producer snapshots," boom.
Okay, and when are producer snapshots taken? Just to make sure everybody's got the background, when, and for what purpose?
They're taken at intervals, and really, it's because sometimes we, like, overeagerly clean the log. When that happens, you end up getting these warnings, and anybody who's used EOS and Kafka Streams will be aware of this. You'll see these annoying warnings in the log that say, like, "Unknown transaction ID," and it's a retryable error, but it's very annoying and it's a lot of overhead. They were like, "You know what? We're just going to snapshot it, because we know it's still safe to know that this transactional producer is known to us. It's just because of how quickly we get rid of stuff in the log when we don't need it anymore for certain operations. It's just a good idea to keep it." And it's taken at different intervals.
Gotcha. What happens in this terrifying JIRA?
Okay, so the bottom line is producer state expiration. When does that happen? What we ended up doing was we said, "Okay, we're going to expire producers like based on the last timestamp, which for all intents and purposes that seems pretty reasonable. You're going to expire a producer based on, "Well, it hasn't produced anything in this long." The issue comes in when there's a failed transaction or somebody goes rogue, so like you have a zombie coordinator because transactions, just like consumer groups, have coordinators. If a coordinator goes rogue, it's a zombie, or a producer goes rogue, it's a zombie. Then you start to introduce these very interesting edge cases, and this is one of them.
What ends up happening is the timestamp that's used to say, "Hey, should we expire producer state?" It gets set to -1 because there's nothing in the log for the last message because this transaction perhaps got aborted. The one that it goes over first in this JIRA is a producer writes transactional data and it fails. The coordinator sees that and is like, "Well, it failed, so it's not a committed transaction," and it writes these abort markers. Then, it bumps the epoch, and the epoch basically, and this is fascinating, too, I looked this up and I actually tweeted about it, too, because I just want to know if anyone's ever hit this. The epoch is just a monotonically-increasing number and we do that in order to fence zombie coordinators and zombie producers.
The maximum value for that is short max value minus one, so you can only have 32,766 epochs for a given producer ID, and I'm just fascinated [crosstalk 00:10:04]-
Yeah, like has anyone [crosstalk 00:10:07]-
Does it get put in a header or message or something somewhere? I mean, it's-
No, it's actually-
When you're producing?
Yeah, it's actually stored right in the producer's state locally, so it knows and it does send it. If you look, there is a really nice illustration of this issue in the JIRA by none other than Jason Gustafson, and so Kafka has these guarantees when it looks at epochs, and that's why I'll get into this, this bug caused some really bad things. For instance, one of the causes was a node that would not start up, so this is like a thing that impacted brokers, which is never a good thing, right?
That epoch is kind of assigned and kept and that's part of producer state, so as long as the producer state is constant, the epoch is kept. It's kind of assigned in the group metadata and assigned in the coordinator and it knows what it is.
Okay, and that epoch doesn't advance very frequently, then?
Well, yes, it can-
Oh, does it?
Basically, what happens is, and this is what we're talking about, if a producer dies in the middle of a commit, if something's gone rogue, we always wonder, "Is it a zombie? If I can't connect to this coordinator, I can't connect to this producer, it could be a zombie, in which case it doesn't know that it's dead. It's alive but not really-
So we bumped the epoch to make sure that it can't cause any funny business. We're kind of fencing it. Anytime we see that behavior, we bump the epoch, and when you bump the epoch, that means the old zombie has the earlier epoch and can't do any harm because we check that before we [crosstalk 00:11:43] commit a log, right? Yeah.
Detect it as the activity of an undead creature?
Exactly, totally, and that guarantee holds while that producer state is not expired. When do we expire producer state? Well, we look at the timestamp. All of this sounds really good and awesome and fencing is great and we love it. However, the problem is what we used to do is we used to use the timestamp from a message, the last message. Well, if there isn't a last message, it was setting the timestamp for this producer to -1, which means, "Oh, hello-
Okay.
"I'm the producer state cleanup. It's -1. I'm going to clean it up." Remember, now what we have is we have a situation where something has gone wrong. The coordinator has bumped... This is twofold. In the beginning, they thought this was like an... Basically, I think it was Jason who said this, it takes an alignment of planets to hit this bug, but later on we found an even more common case for it. This one is the coordinator, so a producer fails. Then, the coordinator will see that and go, "Oh, I'm timing out this transaction because it's failed, obviously, and I'm going to write an abort."
It bumps the epoch to make sure that that producer is kind of quarantined, but then the coordinator itself becomes a zombie, so the coordinator's going to keep trying to send. The old coordinator, it's going to go, "Well, I don't know if it failed or not. I don't know if it failed or not." There's a new coordinator elected. Now, when a new coordinator is elected, if the producer state is still there, it should know what the epoch is. However, because it was -1, it says, "Oh, I don't have a new producer state," and resets the epoch. All of a sudden, the protection that you have from that monotonically-increasing number is gone.
What ends up happening is that abort, the new coordinator will actually bump that epoch and write a new abort message saying, "No." That's going to have the later epoch. I'll read an example with numbers in one second so it will make a little bit more sense, but that zombie's still out there. The epoch has been reset to zero, so there's no protection anymore, and all of a sudden, what you see in the log is "Out of order epochs." That, before this fix went in, this change, is the death of everything, so-
It'll cause a broker not to start. It's not a consistent state. It will be like, "Whoa, whoa, whoa. No. Absolutely not." It was crashing brokers. The way that this shows up a little more commonly, it can result in a hanging transaction, and this is a great JIRA to read because of the example here. You've got a producer and it opens a transaction, and let's say the epoch is 18. Then, it loses communication with the cluster. The coordinator sees this and it's like, "Oh, I'm aborting this transaction, and I'm going to bump the epoch because this producer has obviously gone out to lunch."
When it does that, the producer state, that timestamp, gets set to -1, producer state is cleaned up, which means all of a sudden the epoch goes back to, like, zero. Then, that producer zombie, who's at 18, is like, "I'm back from the dead and there's no protection." It can accept the write, and what you end up seeing in the logs, which you should never, ever see for a Kafka producer ID, is a producer epoch of 17, one of 19, and then 18 when the zombie comes back in. That [crosstalk 00:15:54] cause-
There's nothing you can do.
There's nothing you can do at that point. That's really bad, and so it shows up when you have zombie coordinators and when you have like a zombie producer for transactions. It's just not good at all. It all comes down to the fact that the timestamp that we're using was based on a message, and if the message isn't persistent, you've got no timestamp. Can you guess what the fix is? It's-
No, I can't.
Use a timestamp that's always there. The fix is we use timestamps from the transaction markers as well because, remember, in both cases, we're bumping the epoch and we're committing an abort. When you use that transaction marker's timestamp, it prevents early expiration of producer state, which preserves that monotonically-increasing guarantee. And a nice touch: Jason Gustafson also changed validation logic, so, like, coordinator epochs are only verified for new requests. What that means is people could patch and bring up those nodes. We used to fail out for that, but now we only log a warning, so it was a way that this patch both fixed the issue and allowed recovery for people who had hit it.
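To make the mechanics concrete, here's a very rough sketch of the two moving parts described above, epoch fencing and timestamp-based producer state expiration, with the bug and the fix side by side. All class and method names here are invented for illustration; the real logic lives in the broker's producer state management and is considerably more involved.

```java
// Rough sketch of epoch fencing plus producer state expiration (invented
// names, heavily simplified from the broker's actual logic).
public class ProducerStateSketch {
    static final long NO_TIMESTAMP = -1L;
    // Epochs are shorts on the wire, so they top out at Short.MAX_VALUE - 1 = 32766.
    static final short MAX_EPOCH = Short.MAX_VALUE - 1;

    private short currentEpoch = 0;

    // The coordinator bumps the epoch to fence a suspected zombie.
    public short bumpEpoch() {
        if (currentEpoch >= MAX_EPOCH) {
            throw new IllegalStateException("producer epoch exhausted");
        }
        return ++currentEpoch;
    }

    // Any write carrying an older epoch is from a fenced zombie and is rejected.
    public boolean isFenced(short requestEpoch) {
        return requestEpoch < currentEpoch;
    }

    // The bug: expiration keyed only off the last data message's timestamp.
    // An aborted transaction can leave that at -1, so the state (and with it
    // the epoch) gets wiped immediately, and a new coordinator restarts at 0.
    public static boolean shouldExpireBuggy(long lastDataTimestamp,
                                            long now, long expirationMs) {
        return now - lastDataTimestamp > expirationMs;
    }

    // The fix: transaction marker timestamps count too, so an abort no longer
    // looks like "no activity ever".
    public static boolean shouldExpireFixed(long lastDataTimestamp,
                                            long lastMarkerTimestamp,
                                            long now, long expirationMs) {
        long last = Math.max(lastDataTimestamp, lastMarkerTimestamp);
        return last != NO_TIMESTAMP && now - last > expirationMs;
    }
}
```

With a -1 data timestamp but a fresh abort-marker timestamp, the buggy version expires the state instantly while the fixed version keeps it, which is exactly what preserves the monotonic epoch guarantee.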
All right. That is scary. That's only our first one, and by the way, we have five, folks. There are five of these that we're talking about today, and I'm already a little off, I think, but let's go to... Let's see, the next one, KAFKA-9184, famously.
Attack of the Clones. That's what I'm calling this. I was so excited when I thought of that, and this one hit so many people, and it was-
It's kind of similar, isn't it?
Yeah. Well, no, this one is not. This one is bad in different ways, but just as scary.
The thing with Connect is, historically, there have been stop-the-world rebalances for unrelated items, as I like to say, so like if you-
Like I want to reconfigure a connector and-
Or deploy a new connector that's not [crosstalk 00:18:06]-
That's not like that [crosstalk 00:18:06]-
Related, like, "Why do we have to-
"Oh, goodness", yeah-
"What were you thinking?"
Oh, you know what? People improve. As we go along, we think better. When you know better, you do better, right? Great idea. Let's do Incremental Cooperative Rebalancing. Fantastic. That means that these tasks and workers can keep their assignments, so you only rebalance what you need to when you reassign. You don't have to pay the penalty for computing assignments again. Just a great idea, fantastic. People were excited. The world was on fire. It was awesome.
I remember it, yeah. No, I was-
Setting some of those fires.
Right. We were excited, and unfortunately, there was a problem, two problems, actually, which was fantastic because I didn't even know about the second one. The first problem, and this came up in Replicator, is where I saw this first. I got a call. They said, "People were seeing duplicates," and I was like, "Okay, duplicate what?" Then, the first question I asked, as I always ask, is, "Is it for everything? Or is it only for certain partitions?" It was only
For certain partitions. I went... Then, we did some more research on it internally and discovered that we had multiple tasks that were assigned the same partitions. Not a good thing. When you're doing like a-
Not at good thing at all.
Yeah. No, when you're doing a sink connector and you're pulling data down and you're putting it somewhere, you should not be running more than one task per partition. Each partition should be assigned to one task. We're not replicating partition 0 eight times. That's generally bad.
Right, right. You will, for sure, duplicate data.
Right, of course, and so what happened is when the JVM didn't restart, a zombie worker was able to keep its tasks running. When they rejoined the group, because of the new Incremental Cooperative Rebalancing, they still had their assignment. All of a sudden now, the coordinator just saw a new member that already had assignments and was like, "Go for it, bud. You're still running." But it had known when that worker fell off, "I have to reassign these partitions," so all of a sudden now, you've got two tasks running with the same partitions, which is not good, obviously.
Got it. No, not at all. Okay, so I'm tracking... By the way, we didn't give the full title of the JIRA. You said it was Attack of the Clones. Title of the JIRA is Redundant Task Creation and Periodic Rebalances After Zombie Worker Rejoins the Group, and that applies to the Connect group.
Brains. I just want to say brains because it's a zombie.
Brains, yeah, absolutely. Brains.
Brains. Okay, go on.
Right, and so the fix was that each individual worker tracks its connectivity now in a non-blocking manner. It basically tells the worker and the worker becomes sentient and it knows now that the coordinator's unreachable. Prior to this, the workers were kind of like, "I got my assignments. My JVM didn't restart. I'm cool." Now, it knows enough that it needs to actually communicate, which it does when you're using Cooperative Rebalancing. You got to cooperate. If the connection is inactive with the coordinator, it will proactively... the worker now owns the responsibility of stopping its connectors and tasks and not just hanging onto those assignments.
Here's the other part that surprised me because I did not know this. When this happened, it's like a double whammy. When this happened, the assigner that kind of says... basically what it does is it says, "Okay, here's the delay. Then, we're going to try to rebalance again." That delay was also not getting set to zero, so all of a sudden you also had periodic rebalances for no reason, which is obviously not the point of introducing Incremental Cooperative Rebalancing.
Right, right. That's going to be much worse.
Yes. This is terrifying. I love this one. That was also fixed, obviously.
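The idea behind that fix can be sketched very simply. Everything below uses invented names, not the actual Kafka Connect code: the point is just that the worker itself, rather than only the coordinator, takes responsibility for stopping work when it can no longer reach the group.

```java
// Sketch of the idea behind the fix (invented names, not Connect's code):
// instead of holding onto its assignments blindly, the worker watches its
// own connection to the coordinator and proactively stops its tasks when
// that connection is gone, so a zombie can't keep duplicating work.
public class ZombieAwareWorker {
    private int runningTasks;

    public ZombieAwareWorker(int assignedTasks) {
        this.runningTasks = assignedTasks;
    }

    // Called on each heartbeat result; in the real fix this tracking is
    // done in a non-blocking manner.
    public void onHeartbeatResult(boolean coordinatorReachable) {
        if (!coordinatorReachable) {
            stopOwnTasks(); // don't wait to be fenced: stop before duplicating work
        }
    }

    private void stopOwnTasks() {
        runningTasks = 0;
    }

    public int runningTasks() {
        return runningTasks;
    }
}
```

Before the fix, the equivalent of `onHeartbeatResult(false)` effectively did nothing, so a worker whose JVM never restarted kept its old tasks running alongside their replacements.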
Of course. Okay. Are you ready for the next one?
I am ready for the next one.
Okay. This is a ZooKeeper JIRA.
What do you call it?
I threw a curveball. I call this Why Can't I Upgrade, I'm in Hades, but yeah.
Less evocatively known as Fails to Load Database With Missing Snapshot File But Valid Transaction Log File.
Yeah, right. One of the things-
I like your title better.
Right, yeah, and this happened to a couple of people and there may be people listening to this podcast who are like, "How could you have a ZooKeeper without a snapshot file?" Well, believe me, you can. I've seen it with several customers, so anytime you don't have a lot of activity on ZooKeeper, there's this thing called snapCount. That snapCount says, "This many transactions have to occur before I take a snapshot." As one might imagine, pretty self-explanatory, and if you don't reach-
That snapCount, it doesn't take a snapshot. ZooKeeper, in its infinite wisdom, when you need to upgrade it, decided to require a snapshot, so when people tried to upgrade without a snapshot, it failed. My favorite thing about this JIRA is there is a null snapshot file attached to it, and it's like, "Just download this null snapshot file to your [crosstalk 00:23:52]-
It's 400 bytes.
Yeah. Sure. Just download it. I'm sure it's cool. It's legit.
Right, and so it's just an empty snapshot file, and then they say, "Set snapshot.trust.empty equal to true." This works apparently for some people, but it doesn't work for a lot of people, and I've seen at least one case where someone did this and for whatever reason, it did not rebuild off their valid transactions and they lost data. What did they lose? Metadata, things like topics. What are your topic names? What are your partitions? Right? All of the stuff that's durably stored in ZooKeeper still, which is not good, and so-
Working through this, and there isn't like a fix for this, but there's a workaround that I recommend. I'm just going to tell people because I think people, for example, who want to use TLS, they want to upgrade ZooKeeper because that was a long time coming. What I recommend people do is, believe it or not, snapCount is configurable, so I recommend you go in, you lower your snapCount to something that's reasonably low, and then force it to take a snapshot before you upgrade.
What's reasonably low?
I don't know, like 50. It depends on what your flow is, right? Depends on how much data you got-
Running at a topic. Again, this is really kind of a case that hits people in lower environments where there's not a lot of activity... which is where you should be upgrading first, thank you, please and thank you. It's 50, 25, and then you can even force the creation of it by running a script, doing configuration changes, creating prefixed dummy topics and then deleting them. Whatever you need to do to get that amount of activity going to ZooKeeper.
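For illustration, the workaround boils down to something like this in zoo.cfg (50 is just the low example value from above, not a recommendation; tune it to your environment):

```properties
# zoo.cfg: number of transactions before ZooKeeper rolls its transaction log
# and takes a snapshot (default is 100000; lowered here to force one soon).
snapCount=50
```

After restarting with the lower value, drive enough activity to cross the threshold (dummy topics created and deleted, for example), confirm a snapshot file has actually appeared in the ZooKeeper data directory, then restore your normal snapCount and proceed with the upgrade.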
Got it. Okay. The other fix to this is removing ZooKeeper altogether covered under [crosstalk 00:25:40]-
It's so exciting.
Not a JIRA but KIP-500 and, boy, I bet y'all CSTAs are going to be pretty happy about that.
Yes. Yes we are, and the thing I'm most excited about is the new testing framework, the DES Testing Framework, where we'll be able to actually do like deterministic event... testing network partitions. Right now, it's only implemented in the new like Raft algorithm, but they're going to put it in for like coordinator and put it in for all of the nodes and all of these kinds of things, so we'll be able to test all of these situations where we want to make sure that in the event of a network partition between this and this, we can make sure that the right leaders are elected, that ISR looks good.
I'm very, very excited about that, and so I pinged Jason I think earlier this week and I was like, "So, hey, it's really awesome and cool. When can we have it in the controller? Can we have it in the controller and all of the coordinator stuff? Like soon, hopefully?"
Yeah. Kind of just for [crosstalk 00:26:43]-
I mean, I don't think it's [crosstalk 00:26:44] going to be soon [crosstalk 00:26:44] but that is going to be so... it's going to be so much better in terms of testing coverage and being able to actually test deterministically specific scenarios and Kafka that we know we need to validate. It makes me feel much more confident about removing ZooKeeper, much more confident, so I'm actually really, really excited.
Good, good. All right. I wouldn't have guessed you wouldn't be. Now, continuing the horror, we have KAFKA-9729. What is your name for this?
So 9729, I actually went almost true to the title. I said Shrinking inWriteLock to Avoid Maiming Cluster Performance.
Okay, I want you to name JIRAs basically from now till the end of time. Ladies and gentlemen, the actual title of KAFKA, pardon me, of JIRA KAFKA-9729 is Shrink inWriteLock Time in SimpleAclAuthorizer-
That's less awesome. Anyway, Anna, tell us about why this is terrifying.
Well, one of my favorite things is like I was a Principal Software Developer for like 16 years, and so I used to think a lot like, "Hypothetically, this could cause a problem." Well, now, I'm in the field and when I read something in a JIRA, it makes me giggle. Certain kinds of wording, and my favorite one in here is "Blocking produce and fetch requests for two seconds would cause apparent performance degradation for the whole cluster." When I read it, it makes me giggle because I'm like, "You think?"
It's like, "Well, yeah, it might."
Yes. We have people who have an SLA of 400 milliseconds, like-
You know, not to mention
Two seconds could be a problem.
Right. I know, but I love it because it's like you can see like... I don't know who actually authored this, but I love them. Thank you for writing this up in this fashion because it just made me smile. It made me remember when I used to like postulate like, "That could probably be an issue." It's just fun to see the other side as I have like over this past year, to see like, "Oh, no, it is an issue." Oh yes it is. That was fun. I enjoyed... That's my favorite line from this entire year.
This is interesting. ACLs are one of the last remnants of things we actually have to have direct connections from the nodes to ZooKeeper for. They don't go through the controller at all. What ended up happening is, the way that the Authorizer is written, to remove and add ACLs, you need a write lock, which makes sense. There's absolutely no doubt that seems like a good idea. However, when you do get ACLs, which you need to do on produce and fetch requests to say, "Hey, can I read this?", we were using a read lock. Those read locks cannot be obtained when the write lock is in effect.
Okay. We have a word for this. I mean, that's not just performance degradation. That sounds like a deadlock. Am I hearing you correctly?
Well, yeah. It eventually clears, but yeah, it's blocking. I think it would be fair to call it blocking. I don't know if I would-
Necessarily say it was like totally murdered and deceased.
It's definitely blocking, and so what ends up happening is all of those produce requests that need to make sure that they can produce to that topic, they just get backed up, right?
Behind this write lock, and so this really comes into effect because ZooKeeper is very sensitive to latency. They did a simulation, and if there's a 500-millisecond delay, which, by the way, is a hundred percent realistic in something like a stretch cluster or somewhere you've got network latency. That's just a law, the network is never perfect, ever. A 500-millisecond delay on the ZooKeeper side would hold the write lock for two to 2.5 seconds.
Right, and that's where the apparent performance degradation for the whole cluster [inaudible 00:30:57].
A little bit.
I mean, it could hypothetically, or for real, like that yes, yes, again.
What they did, very sensible, is they just created an immutable structure and used a reference to that to obtain a consistent view. Then, you don't have to do a read lock. You don't have to take a lock at all, and by the way, that's only in 2.6, so most people probably aren't on 2.6 yet, so-
This is current. This is hip. This is a hip JIRA that's scary.
Right, so something to check, something to... I was surprised when I found this one. It was like the shrink that got me. I'm like, "Shrink is a very interesting word to use in a JIRA. Let me investigate this."
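The immutable-structure fix described a moment ago is the classic copy-on-write pattern. A generic sketch, not the actual SimpleAclAuthorizer code, looks something like this: readers dereference an immutable snapshot with no lock at all, while the rare writers build a new map and swap the reference.

```java
import java.util.HashMap;
import java.util.Map;

// Copy-on-write sketch (generic illustration, not Kafka's authorizer code):
// the hot read path takes no lock; writers publish a fresh immutable snapshot.
public class CopyOnWriteAcls {
    // volatile so readers always see the latest published snapshot
    private volatile Map<String, String> acls = Map.of();

    // Hot path (authorizing produce/fetch requests): no lock at all.
    public String get(String resource) {
        return acls.get(resource);
    }

    // Rare path (adding/removing ACLs): copy, modify, and swap the reference.
    public synchronized void put(String resource, String rule) {
        Map<String, String> next = new HashMap<>(acls);
        next.put(resource, rule);
        acls = Map.copyOf(next); // publish a new immutable view
    }
}
```

The design point is that readers never wait on writers: a slow ZooKeeper write can stall the writer all it likes, but every produce and fetch request keeps authorizing against the last consistent snapshot.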
You think of like shrunken heads, right? That's kind of Halloween-y. I've seen decorations with those.
Exactly. No, it's incredibly Halloween-y. I think perhaps nothing more Halloween-y. Okay. Are you ready for the next one?
I am ready for the next one.
Okay. This is good old KAFKA-9232, and I'd like your name first because your names are better. What are you calling 9232?
Okay, I think I said Failure to Revive Older Generations. It's kind of like DNR, Do Not Resuscitate.
That's dark. Okay.
Well, I mean-
The formal title-
Yeah. Read the formal title.
Coordinator New Member Heartbeat Completion Does Not Work for Join Group V3. Is this consumer group coordinator?
Yeah? Okay, so Coordinator New Member Heartbeat Completion Does Not Work for Join Group V3, KAFKA-9232, and you said something about previous... Say your title again?
My title was Cannot Resuscitate Older Groups because if you can't-
I love it. Oh sure.
Right, you can't-
Complete a heartbeat, you're dead.
Right. Right, right. You're going to die, and that's very scary. Okay, so walk us through this and give us whatever little bit of background we need on the consumer group coordination protocol to make this make sense.
Right. One of the things that I think people know is, like, when you look at the protocol version that's used between a consumer and a broker, it's your lowest common denominator. You could be using, like, the latest clients, but if your brokers are an older version, you're going to downgrade to whatever version the coordinator that's sitting on that node supports. It's always important when people look at these things not to go, "Oh, well, I'm using the latest clients, so this doesn't apply to me." Well, depending on how old your broker version is, yeah, it might. That's always important, and you can actually see that in the broker logs. It'll tell you what the effective version is you're using, so that's just a pro tip and it's super fun to see that. Like, "Oh, all right. I guess that's the agreed-upon version we're using." I do wish-
Okay, where the default is to think, "I've got the client library and my consumers are up to date, therefore, I'll have all the latest behavior."
That's... Okay. That is a mistake.
Yeah, and we... Indeed. We should link Gwen's talk on consumer groups from Strange Loop. That is a money talk. It is fantastic. If you want to know more about the Join Protocol and how consumer groups work and how that happens, definitely watch Gwen's Strange Loop talk.
It is actually... can confirm I use that as source material for, a little brief diversion here, the last conference talk I gave before the pandemic. I won't say the name of the conference, but it turned out like a couple of weeks later there were a couple of COVID-positive people there, it was in the UK, and they sent out an email, this big thing. They had like bouncers at the front door with hand sanitizer stations. People were just kind of grappling with the reality that something was happening, but not shutting the whole world down.
The last conference talk I gave before the pandemic has shut down in-person conferences, I used that Strange Loop talk of Gwen's as research material so I could better understand the consumer group protocol. What I did in that talk is I had people get on stage and physically act out, among other things, consumer group rebalance, and-
I had to do a few other...
Is that recorded?
You know... It is, and maybe I'll put it in the show notes, and I'm going to agree with you, Anna. I'm going to accept those words of affirmation. It actually was awesome. Mixed reviews, it was very polarizing. Some people were like, "Oh, this is great," and some were like, "Oh, you suck. Why don't you just have slides?" So-
I'll do it again.
That's the visual learner divide, though. You reached people that are hard to reach, Tim. I would be like, "You know what? I did it and I'm not sorry."
I'm not even sorry and I'm sorry if you didn't have a good time if you were there. There was a non-zero chance for somebody who was in the room there who is listening to this podcast and maybe you're one of the people who was a little salty walking out. I'm sorry that that happened to you, but I'm probably going to do this again someday. Anyway, yes to Gwen's talk, that goes in the show notes. Maybe my talk goes in the show notes.
This is not [crosstalk 00:36:21]-
I did, you were there when I did that talk when we were in Sweden.
Yeah. Oh, wait a second-
Wait a second. Am I-
No, no. Okay, Sweden was a month before this one that I'm thinking of-
Right. I was getting a little bit like [crosstalk 00:36:32]-
I was absolutely there.
Yeah, people were talking about it. I remember that. I was actually... The last place I went before COVID, I was in Jersey City like the night they enacted curfew and I felt like one of those people you see playing in the ocean during a hurricane. I was like, "Am I that-
"Person? Am I that idiot that you see on The Weather Channel? Like, 'Whoo, yeah.'" I felt-
You know, like-
And I'm pretty risk tolerant, so it took me a while to-
Me, too. I was like, "Well, I guess everything's okay. Offices are open, so it should be fine." Then, they were like, "Yeah. No, we're [crosstalk 00:37:03]-
Should be fine.
"Enacting a curfew." I was like, "Well, something is bad. I should probably get on a train."
And here we are months later, but-
I know, and Halloween-
I'm glad I could be a part of your last pre-pandemic conference talk. That was an exciting talk for any number of reasons, and anyway, we're at Halloween 2020. Go on. I'm going-
To put some notes in the show notes to make sure these two talks get linked there, but tell us about how our heart might stop beating because somebody is old and somebody else is young.
Okay, so what I've learned from researching this JIRA is shut the front door, joining groups is incredibly complicated.
It is incredibly complicated and I love this JIRA because it involves my absolute favorite thing in Kafka, Purgatory,
I love Purgatory. It's great. Best name ever because it's where things go to wait, and so the question really becomes during a join, when someone's joining a group, when all of this stuff is going on, how do you deal with keep-alives? How do you deal with heartbeats? What do you do with them? When do you schedule them? How do you handle them? There have been a lot of fixes in the newer join group protocols that make this a non-issue. However, as an application developer or somebody who's running Kafka consumers, you may not be responsible for what version you use, right? It's the lowest-
Because it's dictated [crosstalk 00:38:47]-
By the broker.
By the broker usually. What ends up happening here is the heartbeat is blocked because of the sequence of events that happen until the static timeout of five minutes for older consumers.
No big deal.
Right. Or is it?
Which it is, again, right? Theoretically-
Which it actually really is.
Theoretically, this could cause a problem. Yes, it does. Right. If a member falls out of the group due to expiration, then the heartbeat's going to expire and that's going to trigger another rebalance. That's what we don't like because rebalances take time. They're not free, which is what I always tell people when they say like, "Can we have a consumer group that's like a thousand people?" I'm like, "I don't know. Do you want things to work? Do you really need that many?" There are ways that you can obviously support more members of a consumer group, but you pay a penalty because of this type of thing, rebalance, assignments, all of these fun things.
What ended up happening here, best as I can tell because, again, I was tracing down in the code like the callbacks and how this is done and it is very complex, which is why I think there's a lot of nuances here. What ended up happening is we basically try to force complete a heartbeat, which pulls it out of Purgatory. When a new member joins a group, there's a static timeout of five minutes and we throw a heartbeat in Purgatory. We say, "Okay, you stay there until this awaiting join, all that stuff's done. Then, we're going to force complete that heartbeat and schedule it."
Well, the way that the logic was structured, that heartbeat wasn't getting force completed, which is what you want. As soon as the join is done you want to be like, "All right, the heartbeat's force complete and we're ready to go. Now just treat it like a normal member." That wasn't happening. It was just sitting there in Purgatory for five minutes, which is not ideal. What we ended up doing is restructuring that logic around the join, and then we kind of say, "Is this member new?" If it is, then you kind of do one thing, but if it's awaiting join or awaiting sync, you don't remove it. You just keep returning true until it's done with the join, which simplifies the logic for older versions that had perhaps different timeouts and different settings and all of that stuff.
If it's not new and it's not in the middle of a join, you just check the last heartbeat and session timeout. That's how we determine whether or not a consumer instance is alive, so that's really what this is doing. It's saying, "Okay, there's a nuance when you're joining a group in the protocol for determining if you're alive." It was broken and so what we did is we kind of broke it down into... It's much more readable now, at least to me, to follow it, so like-
Do you mean?
Yeah. Well, yeah, just the logic. The code for sure.
We kind of say, "Well, if it's new, okay," then you consider the group coordinator and new member join timeout, which is configurable because there are different use cases for how long you want to let a new member take to join because, again, remember during the rebalance for some protocols, it's a stop the world. If it's in the middle of a join, just wait. Don't remove that new member. Just wait till we know it's either joined or not, and then if it's not, then you just use the session timeout plus the last heartbeat. It kind of simplified it and made it work for those older versions.
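The branching Anna walks through here can be sketched roughly like this. To be clear, this is an illustrative Python sketch, not Kafka's actual Scala broker code, and every field and parameter name in it (`is_new`, `awaiting_join`, `last_heartbeat`, and so on) is made up for the example:

```python
# Illustrative sketch of the heartbeat-completion check described above.
# Not Kafka source code; member fields and timeouts are stand-in names.

def heartbeat_satisfied(member, now_ms, new_member_join_timeout_ms):
    """Return True if the member's heartbeat should be considered alive."""
    if member["is_new"]:
        # New members get the configurable new-member join timeout,
        # instead of sitting in Purgatory for a static five minutes.
        return now_ms - member["joined_at"] <= new_member_join_timeout_ms
    if member["awaiting_join"] or member["awaiting_sync"]:
        # Mid-rebalance: don't expire the member; keep returning True
        # until the join/sync round completes.
        return True
    # Steady state: alive iff we heard a heartbeat within the session timeout.
    return now_ms - member["last_heartbeat"] <= member["session_timeout_ms"]
```

The point of the restructuring is visible in the middle branch: a member that is awaiting join or sync is never expired on heartbeat grounds, so the force-complete on join can't strand it in Purgatory.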
I see that, and there's also that pseudocode in the JIRA itself that-
I was just reading while you were talking through that, and I can confirm you were accurately describing the new join heartbeat completion logic as of KAFKA-92-
And can we call out that Sophie did this and she just became an Apache Kafka Committer. Go Sophie-
Go Sophie. What? So cool.
Nice job, Sophie.
Yeah. She's amazing.
Let's link to Sophie in the show notes.
I don't know what link to Sophie means, but we're going to find that out.
She's got a couple of dynamite blog posts. Amazing-
So yeah, link to-
Anything Sophie's got-
Maybe Twitter accounts, typically a Twitter account. There's going to be some Sophie links in here because she fixed this JIRA. All right, we're coming up on time. We've got I think one more, right?
Okay. We got KAFKA-9065. Your title is?
I think I said Infinite Ghosts... See, I didn't write that one down. Oh, wait, yes I did, Anna.
You know, you're going to have to send me this afterward because they're going to have to be in the show notes-
This one is-
I'm not going to-
This one is Ghost Partition Haunts Forever, and specifically, technically, it's a ghost log segment, but that didn't sound as good as partition, so I went-
Yeah. You get to log segment, you're like, "Whatever, nerd. Come on. Just get off it." You know? Loading Offsets and Metadata Loops Forever. Kind of frankly already a little, I don't know, somehow Newmanal in its orientation in the non-Anna title, but as usual, yours is better. Tell us about the JIRA and how it got fixed.
Okay, so in Kafka, we assume that we can reach the log end offset by starting at the beginning of the log segment and incrementing the offset from each fetched record, which makes sense. It's a durable log. It's durable. It's there. You start right from the beginning and then you increment the offset from the first record you read in the next log segment. That makes sense. That's reasonable. That's [crosstalk 00:44:59] rational.
That's what I would do.
Right, except it's not in some cases. If you read the pull request for this, it's a great conversation. Absolutely great conversation, and there's an example in here. What Jason explains in this, he says, "Okay, so if you're rolling a broker and it crashes, it's possible that you can get an empty active segment," and that's what this hits upon. Basically what happens is it goes into an infinite loop. If it hits an empty log segment, and there are several legitimate reasons this could happen, right? It never loads consumer offsets. It never loads any of that. It backs up everything for that partition, which is, obviously, very, very bad. What we ended up doing is putting in a break, which is a good idea because you really don't want to have an infinite loop. That's just never a good idea.
Yeah, but you're not just sitting there, right?
Comes in, you're going to do something.
You're doing things.
Right. Nobody's going to go into that empty log segment and put some stuff in there. You know what I mean? There's no condition, I guess, that can free you from this torment. How's that [crosstalk 00:46:26] for spooky? That's spooky, right? Yeah.
That is pretty spooky.
Right, and so what they ended up doing is that they break from empty segments, so they don't count on the first record in every log segment to contain the next offset. This was fascinating to me because I have seen weird behavior in consumer groups and consumer group offsets and loading, so knowing this is kind of... I love this because if a customer hits... By the way, again, the fix version for this is like 2.4.1, so there are people out there that are still running older versions.
It's just so nice to be able to go, "Hey, can you do me a favor and check to make sure that you have no zero-byte log segments?" if you're on an earlier version. That's really why this is kind of my favorite thing in the world to do, to deep dive on these types of JIRAs, because you can understand more about how these are loaded and how things work in Kafka internals, which I think just makes... It helps me at my job, but I also adore Kafka, so the more I know about it, the happier I am.
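The loop-and-break behavior Anna describes can be sketched roughly like this. Again, this is an illustrative Python sketch, not the actual Kafka Scala code; `read_batch`, the segment fields, and the offsets are all invented stand-ins for the example:

```python
# Illustrative sketch of the offsets/metadata loading loop described above.
# Not Kafka source code; `read_batch` and the segment fields are stand-ins.

def load_offsets(segments, read_batch):
    """Walk segments from the start, advancing by each batch's next offset."""
    offset = 0
    for segment in segments:
        while offset < segment["end_offset"]:
            batch = read_batch(segment, offset)
            if batch is None:
                # The fix: an empty (e.g. zero-byte) active segment yields no
                # batch, so break out instead of spinning forever waiting for
                # a first record that will never arrive.
                break
            offset = batch["next_offset"]
    return offset
```

Without the `if batch is None: break` guard, an empty segment would leave `offset` stuck below `end_offset` and the inner loop would never terminate, which is the "ghost log segment" haunting the partition forever.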
Exactly, and that's one of the reasons why honestly doing spooky ones for Halloween is good. It should probably be like a quarterly feature of just digging through some strange recent JIRAs because especially things like this that are interesting edge cases like the connect rebalance one or the consumer rebalance one. These are things that if you're kind of an intermediate Kafka user, you're definitely aware that those things exist, that connect rebalance, consumer groups rebalance. Do you know how those things work? You know, most of us don't unless you've had to go through the trenches on one of these problems. You probably don't know the details, and so digging into these things forces you to learn those details.
You have to for your job. You're sort of a world-class expert on these things, but everybody whose job is not that who's just building applications on Kafka, it's still I think intellectually salutary to go through these things and understand them the best you can because you get really good internals knowledge. Ideally, somebody else is dealing with operational concerns, you're running in the cloud and all of that kind of stuff, but all of these things... I'm just looking back through these. All or almost all of them have a client component, so you still got to know stuff and I'm glad we have Anna to teach us stuff.
Yeah. I mean, I would agree 110% with that, and I would also say that the thing I don't like, when I get busy and, obviously like everybody else, it's quarantine, you're working from home. I think, me personally, that tends to have people work more, not less. I've been a remote worker for years and years and years and years and years and years and years and years and years and years.
Yeah, you and me both.
If you're not disciplined about it, you can tend to try to work more, and I think finding time to get the satisfaction from truly understanding a process, it's just like you can't compare it to anything else. When I first truly understood Purgatory, for example, I was just so happy because all of a sudden, I could draw all of these connections and that's why this is my favorite show to do. I was so happy last night when I discovered that the maximum epoch for a producer state was 32,766. I was like, "What? That's wonderful." It's really-
It’s Halloween again, which means Anna McDonald (Staff Technical Account Manager, Confluent) is back for another spooktacular episode of Streaming Audio.
In this episode, Anna shares six of the most spine-chilling, hair-raising Apache Kafka® JIRAs from the past year. Her job is to help hunt down problems like these and dig up skeletons like:
If JIRAs are undead monsters, Anna is practically a zombie slayer. Get a haunting taste of the horrors that she's battled with as she shares about each of these Kafka updates. Keep calm and scream on in today’s special episode of Streaming Audio!
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.