October 28, 2020 | Episode 126

Most Terrifying Apache Kafka JIRAs of 2020 ft. Anna McDonald


Tim Berglund:

Third time, the triumph of hope over experience on your part I suppose, but I'm glad that you're willing to return. Anna, last year's Halloween episode was your idea. It was like your first day on the job or something like that when it aired. It was all pretty cool and you had come up with some scary JIRAs to frighten us into having better knowledge of how Kafka works I think was the plan, and it seems you've done that again.

Anna McDonald:

Yes, and this year, since I've spent a year working with real customers who are using Kafka, these scary JIRAs, many of them, are things customers have actually hit, so they're terrifying in practice.

Tim Berglund:

It is actually your job to help people who are having trouble with something get past it, so you know literally where the bodies are buried. You know where the skeletons are, you know where the zombies are, the other sorts of undead creatures, werewolves, any of these things, you encounter them-

Anna McDonald:

Correct.

Tim Berglund:

You fight them. You are kind of like our own Van Helsing, I think [crosstalk 00:02:14]-

Anna McDonald:

That's true [crosstalk 00:02:15]-

Tim Berglund:

To say.

Anna McDonald:

True story, I have Von Willebrand's disease, which is like a bleeding disorder, like a really light case, and for years-

Tim Berglund:

Literally didn't know that.

Anna McDonald:

I thought... It was also the vampire hunter. I don't know why. When I was much younger-

Tim Berglund:

You thought the name of the disorder was the same as Van Helsing?

Anna McDonald:

I thought it was after the vampire hunter because it had to do with blood.

Tim Berglund:

Yeah. Wouldn't you kind of [crosstalk 00:02:37]-

Anna McDonald:

Years-

Tim Berglund:

Prefer that? Blood, right? Wouldn't you sort of prefer-

Anna McDonald:

Yeah.

Tim Berglund:

That they be the same?

Anna McDonald:

I would because then it sounds much cooler than being like I have problems making the Von Willebrand Factor, which is kind of dull.

Tim Berglund:

That, like nobody cares but, "I have Van Helsing's disorder," like, "Oh, well, you might have like some vaguely Steampunk-y crossbow that you're going to kill me with if I don't watch myself."

Anna McDonald:

Exactly. Yes, it's all about the Steampunk.

Tim Berglund:

I know I'm kind of like mixing literary traditions there, but I don't know, one imagines Van Helsing having some sort of weaponry that is period-

Anna McDonald:

Like a crossbow.

Tim Berglund:

Kind of high-tech in some surprising way, which actually is kind of a lot like you now that I'm really thinking about it. Anyway, before we get into the JIRAs, this is the second time you've done the very special Halloween episode of Streaming Audio. Is Halloween a big deal at the McDonald household?

Anna McDonald:

Well, yes, for the kids. I have a standard costume I wear. I'm a burglar every year because you can wear like a black sweater and a knit hat, so I feel like [crosstalk 00:03:43]-

Tim Berglund:

Good.

Anna McDonald:

That scales. It's great, but yes, my kids tend to pick costumes that you can't buy in a store, so we do spend quite a lot of time on Halloween preparing.

Tim Berglund:

Excellent. I'm kind of wondering how trick-or-treating is going to be this year with the pandemic. You figure this is a relatively easy day on which to cover your face, like your face might be covered anyway, and you'll probably see a lot of surgeons, it's sort of the convenient one for wearing a mask, and I'll figure out some sort of mask I can [crosstalk 00:04:14] wear, but I'm not sure what.

Anna McDonald:

Well, I'm going to be honest, I think the mask compliance in some areas will be higher on Halloween than during a normal day.

Tim Berglund:

I think you're right, I think you're right, and I've got like a pretty good Mal Reynolds costume. I've got a pretty decent Jack Burton, which is not seasonally appropriate. It's typically cold on Halloween in Denver, and I'm thinking of adding an Obi-Wan-

Anna McDonald:

Nice.

Tim Berglund:

Sort of Attack of the Clones/Revenge of the Sith-era Obi-Wan. I've got the beard going on now, so I can kind of do that, but anyway, some Halloween thoughts.

Anna McDonald:

I'm excited.

Tim Berglund:

Me, too. Me, too. It is your job at Confluent to help actual customers who are having trouble with something not have those problems. You guys on your team, you have all of these cool acronyms and, Roger, public apology, I just can't remember them. I need to make a cheat sheet or something. I don't do well with this. I still think of you as a TAM, or technical account manager, Anna, but your actual title is-

Anna McDonald:

A CSTA, and I like to say we're CSTA because we help our customers be able to take naps.

Tim Berglund:

That might help me remember, it might not, it might not to be fair, but CSTA, C-S-T-A, which is Customer Solutions Technical Architect?

Anna McDonald:

That is awesome. Yeah, I don't think I get that right as much as you have.

Tim Berglund:

Yeah, yeah. Imagine if we were habitual cohosts and 50% of the time I had to introduce you as Customer Solutions Technical Architect. I wonder how often I'd get it right? Probably not much, but anyway. You want to start scaring us?

Anna McDonald:

I do. Okay, so I've picked out some really good ones this year, and the one I'd like to start with is KAFKA-9144. I'm subtitling this Early Death Causes Epoch Time Travel.

Tim Berglund:

Ooh.

Anna McDonald:

That's my subtitle. The actual title is Early Expiration of Producer State Can Cause Coordinator Epoch to Regress.

Tim Berglund:

Which is nothing if not prosaic. I think I like your version better.

Anna McDonald:

Yeah, right? This is very interesting. We have seen a lot of people adopt the upper-level Kafka APIs and just kind of more complex use cases with Kafka. Transactions help enable those use cases, so I'm a little bit partial to anything that bothers transactions. I think they're fascinating. True story, one of the only times that we diverge from the truth of the log in Kafka is in producer state. We take snapshots, and we do that because in Kafka Streams, you expire things from the log very quickly in something like a changelog when you're doing a repartition.

Anna McDonald:

Lo and behold, we were very surprised to find out how quickly producers were expired, so instead, they take a snapshot. That snapshot at any point in time doesn't have to match what's in the log, so if anyone ever tells you the truth is the log all the time in Kafka, you can be like, "Not producer snapshots," boom.

Tim Berglund:

Okay, and when are producer snapshots taken? Just to make sure everybody's got the background, when, and for what purpose?

Anna McDonald:

They're taken at intervals, and really, it's because sometimes we, like, overeagerly cleanse the log. When that happens, you end up getting these warnings, and anybody who's used EOS and Kafka Streams will be aware of this. You'll see these annoying warnings in the log that say, "Unknown transaction ID," and it's a retryable error, but it's very annoying and it's a lot of overhead. They were like, "You know what? We're just going to snapshot it, because we know it's still safe to know that this transactional producer is known to us. It's just because of how quickly we get rid of stuff in the log when we don't need it anymore for certain operations." It's just a good idea to keep it, and it's taken at different intervals.

Tim Berglund:

Gotcha. What happens in this terrifying JIRA?

Anna McDonald:

Okay, so the bottom line is producer state expiration. When does that happen? What we ended up doing was we said, "Okay, we're going to expire producers based on the last timestamp," which for all intents and purposes seems pretty reasonable. You're going to expire a producer based on, "Well, it hasn't produced anything in this long." The issue comes in when there's a failed transaction or somebody goes rogue, so like you have a zombie coordinator, because transactions, just like consumer groups, have coordinators. If a coordinator goes rogue, it's a zombie, or a producer goes rogue, it's a zombie. Then you start to introduce these very interesting edge cases, and this is one of them.

Anna McDonald:

What ends up happening is the timestamp that's used to say, "Hey, should we expire producer state?" It gets set to -1 because there's nothing in the log for the last message because this transaction perhaps got aborted. The one that it goes over first in this JIRA is a producer writes transactional data and it fails. The coordinator sees that and is like, "Well, it failed, so it's not a committed transaction," and it writes these abort markers. Then, it bumps the epoch, and the epoch basically, and this is fascinating, too, I looked this up and I actually tweeted about it, too, because I just want to know if anyone's ever hit this. The epoch is just a monotonically-increasing number and we do that in order to fence zombie coordinators and zombie producers.

Anna McDonald:

The maximum value for that is short max value minus one, so you can only have 32,766 epochs for a given producer ID, and I'm just fascinated [crosstalk 00:10:04]-

Anna McDonald:

Yeah, like has anyone [crosstalk 00:10:07]-

Tim Berglund:

Does it get put in a header or message or something somewhere? I mean, it's-

Anna McDonald:

No, it's actually-

Tim Berglund:

When you're producing?

Anna McDonald:

Yeah, it's actually stored right in the producer's state locally, so it knows and it does send it. If you look, there is a really nice illustration of this issue in the JIRA by none other than Jason Gustafson, and so Kafka has these guarantees when it looks at epochs, and that's why, as I'll get into, this bug caused some really bad things. For instance, one of the consequences was a node that would not start up, so this is a thing that impacted brokers, which is never a good thing, right?

Tim Berglund:

Yeah.

Anna McDonald:

That epoch is kind of assigned and kept and that's part of producer state, so as long as the producer state is constant, the epoch is kept. It's kind of assigned in the group metadata and assigned in the coordinator and it knows what it is.

Tim Berglund:

Okay, and that epoch doesn't advance very frequently, then?

Anna McDonald:

Well, yes, it can-

Tim Berglund:

Oh, does it?

Anna McDonald:

Yes, because-

Tim Berglund:

Okay.

Anna McDonald:

Basically, what happens is, and this is what we're talking about, if a producer dies in the middle of a commit, if something's gone rogue, we always wonder, "Is it a zombie? If I can't connect to this coordinator, I can't connect to this producer, it could be a zombie, in which case it doesn't know that it's dead. It's alive but not really-

Tim Berglund:

Right, right.

Anna McDonald:

So we bumped the epoch to make sure that it can't cause any funny business. We're kind of fencing it. Anytime we see that behavior, we bump the epoch, and when you bump the epoch, that means the old zombie has the earlier epoch and can't do any harm because we check that before we [crosstalk 00:11:43] commit a log, right? Yeah.
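The fencing Anna describes, reject anything carrying a stale epoch, can be sketched as a toy model. This is not Kafka's actual code; the class, method names, and producer ID below are invented purely for illustration:

```python
# Minimal sketch (not Kafka internals) of producer-epoch fencing.
# A coordinator tracks the latest epoch per producer ID; any request
# carrying an older epoch is treated as a fenced zombie and rejected.

class ProducerStateManager:
    def __init__(self):
        self.epochs = {}  # producer_id -> current epoch

    def bump_epoch(self, producer_id):
        """Called when a producer looks dead or rogue: fence the old instance."""
        self.epochs[producer_id] = self.epochs.get(producer_id, 0) + 1
        return self.epochs[producer_id]

    def try_write(self, producer_id, epoch):
        """Accept a write only if its epoch is at least the current one."""
        current = self.epochs.get(producer_id, 0)
        if epoch < current:
            return "FENCED"  # zombie with a stale epoch, no funny business
        self.epochs[producer_id] = epoch
        return "OK"

mgr = ProducerStateManager()
mgr.epochs["txn-app-1"] = 18          # producer was at epoch 18
mgr.bump_epoch("txn-app-1")           # coordinator aborts and bumps to 19
print(mgr.try_write("txn-app-1", 18)) # zombie at 18 -> FENCED
```

The guarantee only works because the epoch is monotonically increasing and the state that records it sticks around, which is exactly what the bug breaks.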

Tim Berglund:

Detect it as the activity of an undead creature?

Anna McDonald:

Exactly, totally, and that guarantee holds while that producer state is not expired. When do we expire producer state? Well, we look at the timestamp. All of this sounds really good and awesome, and fencing is great and we love it. However, the problem is what we used to do is we used to use the timestamp from a message, the last message. Well, if there isn't a last message, it was setting the timestamp for this producer to -1, which means, "Oh, hello-

Tim Berglund:

Okay.

Anna McDonald:

"I'm the producer state cleanup. It's -1. I'm going to clean it up." Remember, now what we have is we have a situation where something has gone wrong. The coordinator has bumped... This is twofold. In the beginning, they thought this was like an... Basically, I think it was Jason who said this, it takes an alignment of planets to hit this bug, but later on we found an even more common case for it. This one is the coordinator, so a producer fails. Then, the coordinator will see that and go, "Oh, I'm timing out this transaction because it's failed, obviously, and I'm going to write an abort."

Anna McDonald:

It bumps the epoch to make sure that that producer is kind of quarantined, but then the coordinator itself becomes a zombie, so the coordinator's going to keep trying to send. The old coordinator, it's going to go, "Well, I don't know if it failed or not. I don't know if it failed or not." There's a new coordinator elected. Now, when a new coordinator is elected, if the producer state is still there, it should know what the epoch is. However, because it was -1, it says, "Oh, I don't have a producer state," and resets the epoch. All of a sudden, the protection that you have from that monotonically-increasing number is gone.

Tim Berglund:

Right.

Anna McDonald:

What ends up happening is that abort, the new coordinator will actually bump that epoch and write a new abort message saying, "No." That's going to have the later epoch. I'll read an example with numbers in one second so it will make a little bit more sense, but that zombie's still out there. The epoch has been reset to zero, so there's no protection anymore, and all of a sudden, what you see in the log is, "Out of order epochs." Before this fix went in, that is the death of everything, so-

Tim Berglund:

Yeah.

Anna McDonald:

It'll cause a broker not to start. It's not a consistent state. It will be like, "Whoa, whoa, whoa. No. Absolutely not." It was crashing brokers. The way that this shows up a little more commonly, it can result in a hanging transaction, and this is a great JIRA to read because of the example here. You've got a producer and it opens a transaction, and let's say the epoch is 18. Then, it loses communication with the cluster. The coordinator sees this and it's like, "Oh, I'm aborting this transaction, and I'm going to bump the epoch because this producer has obviously gone out to lunch."

Anna McDonald:

When it does that, the producer state, that timestamp gets set to -1, producer state is cleaned up, which means all of a sudden the epoch goes back to, like, zero. Then, that producer zombie, who's at 18, is like, "I'm back from the dead and there's no protection." It can accept the write, and what you end up seeing in the logs, which you should never, ever see in Kafka for a producer ID, is a producer epoch of 17, then one of 19, and then 18 when the zombie comes back in. That [crosstalk 00:15:54] cause-

Tim Berglund:

There's nothing you can do.

Anna McDonald:

There's nothing you can do at that point. That's really bad, and so it shows up when you have zombie coordinators and when you have, like, a zombie producer for transactions. It's just not good at all. It all comes down to the fact that the timestamp that we're using was based on a message, and if the message isn't persisted, you've got no timestamp. Can you guess what the fix is? It's-

Tim Berglund:

No, I can't.

Anna McDonald:

Use a better timestamp. The fix is, we use timestamps from the transactional markers as well because, remember, in both cases, we're bumping the epoch and we're committing an abort. When you use that timestamp, it prevents early expiration of producer state, which preserves that monotonically-increasing guarantee. As a nice-to-have, Jason Gustafson also changed validation logic, so coordinator epochs are only verified for new requests. What that means is people could patch and bring up those nodes. We used to fail out for that, but now we only log a warning, so the patch both fixed the issue and allowed recovery for people who had hit it.
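A rough sketch of the before-and-after expiration logic being described here. All names are invented for illustration; the real logic in Kafka's broker is more involved:

```python
# Sketch (assumed names, not Kafka internals) of why using only the last
# data message's timestamp caused early expiration, and how also taking
# the transaction marker's timestamp into account fixes it.

NO_TIMESTAMP = -1

def last_timestamp_buggy(last_data_ts, last_marker_ts):
    # Pre-fix behavior: only data messages count. If the transaction was
    # aborted and its data cleaned up, this is -1 -> state expires at once.
    return last_data_ts

def last_timestamp_fixed(last_data_ts, last_marker_ts):
    # Post-fix behavior: the abort/commit marker's timestamp also keeps
    # the producer state alive, so the epoch cannot regress.
    return max(last_data_ts, last_marker_ts)

def should_expire(ts, now, retention_ms):
    return ts == NO_TIMESTAMP or now - ts > retention_ms

now = 1_000_000
abort_marker_ts = now - 10_000  # coordinator just wrote an abort marker
assert should_expire(last_timestamp_buggy(NO_TIMESTAMP, abort_marker_ts), now, 60_000)
assert not should_expire(last_timestamp_fixed(NO_TIMESTAMP, abort_marker_ts), now, 60_000)
```

In the buggy version the producer state vanishes immediately, the epoch resets, and the zombie at the old epoch is free to write again.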

Tim Berglund:

All right. That is scary. It's only our first one, and by the way, we have five, folks. There are five of these that we're talking about today, and I'm already a little off, I think, but let's go to... Let's see, the next one, KAFKA-9184, famously.

Anna McDonald:

Attack of the Clones. That's what I'm calling this. I was so excited when I thought of that, and this one hit so many people, and it was-

Tim Berglund:

It's kind of similar, isn't it?

Anna McDonald:

Yeah. Well, no, this one is not. This one is bad in different ways, but just as scary.

Tim Berglund:

Okay.

Anna McDonald:

The thing with Connect is, historically there has been stop-the-world rebalancing for unrelated items, as I like to say, so like if you-

Tim Berglund:

Like I want to reconfigure a connector and-

Anna McDonald:

Correct.

Tim Berglund:

"Hey, everybody"-

Anna McDonald:

Or deploy a new connector that's not [crosstalk 00:18:06]-

Tim Berglund:

That's not like that [crosstalk 00:18:06]-

Anna McDonald:

Related, like, "Why do we have to-

Tim Berglund:

"Oh, goodness", yeah-

Anna McDonald:

"Rebalance everything?"

Tim Berglund:

"What were you thinking?"

Anna McDonald:

Correct. Oh, you know what? People improve. As we go along, we think better. When you know better, you do better, right? Great idea. Let's do Incremental Cooperative Rebalancing. Fantastic. That means that these tasks and workers can keep their assignments, so you only rebalance what you need to when you reassign. You don't have to pay the penalty for computing assignments again. Just a great idea, fantastic. People were excited. The world was on fire. It was awesome.

Tim Berglund:

I remember it, yeah. No, I was-

Anna McDonald:

Yeah.

Tim Berglund:

Setting some of those fires.

Anna McDonald:

Right. We were excited, and unfortunately, there was a problem, two problems, actually, which was fantastic because I didn't even know about the second one. The first problem, and I saw this first in Replicator, came up when I got a call. They said, "People are seeing duplicates," and I was like, "Okay, duplicate what?" Then, the first question I asked, as I always ask, is, "Is it for everything? Or is it only for certain partitions?" It was only-

Tim Berglund:

Okay.

Anna McDonald:

For certain partitions. I went... Then, we did some more research on it internally and discovered that we had multiple tasks that were assigned the same partitions. Not a good thing. When you're doing like a-

Tim Berglund:

Not at good thing at all.

Anna McDonald:

Yeah. No, when you're doing a sink connector and you're pulling data down and you're putting it somewhere, you should not be running more than one task per partition. Each task should be assigned its own partitions. We're not supposed to be replicating partition 0 eight times. That's generally bad.

Tim Berglund:

Right, right. You will, for sure, duplicate data.

Anna McDonald:

Right, of course, and so what happened is when the JVM didn't restart, a zombie worker was able to keep its tasks running. When it rejoined the group, because of the new Incremental Cooperative Rebalancing, it still had its assignment. All of a sudden now, the coordinator just saw a new member that already had assignments and was like, "Go for it, bud. You're still running." But it had already reassigned those partitions when that worker fell off, so all of a sudden now, you've got two tasks running with the same partitions, which is not good, obviously.

Tim Berglund:

Got it. No, not at all. Okay, so I'm tracking... By the way, we didn't give the full title of the JIRA. You said it was Attack of the Clones. The title of the JIRA is Redundant Task Creation and Periodic Rebalances After Zombie Worker Rejoins the Group, and that applies to the Connect group.

Anna McDonald:

Brains. I just want to say brains because it's a zombie.

Tim Berglund:

Brains, yeah, absolutely. Brains.

Anna McDonald:

Exactly.

Tim Berglund:

Brains. Okay, go on.

Anna McDonald:

Right, and so the fix was that each individual worker tracks its connectivity now in a non-blocking manner. It basically tells the worker, and the worker becomes sentient: it knows now when the coordinator's unreachable. Prior to this, the workers were kind of like, "I got my assignments. My JVM didn't restart. I'm cool." Now, it knows enough that it needs to actually communicate, which it does when you're using Cooperative Rebalancing. You've got to cooperate. If the connection to the coordinator is inactive, the worker now owns the responsibility of stopping its connectors and tasks and not just hanging onto those assignments.

Anna McDonald:

Here's the other part that surprised me because I did not know this. When this happened, it's like a double whammy. When this happened, the assigner that kind of says... basically what it does is it says, "Okay, here's the delay. Then, we're going to try to rebalance again." That delay was also not getting set to zero, so all of a sudden you also had periodic rebalances for no reason, which is obviously not the point of introducing Incremental Cooperative Rebalancing.
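To make the worker-side half of the fix concrete, here is a minimal sketch. The class and method names are invented, not Kafka Connect's actual API; it just models a worker giving up its tasks once it notices the coordinator is unreachable, rather than running on as a zombie with stale assignments:

```python
# Illustrative sketch (hypothetical names, not Kafka Connect code) of the
# post-fix behavior: a worker that can't reach the group coordinator
# proactively stops its connectors/tasks instead of keeping them running.

class Worker:
    def __init__(self, assignments):
        self.assignments = set(assignments)
        self.running = True

    def poll_coordinator(self, coordinator_reachable):
        if not coordinator_reachable:
            # Post-fix: stop tasks and drop assignments; the coordinator
            # has, or will have, reassigned these partitions to someone else.
            self.assignments.clear()
            self.running = False

w = Worker({"topic-a-0", "topic-a-1"})
w.poll_coordinator(coordinator_reachable=False)
print(w.assignments, w.running)  # set() False
```

The second half of the fix, resetting the scheduled-rebalance delay, is separate: without it, the group kept rebalancing periodically for no reason, as Anna describes next.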

Tim Berglund:

Right, right. That's going to be much worse.

Anna McDonald:

Yes. This is terrifying. I love this one. That was also fixed, obviously.

Tim Berglund:

Of course. Okay. Are you ready for the next one?

Anna McDonald:

I am ready for the next one.

Tim Berglund:

Okay. This is a ZooKeeper JIRA.

Anna McDonald:

I know.

Tim Berglund:

ZOOKEEPER-3056.

Anna McDonald:

I threw-

Tim Berglund:

What do you call it?

Anna McDonald:

I threw a curveball. I call this Why Can't I Upgrade, I'm in Hades, but yeah.

Tim Berglund:

Less evocatively known as Fails to Load Database With Missing Snapshot File But Valid Transaction Log File.

Anna McDonald:

Yeah, right. One of the things-

Tim Berglund:

I like your title better.

Anna McDonald:

Right, yeah, and this happened to a couple of people and there may be people listening to this podcast who are like, "How could you have a ZooKeeper without a snapshot file?" Well, believe me, you can. I've seen it with several customers, so anytime you don't have a lot of activity on ZooKeeper, there's this thing called snapCount. That snapCount says, "This many transactions have to occur before I take a snapshot." As one might imagine, pretty self-explanatory, and if you don't reach-

Tim Berglund:

Right, right.

Anna McDonald:

That snapCount, it doesn't take a snapshot. ZooKeeper, in its infinite wisdom, decided to require a snapshot when you need to upgrade it, so when people tried to upgrade without a snapshot, it failed. My favorite thing about this JIRA is there is a null snapshot file attached to it, and it's like, "Just download this null snapshot file to your [crosstalk 00:23:52]-
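A toy model of the snapCount behavior being described. snapCount is a real ZooKeeper setting (default 100,000 transactions); the class below is illustrative only, and real ZooKeeper actually snapshots at a randomized point within the interval:

```python
# Toy model of ZooKeeper's snapCount behavior: a snapshot is only taken
# after snapCount transactions, so a quiet ensemble may never produce one.
# (snapCount is a real ZooKeeper setting; this class is not ZooKeeper code.)

class SnapshotPolicy:
    def __init__(self, snap_count):
        self.snap_count = snap_count
        self.txns_since_snapshot = 0
        self.snapshots = 0

    def log_txn(self):
        self.txns_since_snapshot += 1
        if self.txns_since_snapshot >= self.snap_count:
            self.snapshots += 1
            self.txns_since_snapshot = 0

quiet = SnapshotPolicy(snap_count=100_000)  # the default
for _ in range(500):                        # low-traffic environment
    quiet.log_txn()
print(quiet.snapshots)  # 0 -> no snapshot file ever written

busy = SnapshotPolicy(snap_count=50)        # lowered before upgrading
for _ in range(500):
    busy.log_txn()
print(busy.snapshots)   # 10
```

This is exactly why a low-traffic environment can run for ages with a valid transaction log and no snapshot file at all, and why lowering snapCount before an upgrade works as a workaround.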

Tim Berglund:

Snapshot.0. It's 400 bytes.

Anna McDonald:

Yeah. Sure. Just download it. I'm sure it's cool. It's legit.

Tim Berglund:

It's fine.

Anna McDonald:

Right, and so it's just an empty snapshot file, and then they say set snapshot.trust.empty to true. This works apparently for some people, but it doesn't work for a lot of people, and I've seen at least one case where someone did this, and for whatever reason, it did not rebuild off their valid transaction log and they lost data. What did they lose? Metadata, things like topics. What are your topic names? What are your partitions? Right? All of the stuff that's durably stored in ZooKeeper still, which is not good, and so-
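For reference, the setting being quoted looks like this, assuming ZooKeeper 3.5.5 or later (check the admin guide for your version):

```properties
# zoo.cfg -- tells ZooKeeper to start even when there is no snapshot file,
# as long as the transaction log is valid. Use with care, per the JIRA.
snapshot.trust.empty=true
```

It can also be passed as the Java system property `zookeeper.snapshot.trust.empty=true`.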

Tim Berglund:

No.

Anna McDonald:

Working through this, and there isn't like a fix for this, but there's a workaround that I recommend. I'm just going to tell people, because I think people, for example, who want to use TLS want to upgrade ZooKeeper, because that was a long time coming. What I recommend people do is, believe it or not, snapCount is configurable, so I recommend you go in, you lower your snapCount to something that's reasonably low, and then force it to take a snapshot before you upgrade.

Tim Berglund:

What's reasonably low?

Anna McDonald:

I don't know, like 50. It depends on what your flow is, right? Depends on how much data you got-

Tim Berglund:

Sure.

Anna McDonald:

Running to a topic. Again, this is really kind of a case that hits people in lower environments where there's not a lot of activity... which is where you should be upgrading first, thank you, please and thank you. It's 50, 25, and then you can even force the creation of it by running a script, doing configuration changes, creating dummy topics and then deleting them. Whatever you need to do to get that amount of activity going to ZooKeeper.

Tim Berglund:

Got it. Okay. The other fix to this is removing ZooKeeper altogether covered under [crosstalk 00:25:40]-

Anna McDonald:

It's so exciting.

Tim Berglund:

Not a JIRA, but KIP-500, and, boy, I bet y'all CSTAs are going to be pretty happy about that.

Anna McDonald:

Yes. Yes, we are, and the thing I'm most excited about is the new testing framework, the DES testing framework, where we'll be able to actually do deterministic testing of things like network partitions. Right now, it's only implemented in the new Raft algorithm, but they're going to put it in for the coordinator and for all of the nodes and all of these kinds of things, so we'll be able to test all of these situations where we want to make sure that in the event of a network partition between this and this, the right leaders are elected, that ISR looks good.

Anna McDonald:

I'm very, very excited about that, and so I pinged Jason I think earlier this week and I was like, "So, hey, it's really awesome and cool. When can we have it in the controller? Can we have it in the controller and all of the coordinator stuff? Like soon, hopefully?"

Tim Berglund:

Yeah. Kind of just for [crosstalk 00:26:43]-

Anna McDonald:

I mean, I don't think it's [crosstalk 00:26:44] going to be soon [crosstalk 00:26:44] but that is going to be so... it's going to be so much better in terms of testing coverage and being able to actually test deterministically specific scenarios in Kafka that we know we need to validate. It makes me feel much more confident about removing ZooKeeper, much more confident, so I'm actually really, really excited.

Tim Berglund:

Good, good. All right. I wouldn't have guessed you wouldn't be. Now, continuing the horror, we have KAFKA-9729. What is your name for this?

Anna McDonald:

So 9729, I actually went almost true to the title. I call it Shrinking the Write Lock to Avoid Maiming Cluster Performance.

Tim Berglund:

Okay, I want you to name JIRAs basically from now till the end of time. Ladies and gentlemen, the actual title of KAFKA, pardon me, of JIRA KAFKA-9729 is Shrink inWriteLock Time in SimpleAclAuthorizer. That's less awesome. Anyway, Anna, tell us about why this is terrifying.

Anna McDonald:

Well, one of my favorite things is, like, I was a Principal Software Developer for like 16 years, and so I used to think a lot like, "Hypothetically, this could cause a problem." Well, now, I'm in the field, and when I read something like this in a JIRA, it makes me giggle. My favorite wording in here is, "Blocking produce and fetch request for two seconds would cause apparent performance degradation for the whole cluster." When I read it, it makes me giggle because I'm like, "You think?"

Tim Berglund:

It's like, "Well, yeah, it might."

Anna McDonald:

Yes. We have people who have an SLA of 400 milliseconds, like-

Tim Berglund:

Yeah

Anna McDonald:

You know, not to mention

Tim Berglund:

Two seconds could be a problem.

Anna McDonald:

Right. I know, but I love it because it's like you can see like... I don't know who actually authored this, but I love them. Thank you for writing this up in this fashion because it just made me smile. It made me remember when I used to like postulate like, "That could probably be an issue." It's just fun to see the other side as I have like over this past year, to see like, "Oh, no, it is an issue." Oh yes it is. That was fun. I enjoyed... That's my favorite line from this entire year.

Anna McDonald:

This is interesting. ACLs are one of the last remnants of things we actually have to have direct connections from the nodes to ZooKeeper for. They don't go through the controller at all. What ended up happening is, the way that the authorizer is written, to remove and add ACLs, you need a write lock, which makes sense. There's absolutely no doubt that seems like a good idea. However, when you do get ACLs, which you need to do on produce and fetch requests to say, "Hey, can I read this?", we were using a read lock. Those read locks cannot be obtained when the write lock is in effect.

Tim Berglund:

Okay. We have a word for this. I mean, that's not just going to be performance degradation. That sounds like a deadlock. Am I hearing you correctly?

Anna McDonald:

Well, yeah. It eventually clears, but yeah, it's blocking. I think it would be fair to call it blocking. I don't know if I would-

Tim Berglund:

Okay.

Anna McDonald:

Necessarily say it was like totally murdered and deceased.

Tim Berglund:

Okay.

Anna McDonald:

It's definitely blocking, and so what ends up happening is all of those produce requests that need to make sure that they can produce to that topic, they just get backed up, right?

Tim Berglund:

Right.

Anna McDonald:

Behind this write lock, and so this really comes into effect because ZooKeeper is very sensitive to latency. They did a simulation, and if there's a 500-millisecond delay, which, by the way, is a hundred percent realistic in something like a stretched cluster or anywhere you've got network latency. That's just a law, the network is never perfect, ever. With a 500-millisecond delay on the ZooKeeper side, the write lock would be held for two to 2.5 seconds.
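The blocking pattern can be demonstrated with a small, self-contained simulation. Plain Python threads stand in for the broker's request handlers here; nothing below is Kafka code:

```python
# Simulation of the contention pattern: while an ACL update holds the
# write lock (waiting on a slow ZooKeeper round trip), every produce/fetch
# authorization check blocks trying to take the lock.

import threading
import time

lock = threading.Lock()  # stand-in for the authorizer's write lock
blocked_ms = []

def acl_update(zk_latency_s):
    with lock:
        time.sleep(zk_latency_s)  # simulated slow ZooKeeper write

def authorize():
    start = time.monotonic()
    with lock:                    # the request path contends with the writer
        pass
    blocked_ms.append((time.monotonic() - start) * 1000)

writer = threading.Thread(target=acl_update, args=(0.2,))
writer.start()
time.sleep(0.05)                  # let the ACL update grab the lock first
reader = threading.Thread(target=authorize)
reader.start()
writer.join(); reader.join()
print(f"authorization blocked for ~{blocked_ms[0]:.0f} ms")  # typically ~150 ms
```

Scale the simulated latency up to a 500 ms ZooKeeper round trip across several operations and you get the two-plus seconds of stalled produce and fetch requests the JIRA talks about.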

Anna McDonald:

Right, and that's where the apparent performance degradation for the whole cluster [inaudible 00:30:57].

Tim Berglund:

Apparent, yeah.

Anna McDonald:

It's apparent-

Tim Berglund:

A little bit.

Anna McDonald:

I mean, it could hypothetically, or for real, like that yes, yes, again.

Tim Berglund:

Yes.

Anna McDonald:

What they did, very sensible, is they just created an immutable structure and used a reference to that to obtain a consistent view. Then, you don't have to do a read lock. You don't have to do a lock at all, and by the way, that's only in 2.6, so most people probably aren't on 2.6 yet, so-
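The shape of that fix, sketched with invented names: publish an immutable snapshot and let readers grab a reference to it, so the authorize path takes no lock at all.

```python
# Sketch (hypothetical names, not Kafka's authorizer) of the lock-free read
# path: readers take a reference to an immutable snapshot; updates build a
# new dict and swap the reference, so authorization never blocks on updates.

class AclCache:
    def __init__(self):
        self._snapshot = {}  # treated as immutable once published

    def update(self, resource, principals):
        new = dict(self._snapshot)  # copy-on-write
        new[resource] = tuple(principals)
        self._snapshot = new        # single reference swap publishes it

    def authorize(self, resource, principal):
        view = self._snapshot       # consistent view, no lock taken
        return principal in view.get(resource, ())

cache = AclCache()
cache.update("topic-payments", ["alice"])
print(cache.authorize("topic-payments", "alice"))  # True
print(cache.authorize("topic-payments", "bob"))    # False
```

Readers may briefly see the previous snapshot during an update, which is fine: each request still gets one internally consistent view of the ACLs.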

Tim Berglund:

Right.

Anna McDonald:

This is current. This is hip. This is a hip JIRA that's scary.

Tim Berglund:

Extremely, yes.

Anna McDonald:

Right, so something to check, something to... I was surprised when I found this one. It was like the shrink that got me. I'm like, "Shrink is a very interesting word to use in a JIRA. Let me investigate this."

Tim Berglund:

Yes.

Anna McDonald:

You think of like shrunken heads, right? That's kind of Halloween-y. I've seen decorations with those.

Tim Berglund:

Exactly. No, it's incredibly Halloween-y. I think perhaps nothing more Halloween-y. Okay. Are you ready for the next one?

Anna McDonald:

I am ready for the next one.

Tim Berglund:

Okay. This is good old KAFKA-9232, and I like your names first because your names are better. What are you calling 9232?

Anna McDonald:

Okay, I think I said Failure to Revive Older Generations. It's kind of like a DNR, Do Not Resuscitate. Yeah.

Tim Berglund:

That's dark. Okay.

Anna McDonald:

Well, I mean-

Tim Berglund:

The formal title-

Anna McDonald:

Yeah. Read the formal title.

Tim Berglund:

Coordinator New Member Heartbeat Completion Does Not Work for JoinGroup v3. Is this the consumer group coordinator?

Anna McDonald:

Yes.

Tim Berglund:

Yeah? Okay, so Coordinator New Member Heartbeat Completion Does Not Work for JoinGroup v3, KAFKA-9232, and you said something about previous... Say your title again?

Anna McDonald:

My title was Cannot Resuscitate Older Groups because if you can't-

Tim Berglund:

I love it. Oh sure.

Anna McDonald:

Right, you can't-

Anna McDonald:

Complete a heartbeat, you're dead.

Tim Berglund:

Right. Right, right. You're going to die, and that's very scary. Okay, so walk us through this and give us whatever little bit of background we need on the consumer group coordination protocol to make this make sense.

Anna McDonald:

Right. One of the things that I think people should know is that when you look at the protocol version that's used between a consumer and a broker, it's your lowest common denominator. You could be using the latest clients, but if your brokers are an older version, you're going to downgrade to whatever version the coordinator that's sitting on that node supports. It's always tempting, when you look at these things, to go, "Oh, well, I'm using the latest clients, so this doesn't apply to me." Well, depending on how old your broker version is, yeah, it might. That's always important, and you can actually see that in the broker logs. It'll tell you what the effective version is that you're using, so that's just a pro tip and it's super fun to see that. Like, "Oh, all right. I guess that's the agreed-upon version we're using." I do wish-
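
The "lowest common denominator" idea she describes can be sketched in a few lines. This is a simplification with illustrative names, not Kafka's actual ApiVersions negotiation code: the effective version is the top of the overlap between the client's and the broker's supported ranges.

```java
// Hypothetical sketch of protocol version negotiation between a client and a
// broker: both advertise a supported range, and the effective version is the
// highest version inside the intersection of the two ranges.
final class VersionNegotiation {
    // Returns the effective version, or -1 if the ranges don't overlap.
    static short effectiveVersion(short clientMin, short clientMax,
                                  short brokerMin, short brokerMax) {
        short low = (short) Math.max(clientMin, brokerMin);
        short high = (short) Math.min(clientMax, brokerMax);
        return low <= high ? high : (short) -1;
    }
}
```

So a brand-new client (say, versions 0 to 7) talking to an older coordinator (versions 0 to 3) silently runs the conversation at version 3, which is exactly why "I'm on the latest clients" doesn't exempt you from an old-protocol JIRA.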

Tim Berglund:

Okay, so the default is to think, "I've got the client library and my consumers are up to date; therefore, I'll have all the latest behavior."

Anna McDonald:

Correct, right.

Tim Berglund:

That's... Okay. That is a mistake.

Anna McDonald:

Yeah, and we... Indeed. We should link Gwen's talk on consumer groups from Strange Loop. That is a money talk. It is fantastic. If you want to know more about the Join Protocol and how consumer groups work and how that happens, definitely watch Gwen's Strange Loop talk.

Tim Berglund:

It is actually... can confirm I use that as source material for, a little brief diversion here, the last conference talk I gave before the pandemic. I won't say the name of the conference, but it turned out like a couple of weeks later there were a couple of COVID-positive people there, it was in the UK, and they sent out an email, this big thing. They had like bouncers at the front door with hand sanitizer stations. People were just kind of grappling with the reality that something was happening, but not shutting the whole world down.

Tim Berglund:

The last conference talk I gave before the pandemic shut down in-person conferences, I used that Strange Loop talk of Gwen's as research material so I could better understand the consumer group protocol. What I did in that talk is I had people get on stage and physically act out, among other things, a consumer group rebalance, and-

Tim Berglund:

I had to do a few other...

Anna McDonald:

Is that recorded?

Tim Berglund:

You know... It is, and maybe I'll put it in the show notes, and I'm going to agree with you, Anna. I'm going to accept those words of affirmation. It actually was awesome. Mixed reviews, it was very polarizing. Some people were like, "Oh, this is great," and some were like, "Oh, you suck. Why don't you just have slides?" So-

Tim Berglund:

I'll do it again.

Anna McDonald:

That's the visual learner divide, though. You reached people that are hard to reach, Tim. I would be like, "You know what? I did it and I'm not sorry."

Tim Berglund:

I'm not even sorry, and I'm sorry if you didn't have a good time if you were there. There's a non-zero chance that somebody who was in the room is listening to this podcast, and maybe you're one of the people who was a little salty walking out. I'm sorry that happened to you, but I'm probably going to do this again someday. Anyway, yes to Gwen's talk, that goes in the show notes. Maybe my talk goes in the show notes.

Tim Berglund:

This is not [crosstalk 00:36:21]-

Anna McDonald:

I did, you were there when I did that talk when we were in Sweden.

Tim Berglund:

Yeah. Oh, wait a second-

Anna McDonald:

How weird-

Tim Berglund:

Wait a second. Am I-

Anna McDonald:

Yeah.

Tim Berglund:

No, no. Okay, Sweden was a month before this one that I'm thinking of-

Anna McDonald:

Yes-

Anna McDonald:

Right. I was getting a little bit like [crosstalk 00:36:32]-

Tim Berglund:

I was absolutely there.

Anna McDonald:

Yeah, people were talking about it. I remember that. I was actually... The last place I went before COVID, I was in Jersey City like the night they enacted curfew and I felt like one of those people you see playing in the ocean during a hurricane. I was like, "Am I that-

Tim Berglund:

Right.

Anna McDonald:

"Person? Am I that idiot that you see on The Weather Channel? Like, 'Whoo, yeah.'" I felt-

Anna McDonald:

You know, like-

Tim Berglund:

And I'm pretty risk tolerant, so it took me a while to-

Anna McDonald:

Me, too. I was like, "Well, I guess everything's okay. Offices are open, so it should be fine." Then, they were like, "Yeah. No, we're [crosstalk 00:37:03]-

Tim Berglund:

Should be fine.

Anna McDonald:

"Enacting a curfew." I was like, "Well, something is bad. I should probably get on a train."

Tim Berglund:

So yeah-

Anna McDonald:

So yeah.

Tim Berglund:

And here we are months later, but-

Anna McDonald:

I know, and Halloween-

Tim Berglund:

I'm glad I could be a part of your last pre-pandemic conference talk. That was an exciting talk for any number of reasons, and anyway, we're at Halloween 2020. Go on. I'm going-

Anna McDonald:

Yes.

Tim Berglund:

To put some notes in the show notes to make sure these two talks get linked there, but tell us about how our heart might stop beating because somebody is old and somebody else is young.

Anna McDonald:

Okay, so what I've learned from researching this JIRA is shut the front door, joining groups is incredibly complicated.

Tim Berglund:

Dude, right.

Anna McDonald:

It is incredibly complicated, and I love this JIRA because it involves my absolute favorite thing in Kafka: Purgatory.

Tim Berglund:

Yes.

Anna McDonald:

I love Purgatory. It's great. Best name ever, because it's where things go to wait, and so the question really becomes, during a join, when someone's joining a group, when all of this stuff is going on, how do you deal with keep-alives? How do you deal with heartbeats? What do you do with them? When do you schedule them? How do you handle them? There have been a lot of fixes in the newer join group protocols that make this a nonissue. However, speaking as somebody who started out as an application developer, if you're somebody who's running Kafka consumers, you may not be responsible for what version you use, right? It's the lowest-

Tim Berglund:

Because it's diverted [crosstalk 00:38:47]-

Anna McDonald:

Exactly-

Tim Berglund:

By the broker.

Anna McDonald:

By the broker usually. What ends up happening here is the heartbeat is blocked because of the sequence of events that happen until the static timeout of five minutes for older consumers.

Tim Berglund:

No big deal.

Anna McDonald:

Right. Or is it?

Tim Berglund:

Right.

Anna McDonald:

Which it is, again, right? Theoretically-

Tim Berglund:

Which it actually really is.

Anna McDonald:

Theoretically, this could cause a problem. Yes, it does. Right. If a member falls out of the group due to expiration, then the heartbeat's going to expire and that's going to trigger another rebalance. That's what we don't like because rebalances take time. They're not free, which is what I always tell people when they say like, "Can we have a consumer group that's like a thousand people?" I'm like, "I don't know. Do you want things to work? Do you really need that many?" There are ways that you can obviously support more members of a consumer group, but you pay a penalty because of this type of thing, rebalance, assignments, all of these fun things.

Anna McDonald:

What ended up happening here, best as I can tell because, again, I was tracing down in the code like the callbacks and how this is done and it is very complex, which is why I think there's a lot of nuances here. What ended up happening is we basically try to force complete a heartbeat, which pulls it out of Purgatory. When a new member joins a group, there's a static timeout of five minutes and we throw a heartbeat in Purgatory. We say, "Okay, you stay there until this awaiting join, all that stuff's done. Then, we're going to force complete that heartbeat and schedule it."

Anna McDonald:

Well, the way that the logic was structured, that heartbeat getting force completed, which is what you want, as soon as the join is done you want to be like, "All right, the heartbeat's force complete and we're ready to go. Now just treat it like a normal member." That wasn't happening. It was just sitting there in Purgatory for five minutes, which is not ideal. What we ended up doing is restructuring that logic: we involve the join, and then we kind of say, "Is this member new?" If it is, then you do one thing, but if it's awaiting join or awaiting sync, you don't remove it. You just keep returning true until it's done with the join, which simplifies the logic for older versions that had perhaps different timeouts and different settings and all of that stuff.

Anna McDonald:

If it's not new and it's not in the middle of a join, you just check the last heartbeat and session timeout. That's how we determine whether or not a consumer instance is alive, so that's really what this is doing. It's saying, "Okay, there's a nuance when you're joining a group in the protocol for determining if you're alive." It was broken and so what we did is we kind of broke it down into... It's much more readable now, at least to me, to follow it, so like-

Tim Berglund:

Do you mean?

Anna McDonald:

Yeah. Well, yeah, just the logic. The code for sure.

Tim Berglund:

Okay.

Anna McDonald:

We kind of say, "Well, if it's new, okay," then you consider the group coordinator's new member join timeout, which is configurable because there are different use cases for how long you want to let a new member take to join, because, again, remember, during the rebalance for some protocols, it's a stop-the-world. If it's in the middle of a join, just wait. Don't remove that new member. Just wait till we know it's either joined or not, and then if it's not, then you just use the session timeout plus the last heartbeat. It kind of simplified it and made it work for those older versions.
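
The restructured liveness check she's walking through might be sketched like this. These names and states are hypothetical stand-ins, not the actual group coordinator code: a brand-new member gets the new-member join timeout, a member mid-join or mid-sync is kept alive unconditionally, and an established member is judged by its last heartbeat plus the session timeout.

```java
// Hedged sketch of the corrected heartbeat-completion logic for a consumer
// group member, per the three cases described in the episode.
final class HeartbeatCheck {
    enum MemberState { NEW, AWAITING_JOIN, AWAITING_SYNC, STABLE }

    static boolean shouldKeepAlive(MemberState state, long nowMs,
                                   long lastHeartbeatMs, long sessionTimeoutMs,
                                   long joinStartMs, long newMemberJoinTimeoutMs) {
        switch (state) {
            case NEW:
                // New member: allow up to the configurable new-member join timeout.
                return nowMs - joinStartMs <= newMemberJoinTimeoutMs;
            case AWAITING_JOIN:
            case AWAITING_SYNC:
                // Mid-rebalance: don't expire it; keep returning true until
                // the join or sync completes.
                return true;
            default:
                // Established member: the ordinary session-timeout check.
                return nowMs - lastHeartbeatMs <= sessionTimeoutMs;
        }
    }
}
```

The bug amounted to a member stuck in the first two cases being treated like the third, so its heartbeat sat in Purgatory until the static five-minute timeout instead of being force completed when the join finished.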

Tim Berglund:

I see that, and there's also that pseudocode in the JIRA itself that-

Anna McDonald:

It is.

Tim Berglund:

I was just reading it while you were talking through that, and can confirm it was accurately describing the new join heartbeat completion logic as of KAFKA-92-

Anna McDonald:

Ooh-

Anna McDonald:

And can we call out that Sophie did this and she just became an Apache Kafka Committer. Go Sophie-

Tim Berglund:

We can.

Anna McDonald:

Go Sophie. What? So cool.

Tim Berglund:

Nice job, Sophie.

Anna McDonald:

Yeah. She's amazing.

Tim Berglund:

Let's link to Sophie in the show notes.

Tim Berglund:

I don't know what link to Sophie means, but we're going to find that out.

Anna McDonald:

She's got a couple of dynamite blog posts. Amazing-

Anna McDonald:

So yeah, link to-

Anna McDonald:

Anything Sophie's got-

Tim Berglund:

Maybe Twitter accounts, typically a Twitter account. There's going to be some Sophie links in here because she fixed this JIRA. All right, we're coming up on time. We've got I think one more, right?

Tim Berglund:

Okay. We got KAFKA-9065. Your title is?

Anna McDonald:

I think I said Infinite Ghosts... See, I didn't write that one down. Oh, wait, yes I did, Anna.

Tim Berglund:

You know, you're going to have to send me this afterward because they're going to have to be in the show notes-

Anna McDonald:

Yes.

Anna McDonald:

This one is

Tim Berglund:

I'm not going to

Anna McDonald:

This one is Ghost Partition Haunts Forever, and specifically, technically, it's Ghost Log Segment, but that didn't sound as good as partition, so I went-

Tim Berglund:

Yeah. You get to log segment, you're like, "Whatever, nerd. Come on. Just get off it." You know? Loading Offsets and Metadata Loops Forever. Kind of frankly already a little, I don't know, somehow Newmanal in its orientation in the non-Anna title, but as usual, yours is better. Tell us about the JIRA and how it got fixed.

Anna McDonald:

Okay, so in Kafka, we assume that we can reach the log end offset by starting at the beginning of the log segment and incrementing the offset from the next fetched record, which makes sense. It's a durable log. It's durable. It's there. You start right from the beginning and then you increment the offset from the first record you read from the next log segment. That makes sense. That's reasonable. That's [crosstalk 00:44:59] rational.

Tim Berglund:

That's what I would do.

Anna McDonald:

Right, except it's not in some cases. If you read the pull request for this, it's a great conversation. Absolutely great conversation, and there's an example in here. What Jason explains in it, he says, "Okay, so if you're rolling a broker and it crashes, it's possible that you can get an empty active segment," and that's what this hits upon. Basically, what happens is it goes into an infinite loop. If it hits an empty log segment, and there are several legitimate reasons this could happen, right? It never loads consumer offsets. It never loads any of that. It backs up everything for that partition, which is, obviously, very, very bad. What we ended up doing is putting in a break, which is a good idea because you really don't want to have an infinite loop. That's just never a good idea.

Anna McDonald:

Yeah, but you're not just sitting there, right?

Anna McDonald:

Comes in, you're going to do something.

Tim Berglund:

You're doing things.

Anna McDonald:

Right. Nobody's going to go into that empty log segment and put some stuff in there. You know what I mean? There's no condition, I guess, that can free you from this torment. How's that [crosstalk 00:46:26] for spooky? That's spooky, right? Yeah.

Tim Berglund:

That is pretty spooky.

Anna McDonald:

Right, and so what they ended up doing is that they break from empty segments, so they don't count on the first record in every log segment to contain the next offset. This was fascinating to me because I have seen weird behavior in consumer groups and consumer group offsets and loading, so knowing this is kind of... I love this because if a customer hits... By the way, again, the fix version for this is like 2.4.1, so there are people out there that are still running older versions.
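
The fix can be illustrated with a toy replay loop. These are stand-in types, not Kafka's actual LogSegment code: the loop used to rely on each segment's records to advance the offset, so an empty segment could never make progress, and the break ends that torment.

```java
import java.util.List;

// Illustrative sketch: reconstructing the log end offset by replaying
// segments, where each segment is just its list of record offsets. An empty
// segment can't advance the offset, so without the break the loop spins.
final class LogReplay {
    static long logEndOffset(List<List<Long>> segments) {
        long endOffset = 0L;
        for (List<Long> recordOffsets : segments) {
            if (recordOffsets.isEmpty()) {
                break; // the fix: stop instead of waiting on records that never come
            }
            // The next offset is the last record's offset plus one.
            endOffset = recordOffsets.get(recordOffsets.size() - 1) + 1;
        }
        return endOffset;
    }
}
```

With a zero-byte active segment at the end (say, left behind by a broker crash during a roll), the loop now terminates with the offsets it has, instead of blocking offset loading for the whole partition.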

Anna McDonald:

It's just so nice to be able to go, "Hey, can you do me a favor and check to make sure that you have no zero-byte log segments?" if you're on an earlier version. That's really why this is kind of my favorite thing in the world to do, to deep dive on these types of JIRAs, and you can understand more about how these are loaded and how things work in Kafka internals, which I think just makes... It helps me at my job, but also I adore Kafka, so the more I know about it, the happier I am.

Tim Berglund:

Exactly, and that's one of the reasons why, honestly, doing spooky ones for Halloween is good. It should probably be like a quarterly feature, just digging through some strange recent JIRAs, because things like this are especially interesting edge cases, like the Connect rebalance one or the consumer rebalance one. These are things that if you're kind of an intermediate Kafka user, you're definitely aware that those things exist, that Connect rebalances, that consumer groups rebalance. But do you know how those things work? Most of us don't unless you've had to go through the trenches on one of these problems. You probably don't know the details, and so digging into these things forces you to learn those details.

Tim Berglund:

You have to for your job. You're sort of a world-class expert on these things, but for everybody whose job is not that, who's just building applications on Kafka, it's still, I think, intellectually salutary to go through these things and understand them the best you can, because you get really good internals knowledge. Ideally, somebody else is dealing with operational concerns, you're running in the cloud and all of that kind of stuff, but all of these things... I'm just looking back through these. All or almost all of them have a client component, so you've still got to know stuff, and I'm glad we have Anna to teach us stuff.

Anna McDonald:

Yeah. I mean, I would agree 110% with that, and I would also say that the thing I don't like... when I get busy and, obviously, like everybody else, it's quarantine. You're working from home. I think, me personally, that tends to have people work more, not less. I've been a remote worker for years and years and years and years and years and years and years and years and years and years.

Tim Berglund:

Yeah, you and me both.

Anna McDonald:

If you're not disciplined about it, you can tend to try to work more, and I think finding time to get the satisfaction from truly understanding a process, it's just like you can't compare it to anything else. When I first truly understood Purgatory, for example, I was just so happy because all of a sudden, I could draw all of these connections and that's why this is my favorite show to do. I was so happy last night when I discovered that the maximum epoch for a producer state was 32,766. I was like, "What? That's wonderful." It's really-

It’s Halloween again, which means Anna McDonald (Staff Technical Account Manager, Confluent) is back for another spooktacular episode of Streaming Audio.

In this episode, Anna shares six of the most spine-chilling, hair-raising Apache Kafka® JIRAs from the past year. Her job is to help hunt down problems like these and dig up skeletons like:

If JIRAs are undead monsters, Anna is practically a zombie slayer. Get a haunting taste of the horrors that she's battled with as she shares about each of these Kafka updates. Keep calm and scream on in today’s special episode of Streaming Audio!

Continue Listening

Episode 127November 2, 2020 | 49 min

Distributed Systems Engineering with Apache Kafka ft. Apurva Mehta

What's it like being a distributed systems engineer? Apurva Mehta explains what attracted him to Apache Kafka®, the challenges and uniqueness of distributed systems, and how to excel in this industry.

Episode 128November 12, 2020 | 52 min

Why Kafka Streams Does Not Use Watermarks ft. Matthias J. Sax

Matthias J. Sax is back to discuss how event streaming has changed the game, making time management more simple yet efficient. He explains what watermarking is, the reasons behind why Kafka Streams doesn’t use them, and an alternative approach to watermarking informally called the “slack time approach.”

Episode 129November 18, 2020 | 50 min

Distributed Systems Engineering with Apache Kafka ft. Roger Hoover

Roger Hoover, one of the first engineers to work on Confluent Cloud, joins Tim Berglund to chat about the evolution of Confluent Cloud, all the stages that it’s been through, and the lessons he’s learned on the way.
