Multi-DC Kafka has come a long way in the last few years and Anna McDonald has been a part of that. So much so that there is an eponymous pattern, the Anna Pattern, for how to get that done. And she's going to explain that to us today in terms of hanging a curtain rod, and also tell us about automatic observer promotion in Confluent Platform 6.1 and above. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello and welcome to another episode of Streaming Audio. I am as usual, your host, Tim Berglund and I am joined in the studio again by my friend and co-worker, Anna McDonald. Anna, welcome back to the show.
Thank you and yet again I'm surprised I'm allowed back. It's just every time it's [inaudible 00:54:00]
It is, as I've said recently, I think on your part, the triumph of hope over experience.
It is. That's right.
Yeah. I'm always delighted to have you on the show. You're famous, of course, now for your two Halloween episodes and we were thinking, how can we do an Easter episode? And I don't know that it's obvious. The Halloween thing, that makes a lot of sense. I just thought this is a holiday that we should observe. And to have you on as a special thing. To be honest, I don't think you need a holiday to be on the show. I think you should just be a regular feature. But here we are and we are going to talk about, today, what I like to call the Anna Pattern. There actually is, I can tell you in Confluent Slack, there is a whole channel called Anna Pattern.
That's Matthias J. Sax.
Yeah. That is Matthias J. Sax.
Thank you for that. Dr. Matthias J. Sax.
It is actually Dr. Matthias J. Sax and it is typical for Computer Science. One of my best friends is a professor of Mechanical Engineering and so they actually call each other doctor because students and things like that, because they're professors. It's funny in Computer Science, if you have a PhD, you don't talk about it, right? It's this weird thing.
It's true.
He's Dr. Sax, and there's Dr. Posner, and there's all these doctors running around. You don't even know it.
It is. That's very true.
The Anna Pattern, specifically, has to do with automatic observer promotion. And this is the thing: recently, as of this recording, Confluent Platform 6.1 was released and automatic observer promotion was talked about there. I talked about the Anna Pattern by my fire pit.
It looked cozy.
It was, actually. Yeah, it was nice.
Awesome.
But tell us about this. What are the circumstances that give rise to it? What's the new feature? You're the one that this is named after, so [crosstalk 00:02:53]
Yeah. The first thing, I'll just do a level set. In many industries, you want to be resilient and sometimes that's because you're a good citizen of your data, but more often than not, it's because you're required to by law, right? You're regulated. So, we're talking about banking, insurance, places where the law has stepped in and said, "Look", right? If you stop working, there are going to be economic impacts that are not fun.
Yeah. It's not just a good idea but ultimately there are people with guns who will make it so that you can just-
Yeah. I don't know if the Fed... I could see Janet Yellen though, literally stepping in there. She's pretty cool. But anyhow, yes. And so, one of the things that I noticed early on was that most people in the U.S. had two DCs. They only had two. And to explain this and why it matters, and I kind of wanted to do this because there's some confusion when we try to explain why this is a problem, right? And it can actually happen with any number of DCs. And I'm going to explain this at a high level, in terms of curtains, because I'm famous for my incredibly-
The basic problem though is-
...spot on analogies.
...high availability.
Well, right.
And you are, amazing.
Yeah. And I'm taking a little detour here to explain the problem, right? So let's say that your system is a curtain rod. And this happened because my six-year-old pulled down her curtains the other day. So, let's say your system is a curtain rod. Your curtain rod will stay up as long as you have at least two brackets. That's what you need, and in terms of DCs, that's how you're in compliance. You're in compliance if you've got at least two brackets, your curtain rod's up, and you're good to go.
Mine's the tension kind where you screw it.
No, [crosstalk 00:04:46] the banking industry is not allowed to use those. They must have all of the screwed brackets. That's their regulation.
Got it.
Yes.
Okay. I'm thinking of a shower curtain rod. Anyway, you're probably thinking of a window.
No. Curtains like windows. Yeah, on windows.
Yeah, okay. The bracket kind. I'm with you. [crosstalk 00:05:03] I don't have that screw kind.
Yeah, so screwed in. And so, if you've only got two brackets, I don't know if you've ever seen any of those money fancy windows that people have where they're super long, and then they actually have to have a third bracket in the middle to hold up the curtain rod because it's so heavy money and bougie?
Yep!
Yeah. So, if you have only two brackets because you don't have a bougie curtain rod, and your kid comes along and yanks one side of the curtain and it falls out of the wall because maybe you were in a hurry and you didn't [crosstalk 00:05:36],
Hypothetically.
I'm not saying that happened at my house, and then I got a funny look like, "You didn't put that in the stud". I was like, "Dude, I was..." Anyway, so it falls down. If you only have two, it falls down and one dies basically. So then, you have to get your husband or somebody like my husband, Luke, to hold up the curtain rod while you fix the bracket.
Cool Hand Luke.
Right. That's right. Cool hand by Luke. Underscore, because we're both wrapped in the underscore. But the thing is, if you've got one of those bougie windows and you've got three brackets on your curtain rod, and your kid comes along and pulls the curtain down and one bracket comes out, your rod still stays up because you've got two brackets. Now, if they, like, sucker punched the middle one and you're left with only one bracket, it's still coming down and you need a Luke to, like, hold it up while you fix it. So in this analogy [crosstalk 00:06:29] you need it, right. The brackets are your DCs, the number of DCs you have. And Luke is the observers, who can temporarily hold up the curtain while you screw it back in. So this is the analogy, right? So it's all about,
I know how observers work and I actually didn't see that coming where Luke was the observer [crosstalk 00:06:49]
No, I blew your mind.
Yeah. Okay. Go on.
Right. And the reason I say this is because I just gave a talk about this yesterday and people get really caught up in, this is only for a 2.5 DC scenario where you've got two DCs and you've got a ZooKeeper stored somewhere else, like thrown in the closet just hiding. But it's not. It's any time you're left with one zone. So you could have, like, that bougie curtain rod with three zones or three DCs. If your resiliency requirement is that I have to still be operational if two of those DCs die, then you need observers and you need a Luke to hold that up.
So it's really any number of DCs. It's about what your resiliency tolerance is. And that's how I want to state this problem. The problem is I've got multiple brackets. I need to make sure that those are there for resiliency requirements, but I also need to make sure if all but one go down, I can still produce while I'm fixing them. My curtain's still up, basically. So that's how I'm doing this. This could be the world's worst analogy, which wouldn't shock me. But I think it's pretty good. I think it's pretty good.
No, it might be the world's best. This is the best analogy that is in there. Better than the other one.
It is better than the Cantaloupe one. That left everyone feeling melancholy?
Yeah.
Folks, she actually can do that all day. That's the funny thing, like she doesn't get this stuff.
So, knowing what we know now about the numbers, why is this a problem for Kafka? Why is this a problem in Apache Kafka? And the reason is the way that you ensure that when you have a stretch cluster. And I always do this, and if we're showing video, I always do DC 1, DC 2, right? Or zone 1, zone 2. When you have a stretch cluster like this, there's one cluster for MRC, I'm going to have to, like, pop up and down. And so, when I'm producing to this one cluster, I don't want to say yes, I've acknowledged a message, till it hits this one and this one, until it hits them both. And in order to do that, we have these two concepts in Kafka. One is replication factor, and the other is min ISR. Replication factor is how many copies of this partition do I have?
Min ISR is how many copies do I have to have that are in sync. That means that they're available, right? They're all up to date, in order for me to want to produce. And the reason people do that is because if I only have one left, I might lose it. So I only want to produce if I know that it's going to go to multiple copies, basically. That's kind of what the idea of min ISR is. So you set acks equal all, you've got a min ISR that is a protection. And if all goes well, and you meet those criteria, your produce request works. So what you do, and this is because Kafka basically has, you can either produce, your acks can be, like, none, one, or all. Those are the only options.
Or all. And all the in-sync replicas.
That's correct. Yep. And so what ends up happening in a 2.5 DC sitch is, when I produce I want to make sure it hits both DCs. Otherwise, I'm not getting any resiliency. If I produce and a copy of that message isn't sitting in both DCs, what am I doing? Why am I here? Right. Because that's the requirement.
Lose the DC, I lose the data.
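The interaction between acks and min ISR described above can be sketched as a toy model. This is an illustration in Python, not Kafka's broker code; the helper name `broker_accepts_write` is hypothetical. It assumes the documented behavior that the min ISR check (the `min.insync.replicas` topic setting) is only enforced for producers using acks=all.

```python
def broker_accepts_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Toy model of the leader's check on a produce request (hypothetical helper)."""
    if acks not in ("0", "1", "all"):
        raise ValueError("acks must be '0', '1', or 'all'")
    if acks == "all":
        # acks=all: the write must reach every in-sync replica, and the ISR
        # must hold at least min.insync.replicas members or it is rejected.
        return isr_size >= min_insync_replicas
    # acks=0 or acks=1: only the leader matters, so min ISR is not enforced.
    return isr_size >= 1


# Four replicas with a min ISR of three: healthy cluster, then one DC lost.
print(broker_accepts_write("all", isr_size=4, min_insync_replicas=3))  # True
print(broker_accepts_write("all", isr_size=2, min_insync_replicas=3))  # False
print(broker_accepts_write("1", isr_size=2, min_insync_replicas=3))    # True
```

The last line shows why lowering acks keeps producers running during an outage, at the cost of the cross-DC guarantee.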
Right. And so in order to do that, what you have to do is you have to say, I have to have my min. ISR one greater than the number of replicas or copies located in one DC. So if I've got two DCs and I've got like two replicas over here, and two over here, I have to have a min. ISR of three.
I need three.
Because that way I know that it hit at least two here and one here. And if your min ISR drops below three, which can happen, let's say this DC fails, now I'm below my min ISR. I'm not allowed to produce. My curtain rod has fallen down.
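The sizing rule just stated (min ISR is one greater than the number of replicas in a single DC) can be worked through with a small sketch. The function names and the DC names `dc1` and `dc2` are hypothetical, chosen only for this example.

```python
from typing import Dict


def min_isr_for_stretch_cluster(replicas_per_dc: Dict[str, int]) -> int:
    """One more than the most replicas held in any single DC, so an
    acknowledged write always has a copy in at least two DCs."""
    return max(replicas_per_dc.values()) + 1


def isr_after_dc_failure(replicas_per_dc: Dict[str, int], failed_dc: str) -> int:
    """Best-case ISR size once every replica in failed_dc is gone."""
    return sum(n for dc, n in replicas_per_dc.items() if dc != failed_dc)


placement = {"dc1": 2, "dc2": 2}  # hypothetical DC names, two replicas each
min_isr = min_isr_for_stretch_cluster(placement)
remaining = isr_after_dc_failure(placement, "dc2")

print(min_isr)               # 3
print(remaining)             # 2
print(remaining >= min_isr)  # False: producers using acks=all are blocked
```

With two replicas in each DC, the rule gives a min ISR of three, and losing either DC leaves only two in-sync replicas, which is the fallen curtain rod.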
The application is down.
That's right. My curtain is down.
Yes. There's no Luke, there's nothing.
The neighbors are looking, it's uncomfortable. So it's just not good. So then, and this is the situation that we were in before we had the Anna Pattern for automatic observer promotion. You basically had to intervene. You basically had to lower your acks on your producer, or you had to lower your min ISR in every topic. And that's, like, an operator thing where an operator would have had to go in and do that, or a developer would have had to go in and redeploy their applications. That's not fun.
Which would mean an application configuration change.
During an outage. Who wants to do that? I mean, it's just rife with risk. So the question then is how do you get a Luke? How do you get somebody to temporarily hold up ISR while you fix your curtain, while you bring your DC back up? And so when I was thinking about this, when we first introduced MRC, we also introduced observers. And observers are an asynchronous replica. The thing about them is they don't count in the ISR because they're asynchronous. They're just sitting out there, they're a copy of the data.
Asynchronous with respect to the produced-
Correct.
...of logic, right?
Yes.
They're taking replicas, but I'm not going to wait for them.
Correct. Right. They may or may not be in-sync. And so when I was looking at this early on, I was like, well, why don't we do this? I said, in order to get around this problem, why don't we auto promote? Like, if you fall below min ISR, so if you're in this situation where you used to have two replicas and two replicas, min in-sync of three, this dies. If you had an observer sitting over here, why don't we have the leader say, "Hey, I'm below min ISR. I need to be three, I'm only two. Come help me. Hold up this curtain rod while I bring this back up." And so that's what we did. That's the Anna Pattern. An observer sits over here, and as soon as you drop below min ISR, the leader will go, "Hey, get over here." And it's like, okay. And then it comes back up and you can produce until you can get this DC back up. When that happens, this observer will automatically be demoted. And you'll go back to your happy path, two and two. Your good replica placement policy.
So they're really a temporary fix. They come in to help you until you recover, and then they go right back out. And that's kind of the entire idea behind this. And again, it was designed for a situation where it's like a 2.5 DC, where you have two DCs. But that's why I started out with that, like, bracket example, because you could have three DCs. If your requirement is that you have to still be up if two of those die, you're in the same area, you're in the same bucket- [crosstalk 00:13:54].
Yeah. Somebody's got to come hold the curtain rod.
...You could, you can use this pattern for that as well. Exactly. Because the curtain rod will still fall down. So that's kind of like the long and short of this pattern.
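For the two-and-two layout with observers described above, a replica placement policy might look roughly like the JSON built below. This is a sketch, not a verified configuration: the field names (`version`, `replicas`, `observers`, `observerPromotionPolicy`) and the `under-min-isr` value are written from memory of Confluent's multi-region cluster documentation and should be checked against the docs for your version, and the rack names `dc1` and `dc2` are hypothetical.

```python
import json

# Two synchronous replicas and one observer per DC; rack names are hypothetical.
placement = {
    "version": 2,
    "replicas": [
        {"count": 2, "constraints": {"rack": "dc1"}},
        {"count": 2, "constraints": {"rack": "dc2"}},
    ],
    "observers": [
        {"count": 1, "constraints": {"rack": "dc1"}},
        {"count": 1, "constraints": {"rack": "dc2"}},
    ],
    # Assumed policy value: promote an observer only when the partition
    # drops below min ISR, and demote it once the ISR recovers.
    "observerPromotionPolicy": "under-min-isr",
}

placement_json = json.dumps(placement, indent=2)
print(placement_json)
```

A topic using a policy like this would pair it with a min ISR of three, so that losing one DC leaves two synchronous replicas plus one promotable observer.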
But historically, with observer promotion, historically, it's like you had to call Luke from downstairs or something and maybe he was busy and it took him a minute to get there. And you were hands on in getting that person to prop up the rod for you. [crosstalk 00:14:20]
I'm going to be honest, I'm going to add to that. It was like he might've also been drunk and cause damage. Because the thing was, is you had to enable unclean leader election. You had to go eyeball and make sure that these observers were caught up per partition, which in a large cluster, that's thousands of partitions. You don't want like a buzzed Luke holding an expensive curtain rod above your China cabinet. It's just a dangerous situation. You want [crosstalk 00:14:48] on their game. That's right. So it was even a little bit worse than that. And now you don't have to think about it. That's what I love about this. You just don't even have to think about it. It just happens. And so that's the long and short of it.
Nice.
Thank you.
With the automatic observer promotion. Now, do you know the logic inside the observer? How do we get away with not having to have unclean leader election? Like, I understand why that had to be enabled when you really needed to promote. How does that work now?
So the coolest thing. Yeah. So again, I think I've said this before, my favorite class ever is Partition.scala. If you haven't looked at it, you should. It's a beautiful thing. And so in there, there's a method called maybeExpandIsr. It's because really, well, replicas have the same issue. So you can have replicas that fall out of sync, right? We call those under-replicated partitions. We hate those. And that's when ISR shrinks and expands because something's wrong with one of the replicas.
So what ends up happening is now there's a check and that check says, "Oh, are we under min ISR?" Well then, go check observers too. Is there an in-sync observer? Because those observers will also report back whether or not they're in-sync. They still report back. They just don't count in ISR. So then the leader is going to go, "Oh, hey, I see you. I need your help. Come on." And so that's all handled and [crosstalk 00:16:21]
You're in-sync. You're eligible to be promoted.
Right. And what I love about that is we've been running, right, Kafka has been running that code base, that algorithm, forever. So it's battle-tested. It's not something new. It's basically making use of an existing process that we know works great, and just letting observers become eligible if they're in-sync and you're under min ISR. It's actually very elegant.
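The check described above (under min ISR, so look for an in-sync observer) can be illustrated with a toy model. This is a sketch in Python under simplifying assumptions, not the logic in Partition.scala: the class `PartitionModel`, its fields, and the broker names are all hypothetical, and "caught up" is reduced to simple set membership.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PartitionModel:
    """Toy model of one partition; not Kafka's Partition.scala."""
    min_isr: int
    replicas: Set[str]    # synchronous replicas
    observers: Set[str]   # asynchronous replicas
    caught_up: Set[str] = field(default_factory=set)  # brokers reporting in-sync
    isr: Set[str] = field(default_factory=set)

    def update_isr(self) -> Set[str]:
        # Caught-up synchronous replicas always belong in the ISR.
        isr = self.replicas & self.caught_up
        # Only while under min ISR, borrow caught-up observers until it is met.
        for observer in sorted(self.observers & self.caught_up):
            if len(isr) >= self.min_isr:
                break
            isr.add(observer)
        self.isr = isr
        return isr


p = PartitionModel(
    min_isr=3,
    replicas={"dc1-a", "dc1-b", "dc2-a", "dc2-b"},
    observers={"dc1-obs", "dc2-obs"},
)
everyone = p.replicas | p.observers

p.caught_up = set(everyone)
print(sorted(p.update_isr()))  # happy path: the four synchronous replicas

p.caught_up = {"dc1-a", "dc1-b", "dc1-obs"}  # dc2 is lost
print(sorted(p.update_isr()))  # dc1-obs is promoted to reach min ISR

p.caught_up = set(everyone)  # dc2 is back
print(sorted(p.update_isr()))  # the observer is demoted again
```

Because the ISR is recomputed from scratch each time, demotion falls out of the same rule as promotion: once enough synchronous replicas are back, no observer is borrowed.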
Right. Okay. So, that makes total sense. And it's obvious that it's battle-tested, sort of older code because it's still in Scala. Right?
Yeah.
You see parts of the code base that are in Scala and you're like, this is sort of the... I mean, no, not that Scala is bad. I mean, clearly, insert Scala joke here, this would be the time to do that. And if Viktor were on the show, he would definitely make some Scala jokes right now and they'd be funny. But what was I going to say? Oh yeah, that's sort of like the reptile brain, the brainstem of Kafka, the early layers laid down in Scala, and now you've got all this Java. The medial prefrontal cortex of Kafka is not written in Scala, to use suddenly very obscure neurological terms.
I like it. I'm a fan.
Yeah. I have to, I'm going to credit him. But I feel like I have to tell my favorite Scala joke of Viktor's, which is off topic, but slightly-
I would like to hear it.
...since we're talking about Scala, okay. This was at Kafka Summit a couple of years ago. He was showing some Kotlin code. This was, like, in the early days of him doing Kotlin, which is good, I'm glad he does that. And by the way, I'm referring to our colleague Viktor Gamov. He's a developer advocate and an occasional guest on this show.
He has great taste in high tops.
Yeah, no, strong, strong shoe game in that man. So he's showing some Kotlin code and he said, okay, this is Kotlin. And if you're a Java developer, you'll probably find that you're able to just read this, in contrast to Scala code, where even if you're a Scala developer, you can't read it.
It was funny. I liked it.
It was funny. Utterly gratuitous, right. There was no reason he had to go there, at all.
No, and I like that about it.
But he did.
I do. I like that about him. That's funny.
It was good. Yeah.
I do enjoy that.
Anna, are we done? I feel like you've explained it.
I have. Yeah. I mean, I think I have explained it now. I've explained it at, like, an awesome level. Configuration wise, there are still things that you're going to need to do to take advantage of automatic observer promotion. There's monitoring. And then there's my favorite. I'm actually working right now on a tuning guide, just in general, for, like, low-level latency for this pattern. So there's other things you need to do to make this whole ecosystem awesome. But as far as, like, resurrecting your ISR automatically, that's it, man.
See what you did there. Yeah.
I do. I did. Did you see what I did there? It's good, right?
That was good.
Happy Easter.
That's a theme I appreciate.
That's right. So, so yeah. I mean, that's about it, man.
I felt like it was going to take longer, but you made it make sense in 20 minutes. Like how do you do that?
I don't know. I think that's why I was so excited about the curtain rod thing. Because before this, it took a lot longer. I was just trying to, it's a numbers game. And it's hard to explain that in a fashion that doesn't get you way too deep into, like, racks and AZs and all of this, like, lingo. And so at a high level, I was.
And you got to whiteboard that.
Yeah. Yeah. And at a high level, I was just.
It doesn't work well in podcast.
No, it does not at all. It really doesn't. Can we talk? You know what, if we've got time, like, I'm excited for Kafka Summit as an attendee.
Let's talk about it a little bit.
Okay.
I'm excited.
I'm excited to see Neil's talk. I'm so excited because we got all these new Kafka Streams metrics and he has made these incredible dashboards that are going to help people so much. And I cannot wait to see his talk. I absolutely cannot wait. I'm so excited.
Yes. Referring to Neil Buesing, our friend. Yeah.
Is there another Neil that we should really-
The number extraordinary?
Do you know another Neil? Cause I don't know.
I know several other Neils. There's the late Neil Postman, who certainly has had, I think I would say, a formative intellectual influence on me-
Neil Diamond?
...his book, Amusing Ourselves to Death. Less of an influence on me in the case of Neil Diamond. There's my friend, Neal Ford, who has made a big impact on me as a communicator. There's Neil Avery, our onetime coworker and awesome guy.
There's my friend Neil who lives in Boston. He's pretty cool.
Yeah?
Yep. I don't have a lot of Neils. You got better? Yeah. I mean, I've got my Neil. That's why, I guess, I don't ever qualify and say, Neil. You were saying?
I'll make sure I link to Amusing Ourselves to Death in the show notes. It has nothing to do with Kafka, but I just think it's a super good book for thinking about media, and sort of like Marshall McLuhan applied.
I like it.
But yeah. Yeah. That's not for this podcast. That'd be for a different podcast, to really deep dive into that.
Speaking of which, I'm listening to a podcast. Oxford University has published all of their, like, lunchtime lecture series.
Oh nice.
Yeah. And they're hilarious and amazing. Like, there's one that's, like, 800 years of the history of mathematics at Oxford University. It is hilarious and awesome.
I am currently getting my phone out because I needed to recall the name of a podcast. I'm just about done with an episode of Legends of Sales and Marketing by People.ai, because our own president of worldwide field operations, Erica Schultz, was just interviewed in that. That was interesting. Like, I'm not a sales guy, but I've been reading just some blogs on, sort of like, what's enterprise SaaS sales? How does it work, what are all these words that people use? It's just good stuff to know. And of course the good old Lexicon Valley and EconTalk. Those are my go-tos.
Yeah. You're always on your game, Tim.
Well, and thank you, Anna. But in the spirit of my long-time favorite podcast, EconTalk, I will say my guest today has been Anna McDonald. Anna, thanks for being a part of Streaming Audio.
Thank you.
Hey, you know what you get for listening to the end, some free Confluent Cloud, use the promo code 60PDCAST that's, 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter at @tlberglund that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign up links for those things in the show notes, if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five star review and we think that's a good thing. So thanks for your support and we'll see you next time.
As most developers and architects know, data always needs to be accessible no matter what happens outside of the system. This week, Tim Berglund virtually sits down with Anna McDonald (Principal Customer Success Technical Architect, Confluent) to discuss how Automatic Observer Promotion (AOP) can help solve the Apache Kafka® 2.5 datacenter dilemma as a feature now available in Confluent Platform 6.1 and above. Many industries must have a backup plan not only to do the right thing by the data that they collect but because they are regulated by law to do so.
Anna has a knack for operational patterns that make replication of data possible both synchronously and asynchronously. To avoid roadblocks in stretch clusters, she’s found that you need to set both a replication factor and a minimum in-sync replica (ISR) count. You need not just one but multiple copies to meet your data protection criteria. If a datacenter fails and the ISR drops below that minimum, producers can no longer write, which means that your application is down during the outage. Observers are asynchronous replicas that don’t count towards that minimum ISR.
Automatic promotion works because observers that are caught up can join the ISR temporarily without invalidating any other standards. Architects should try to maintain topic availability during an event in a two-zone configuration, while ensuring that writes go to both zones during normal operation without compromise. With Confluent Platform 6.1 and above, an in-sync observer is promoted when a partition falls below the minimum ISR and demoted again once the failed replicas recover. AOP is an excellent solution for developers who want to prepare for the unexpected and maintain accessibility across zones. When you can avoid manual intervention, you’re more likely to avoid errors and tedious operations, which would otherwise lead to a higher probability of data loss.
In other exciting news, Anna shares about discovering patterns in order to make the entire Confluent ecosystem more automatic.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.