Hello, you're listening to the Streaming Audio podcast. And today we're talking about the realities of going into production with an old friend of Streaming Audio's and an old friend of mine, Jason Bell. One of the things that makes Jason interesting is he's spent a lot of time with Kafka from an angle that's unusual to me. He doesn't so much use it as he makes sure it's available to be used. He's the operations guy. He spends his time planning clusters and setting them up, maintaining them as they live and grow. And understandably, he's developed some hard-won knowledge about where it can go wrong, what you need to watch out for, and what you should plan in advance.
So we thought we'd get him in for some advice. And he has battle scars, yes indeed. He has scars for the whole system, from brokers and topics to Kafka Streams and connectors and ksqlDB. And we start with one of my personal favorite axes to grind, understanding your data model. We should probably actually ask him to write a blog post about that one day. It would make a good guest post on Confluent Developer, which is our education site. We put everything from blog posts and old episodes of this podcast to free courses covering Kafka internals, security, stream processing, and lots more. You can check it out at developer.confluent.io, but for now let's get the episode started. I am your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
My guest today is Jason Bell. Jason, welcome back to Streaming Audio.
Thank you, Kris. Thanks for having me.
Good to have you. You're a return guest to Streaming Audio.
This is my third time.
Third time. But your first time with me, so behave.
Let's try. Let's see where we go. Is this the reason that Tim left? He just ran off?
There is a rumor that Tim Berglund was broken by [crosstalk 00:02:03].
We'll find out if I'm made of more resilient stuff on the Jason Bell axis and we'll see how it goes.
Considering how much time we're already into this, in the edit, we're doing well.
We're doing well. But you and I have a background.
It goes beyond Kafka. Because we first met years ago at Clojure Exchange.
2016, I believe.
16, that's a long time in internet years.
It is a very long time in internet years. And it was around about the time when I started using Kafka.
Yeah, of course. I'm quite old.
No. I was a bit too slow on that. No, you're not old, honest.
Stop. You're old. So 2016, I was doing the talk at ClojureX about a technology called Onyx, which was a masterless distributed system.
And it could read from Kafka and that's why I was interested in it. And it was built in pure Clojure, which was really, really, really, really interesting to me. And as you are well aware, I was working with a lovely gentleman and still a dear friend of mine, Bruce Durling.
That would be amazing.
Yes, the amazing Bruce Durling. We're trying to get that changed by [inaudible 00:03:33].
If you're listening to this and you don't know Bruce, he has been the beating heart of the London Clojure scene for a long time and he's a great guy.
There's a good group of people around all of that stuff. And I still use Clojure on a daily basis for my own stuff. So I'm still involved, kind of. So that's how I know... and you were sat in the front row, not heckling. That came later.
I'm a front row pay attention kind of guy.
In all fairness, most of that audience were, because that was the first time I did that talk. And then three years later they were heckling me. It was hilarious. Because I actively encourage heckling in all my talks, as you are fully aware. So it's all good fun.
So take me on the journey from 2016, you first discovered Kafka to where you are today?
So it was actually working with Bruce that I got into Kafka. So we were supporting a customer who was sending quite a lot of data through and then using Kafka for it. So it's just one of those things I picked up, and this was early Kafka. This was when consumer offsets were still being stored in ZooKeeper.
Oh right. Okay.
So we're going back. And there was no streaming API, there was no Kafka streams. There was no Kafka connect. There was no KSQL or anything like that. It was just here's Brokers, here's Zookeepers. Messages didn't even have keys at that point.
That's how far back we're going. So it's been interesting. My journey through Kafka... and I think Tim and I spoke about this as well. I've watched these technologies come and get added to the ecosystem and it's been really interesting. Things like the streaming API were really important to me. One of the first jobs that I built was on the streaming API. And I actually did another ClojureX talk on it. It was, How I Bled All Over Onyx, I think... that's what we called the talk. Because I was pushing way too much data through Onyx, to be fair. I was breaking it left, right, and center. And Kafka Connect didn't exist at that point. So I was using the streaming API for writing persistence out to sink stores like S3 or databases, that kind of thing. So they were all handwritten at the time. And then Kafka Connect came out and then KSQL came out. And I saw Michael Molyneaux do the KSQL demo at Strata Data Conference, the O'Reilly conference in London. And he managed to write off 80% of my streaming API jobs in one talk. I was like, that's my career over.
That's a mixed blessing when those things come along. It's like, all my work's going to disappear and that's good and bad.
Interestingly, I became more known for being able to run Kafka than necessarily do development work on it. So I came to a fork in the road where I was doing less development and more advising on putting the clusters together. And that's how I landed the gig with Digitalis in 2019 as a Kafka expert. So I've been working with them for the last three years and I just finished last week. And I'm now working for a company called DataWorks, who don't use Kafka they use Pulsar.
Oh interesting. So we're going to stay on the topic of Kafka for now, unsurprisingly.
Well it's contractually the way isn't it?
It's not a hundred percent guaranteed, but we do have a natural gravity towards that topic.
I think the shareholders may approve of that.
In the long run. But the question is, do the listeners approve? That's the big question, that we always focus on. So we're going to talk about Kafka. Tell me about Digitalis first. Why were they using Kafka? What did they use it for?
They were a managed services company, so a lot of their clients were using Kafka.
So that's how I landed with them.
So you must have ended up planning a lot of different clusters for them?
Yeah, and a lot of throughput conversations and a lot of design work around data. Once we started talking, you'll probably get the picture that I'm quite a pen-and-paper guy and I need to know the process from one side to the other. Where does the data start, where does it end, and where does Kafka fit in the middle? That's the way I see things. So we planned a lot of things around that. We did customer development work as well. And just as I was leaving, Debezium was coming up in conversation an awful lot as the next thing for change data capture stuff as well. So I had a great time with Digitalis, they're a great bunch, but I decided it was time for a new challenge and time to move on.
We've all hit that point in our careers. The best thing you can get is that you're moving on for good reasons.
Absolutely. And I am on this one. Definitely, definitely. I love the team at Digitalis, they're a fantastic bunch. And they're one of the rare companies where I can say, hand on heart, I never had a disagreement with anyone in three years. It was like everyone was professional, knew their stuff, and just put the customers first. It was brilliant. And that's rare, I've found, in my career anyway.
It can be a roller coaster at times.
It can be. I've worked with some very good companies, I've worked with some bad ones as well. That's the way it goes.
Let's not go onto that because-
Let's not, let's not.
... whenever you get enough-
Because you'll get me to start naming names and then I'll have lawyers and things like that.
Programmers have those conversations a lot, but they're not recorded and they're not broadcast.
There should be a speakeasy for developers somewhere.
There probably is, somewhere.
There probably is. We don't know about it then.
If we did, we couldn't say.
Or if you know about it, what have you said about me?
Moving on, moving on. Because this is what I really want you to tell me. I want you to teach me about planning a cluster.
Ah right. Haven't [crosstalk 00:10:07].
No, come on.
A great many questions. Thank you very much. Where do you start? New client comes along, they've got to plan a cluster, where do you even begin?
I deep dive on what the data actually is. I go a step before and say, "What is it we're actually dealing with here?" And that's actually quite a relevant conversation now based on just what I said about Debezium. If we want to acquire data from a database, then Kafka Connect comes first, with Debezium obviously being a connector. And I drill down to the very core components of, what's the message, what is it? Is it a JSON thing? Is it just a straight piece of text? It could be an image. There's nothing to stop you sending an image through Kafka. And have those conversations first. And then it becomes, how many are you sending? Because I've noticed... and I did a Cleveland meetup during lockdown because Dave Klein kept asking me to do meetups in Cleveland. Just so I could say, "Hello Cleveland."
Of course that's why you did it, of course.
Had to. I just have to, it was great. No, I think three people got the joke, but it was great. I was happy.
With a bit of luck on this podcast another three people just got that joke.
We'll put a link in there.
Which brings your annual audience total up to six. Seven if you include your wife.
I did that talk in Northern Ireland. Someone fell off the chair. There was one person in the room who got it and they fell off. It was hilarious. Anyway. So data, what is it? What size is it and how is it shaped? How many are you sending? And at that point you start getting a picture. And one thing I have learned is there's a lot of instances where the first question is, and please don't take this the wrong way, "Do you actually need Kafka?"
It's a very fair question.
Because volume... we're talking about a volume game here. And we're usually talking megabytes per second. And I know a lot of institutions, companies and clients that have not gone through that volume at all. And the question then remains, why are you doing it like this? What's the base reason for doing it? And some of them are legitimate. It's like, we have this source of data, but we have all these different departments that want to consume this data. Or there is the argument of, there's this data, we want to transform it. We want to do X, Y, and Z with it and we want to persist out here. So that's where I start. I try and get an end to end of what's actually involved. Because like I was saying at the start, Kafka's evolved over the last seven, eight years, especially with the amount of tooling that's been built around the core concepts of the brokers. So when I came on board originally it was, there's these brokers and there's these messages and they go through and you monitor Zookeeper with your life and then that's it. You may laugh. When it goes wrong it goes wrong.
No, I can believe.
So when things like the streaming API came along, it was like, well that means we can do this. And then there's transforms and filtering. Now it gets interesting. We can [crosstalk 00:14:04]. And then with Kafka Connect it's like, we can sink out to this. That's great, but that comes with a whole different set of considerations to make. Especially when there are side-effecting things like database connections and HTTP endpoints with rate limiting and all that kind of thing. So there's a lot of things that people don't talk about. We had clients that would persist out to Splunk, for example, via HTTP. And we used to just DDoS Splunk.
I can see how you'd manage that.
It's not the done thing. It's quite impolite. You lose friends very quickly. It's a lot of Christmas cards to write later on in the year, that kind of thing.
I think there's a connector for that.
There probably is now.
A Christmas card sink. That would be perfect.
Christmas card sink, or rather a source connector. The source connector of how many people I've offended this year will automatically create the Christmas card for me and send it out. I like the sound of that already. There's probably a talk in that somewhere.
Yeah actually. No, actually you probably technically could do that with something like Moonpig.
Hey, that'd be fab.
That would be a really good curve ball conference talk.
Keep talking. I've got a VC in mind already. That comes... anyway. So going back to cluster planning.
Start with the context. That's your number one tip.
Absolutely. And start with the customer, start with the customer context and be honest, if you think it's not a thing for Kafka then say so.
There should be a good reason for all the technology we use.
Absolutely. And I think pre-Confluent Cloud... obviously it's now a lot easier to scale something up on the cloud and be metered per gigabyte per month or however, but I'm so used to on-prem stuff where cost is a factor. So even a base cluster would be three ZooKeepers, so you've got quorum, three or four brokers, and then you'd have two Kafka Connect nodes, one active, one on standby, distributed to take the load. Same with Schema Registry, if you're into Schema Registry, you'd have those behind a load balancer, one active, one on standby. And then there's Prometheus monitoring, Grafana, alerting. And so the next thing, what seems to be a fairly simple thing, where you think, we'll spin up a Kafka cluster, has now become 15 nodes and is costing X amount a month.
If you want the whole thing to be redundant, then-
It's going to cost.
... those numbers add up.
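Those numbers do add up. A quick tally of the base deployment sketched above, where the component counts are the examples from the conversation, not a prescription:

```python
# Tallying the "simple" highly-available Kafka deployment described above.
# Component counts are illustrative examples, not a spec.
base_cluster = {
    "zookeeper": 3,        # minimum for quorum
    "broker": 4,           # three or four; four here
    "kafka_connect": 2,    # distributed mode, one active, one standby
    "schema_registry": 2,  # one active, one standby
    "load_balancer": 1,    # fronting Schema Registry
    "prometheus": 1,       # monitoring
    "grafana": 1,          # dashboards
    "alerting": 1,         # e.g. an alert manager node
}
total_nodes = sum(base_cluster.values())
print(total_nodes)  # 15
```

And that's before you double anything for a second data center.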
They do. I'm not going to say it's something we don't talk about. It's just something, I don't think, we highlight enough. Put it in the cloud. The five words that I hate most of all are, put it in the cloud.
You got to justify that, come on.
It's just a bunch of servers, isn't it? How can I put it lightly? No, the clients I've worked with have all got sensitive data, so it's not going to go onto the cloud. It has to be run on-prem with big concrete walls and men stood next to it, and people, not men, with spanners and axes and spades and stuff just protecting the data. I don't think that's actually how it works in banking, but it's [crosstalk 00:17:51].
I don't think that's how data security is generally done. I think you're thinking of Dungeons and Dragons, but that's another podcast.
So anyway, we rolled the D20 to see how many brokers we need and that was swift. You've got to admit that was swift.
I like that. I've heard of worse planning strategies.
It's not a bad one.
I'm going to push you onto some serious planning strategies, with broker numbers.
Four's always a good start. Three's a good starting point. Obviously you need a leader and a follower.
And then add another one, because if you've got two and the follower goes or the leader goes, you've got one leader but no follower. It's not going to end well, it won't end well. So three is obviously your magic number to start with. Same with ZooKeeper: quorum, three at the minimum.
Same with the monarchy, same with the monarchy, heir and a spare. That's what they say.
I'd never heard that before.
Have you not?
That used to be the rule for having Royal children, heir and a spare.
So that's why he went to California. Where were we before you said that?
Three being the magic number.
Three being the magic number. This is going so well isn't it?
We're roughly on the rails.
We are, I'm against them at the minute, but anyway. So three, three, back to three, that was it.
A leader and two followers: you can take a follower out and still maintain a cluster. It can still work. That's what we're after here. Then there are things like stretch clusters, two DCs. So data center A, data center B is fine. You can have two brokers on each and take a data center out and you'd be okay. There's topic considerations to take into account, minimum in-sync replicas of two and all that kind of thing. Which when we go through the planning, we go through all of that as well. The conversation then, really, when you have two data centers, is: what happens with ZooKeeper? So you do actually need a third data center for your quorum node. You can have two ZooKeepers in one data center, two in the other, and then the fifth, your quorum one, in a third data center somewhere. And whether this is on-prem or on the cloud doesn't really matter, it's just the numbers have to work out in that way. So that's the conversations we end up having first. And that [crosstalk 00:20:37].
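The replication arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming producers use acks=all and the standard min.insync.replicas behavior:

```python
# With replication factor 3 and min.insync.replicas=2, how many broker
# failures can a topic absorb while acks=all producers keep writing?
def tolerable_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers you can lose before acks=all producers start getting errors."""
    return max(replication_factor - min_insync_replicas, 0)

print(tolerable_failures(3, 2))  # 1: lose one broker, keep producing
print(tolerable_failures(2, 2))  # 0: any broker loss halts acks=all writes
```

Which is the long way of saying why three is the magic number to start with.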
Do you always do it that way? Would all your clients want that level of high availability?
Always.
Fair enough.
It's the nature of the data. And this is it. So going back, when I was working with Bruce, for example, we put a project together for a company. It was a proof of concept. I've done a talk about it actually, another Confluent meetup talk. I can't even remember what I called it. Anyway, this was originally the, How I Bled All Over Onyx, talk. This was a talk about putting this cluster together that completely failed. I rewrote everything in Kafka Streams. I didn't tell Bruce about it till the day after, because that's what you do with CTOs, you just don't tell them.
Need to know basis.
Absolutely. I'll tell you what you need to know, it's working now. Now I'll tell you.
I can neither confirm nor deny that I've employed that strategy.
That says it all really doesn't it?
We've all been there. We've all been there.
We've all been there. It's the nature of the job. It's just, we don't tell customers, oh, too late. So we had Onyx jobs running. They were failing and it was my fault entirely. I was just pushing too much data through, and these files were Gzip files of CSV data. Now you don't know how much CSV data, because of the nature of the data. It was flight search data. So you don't know if it's a small airport or a big airport, so you might get three rows or 20,000 rows of Gzip. So therefore your message size concertinas in and out per message. And Onyx couldn't hack it, and I'm not surprised Onyx couldn't hack it. And I spoke to Michael and Lucas about this quite a lot. And bless them, they spent the time with me going through everything. And I learned so much about Clojure and the Aeron binary transfer protocol, which is also well written. And there were all these little settings. This is how you get into the nitty gritty, and it's the same with Kafka: when you start, for want of a better phrase, bleeding over everything, and it's broken and you're trying to figure out what's going on, this is when you do most of your learning.
That's exactly how I learnt Nix: by bleeding all over it, and scar tissue.
Exactly. And this is how I learned Kafka. It was Coface stuff. And like I say, at that time, the cluster revolved around the offsets being stored on ZooKeeper, which was also another complete pain point for Kafka 0.8 back then. So ZooKeeper would fill up and you'd go, "What's going on? Why has it all died?" And it was in MISys, so all we did was just tear the whole thing down, bring it all back up again. Anyway, the point I'm getting to is SLAs, basically. The SLA for this proof of concept was only nine till five, Monday to Friday. Which is all very well. That's what they're paying for. It doesn't bother me. They just have to be aware of, what happens when something goes down? If the database connection or S3... because S3 drops, Amazon endpoints drop, they do, it's a fact of life. What do you want to do if this happens? And we sat down and went back to them. Bruce and I went back and said, the biggest period of time you've got between something failing at five o'clock is Easter. So you got Thursday... sorry, you got Thursday night, Friday, Saturday, Sunday, Monday, and then we're back in the office nine o'clock Tuesday morning.
If it was failing between that point and that point, you want to save the data. And we calculated the amount of storage that they'd need for that volume. And the volume they were predicting, because of the nature of the data, was 12 terabytes a day.
Yes. It was quite a lot.
This is all flight data?
Yes. It's all flight search data. So I won't say from whom, and I won't say for whom, but that's what it was. So it was... and then you've got replication to take into account and... the next thing is we need around about 280 terabytes for four days. That's going to cost.
That's in 2016.
Yeah. 2016, 2017.
Even today that's a cost, but back then.
It's a cost and it has to be taken into account. If you had a problem one minute past five on a Thursday night, you have 280 terabytes to deal with when you power everything back up on Tuesday morning, and then it's all got to go through again, so you've got this back pressure problem as well. So it's all those kinds of considerations. We've moved on now. Obviously topic data's held on volumes, not in ZooKeeper, but you still have to plan that kind of thing. I have alerts set up at 50% disk volume because I'm paranoid. Because what we found at Digitalis is if someone does stress testing, they don't know what they're stress testing against. If you default to seven days' retention, you can fill up brokers really quick just by stress testing. I'm going to send a million messages through, even though my prediction is 4,000 a day. Fine, but you're going to fill my test brokers up and I'm not going to be too chuffed about that. And it has happened.
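That outage-window storage math is worth sketching. A back-of-envelope calculator, where the 12 TB/day figure is from the story above but the replication factor and headroom are illustrative assumptions, so the result won't exactly match the ~280 TB quoted:

```python
# Back-of-envelope: how much raw disk buffers a long-weekend outage?
# daily_volume_tb comes from measuring; the rest are planning assumptions.
def storage_needed_tb(daily_volume_tb: float, outage_days: float,
                      replication_factor: int, headroom: float = 1.2) -> float:
    """Raw storage to buffer an outage: volume x days x replicas x slack."""
    return daily_volume_tb * outage_days * replication_factor * headroom

# Thursday 5pm to Tuesday 9am is roughly 4.7 days:
print(round(storage_needed_tb(12, 4.7, 3), 1))  # 203.0
```

The exact figure matters less than the shape of the formula: every replica and every day of buffered backlog multiplies the bill.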
I suppose it's in the nature of stress testing that you're trying to find out what breaks first?
Yeah. Usually my resolve. I'm good for a quote today. It's those kinds of things you need to take into account. And the stress testing does come with its own set of problems. And now we know in Kafka there's the producer performance test tool and the... sorry, and the consumer performance test tool. And they're great, but something I said in the very first podcast, which I think might have surprised Tim at the time, is not to assume that everyone knows how Kafka works. You have software production teams that are building apps to produce and consume from Kafka, but they may not necessarily know how it works. You don't need to know how it works. It's like, here's my producer, it sends a message. I want to send 20 million and see what happens. Do I get acks back for all of those or is it just fire and forget, those kinds of things. And that's all very well, but it would be handy if you told us.
So you've got to do capacity planning for the people that don't know what they're actually sending?
Absolutely. So what we ended up doing at Digitalis was a sheet of message size. How many a day? Is there a peak time? This is my other giant slide from the meetup was this graph of volume over time or volume over... anyway. Volume against time. Let's call it that. And I think the misconception is when you're using Kafka, everything is just flat like that, it's streaming data platform. It does this. It's all over [crosstalk 00:28:25].
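That sheet of message size, count per day, and peak time boils down to a throughput estimate in megabytes per second. A hedged sketch with made-up example numbers:

```python
# Turning the "message size / how many a day / is there a peak" sheet
# into a broker throughput figure. All inputs below are illustrative.
def peak_mb_per_sec(messages_per_day: int, avg_msg_bytes: int,
                    peak_factor: float = 5.0) -> float:
    """Average MB/s over the day, scaled up by how spiky the traffic is."""
    avg_bytes_per_sec = messages_per_day * avg_msg_bytes / 86_400  # secs/day
    return avg_bytes_per_sec * peak_factor / 1_000_000

# 20 million 1 KB messages a day, with peaks at 5x the daily average:
print(round(peak_mb_per_sec(20_000_000, 1_000), 2))  # 1.16
```

The point of the peak factor is exactly the graph being drawn here: traffic is never flat, so you size for the spike, not the average.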
Yeah, of course.
It's all over the shop, unless it's [crosstalk 00:28:27].
And different industries have different natural peaks.
Exactly. And so I'm using the full width of my screen here as a graph, X and Y. So [crosstalk 00:28:39].
That's going to be great for people listening to this.
It is isn't it?
Jason is drawing a graph with his finger folks.
I'm very sorry. I'll stop doing that because that's just not really going to work. Retail, for example, you'll have nothing till nine o'clock in the morning, you'll have peak traffic at three o'clock or lunchtime or what have you and then it drops off at six o'clock in the evening.
Do know what I was thinking about recently, which I just don't know how they planned this, is the music festival Glastonbury?
It's immense isn't it?
Yeah. They sell nothing all year and then there's a 10, 20 minute window where they sell 200,000 tickets.
And it's the most bursty traffic I've ever heard of.
So I wrote a blog post in 2002 on that.
And I can't even remember what I said. It was about Ticketmaster and it might have been Michael Jackson tickets, because people were struggling to get them, because obviously systems just break. If ever there was an advert for elastic computing, it's that.
2002 though, you didn't have that many options?
No, you didn't, you didn't, you really didn't.
I remember 2002 as some people still advocating shell scripts as your main web server. That's how dire things were.
I've never come across that before.
And your traffic was limited by the rate at which you could fork a new shell.
I've never come across that before.
Afraid so, afraid so.
Wow, I'm still a Perl hacker at the end of the day as well. No, I'm serious. For big search-and-replaces in large files I will still crack open Perl, I won't use Oracle. I still use Perl, it's just in my head.
So anyway, it's this elasticity in terms of volumes going through. Banking's an interesting one. It's like a [inaudible 00:30:34] but it just goes whoop and that's it, it's [inaudible 00:30:36] for the day. A huge burst of traffic through batch times. And are we treating this as batch or are we treating it as streaming? And traditionally you'll find a lot of organizations will just dump data into Kafka in one massive swoop, then let the brokers deal with the backlog and process everything through. Fine, it's fine. It works. Things like IoT data, car data, is just a huge volume game, streaming continually while the car is moving at high velocity. There's loads of it.
Not just the car, but the data coming out of the car is at high velocity.
Yeah. Telemetry data coming from cars is just insane. Loads of [crosstalk 00:31:24].
I can believe.
And there'll be more of it. [crosstalk 00:31:24].
And that's two very bursty rush hours, that's got to spike massively.
Bursty hour, edge computing basically.
So what have you done to deal with those sorts of spikes in different industries?
I usually end up weeping in a skip.
It's a sound strategy. I was hoping for something slightly more technical.
It goes back to testing. It does go back to testing. We sat down with teams and said, "You're going to have to stress test all of this stuff out." But what we do is we change the retention time of the topic so you aren't impacting the volumes on the disks. So I don't delete topics, I flush them out. What I'll do is I'll change the retention time down to three milliseconds, wait for Kafka to do its work and clear everything out, and then bring the retention time back up to something sensible.
That's an interesting strategy. So for stress testing, you just make it a short lived but identical system?
An hour. Did the data go through, did the consumers on the other side pick it up? It's those kinds of testing strategies. And also what I've impressed on development teams is when they're writing... so what I've found is, especially in banking, obviously all these different departments, it might be credit, it might be mortgages, it might be... all this stuff that the department writing a producer will not necessarily be testing what's coming out the other side.
And it's the little things. It's not so much the technicalities of Kafka at this point, it's just human contact. It's, "Jase, can you tell me how many messages have gone through?" No I can't. I can't give you a definitive number. I can do that per partition, yes, but I can't give you an absolute total. There's only one way really to do that. Robin Moffatt did a post on it, which was, use ksqlDB to stream in and do an aggregated count on the messages going through.
Count them as they go through.
That's the way to do it, but I can't do that. Plus it's banking data, the nature of the messages. I'm actually not privy to the data and I should not be privy to the data. I can't be. I've signed NDAs, that kind of thing. So I can't see Mrs. so and so's mortgage payments go through. I just don't see that. I'm not allowed to. So the responsibility's on both teams to measure what goes out and what goes in and that's another conversation as well.
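The per-partition numbers mentioned above are the one thing you can get from offsets alone, without ever being privy to the message contents. A sketch of that arithmetic, with hypothetical offset values:

```python
# Approximate per-partition message counts from offsets alone, without
# reading message contents. The offset numbers below are hypothetical.
def message_counts(begin_offsets: dict, end_offsets: dict) -> dict:
    """Per-partition count = end offset minus beginning offset.

    This is only an approximation of an upper bound: retention, compaction,
    and transaction markers can make the true live total lower, which is
    why you can't promise an absolute figure."""
    return {p: end_offsets[p] - begin_offsets[p] for p in end_offsets}

begin = {0: 1_000, 1: 950, 2: 1_020}
end = {0: 5_000, 1: 5_100, 2: 4_980}
counts = message_counts(begin, end)
print(counts, sum(counts.values()))  # {0: 4000, 1: 4150, 2: 3960} 12110
```

Hence the appeal of the streaming-aggregation approach: counting messages as they go through is the only way to get an exact answer.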
Because that is a tricky thing to watch for. There's a great thing that you can decouple producers and consumers. One department can write, many other departments can read, but it does introduce the problem that you've actually got to have someone in the writing department checking that it's useful.
Yes. Is the data correct? Did it deserialize properly? And once you start putting these things in place, then latency becomes interesting as well because then have you got performant consumers? Do you have enough consumers running against the partitions? If I want... and we don't think... while the main metric for Kafka is megabytes per second, I don't think developers think of it that way. They think of how many messages I can process, but they don't think in megabytes per second or something like that. So there's no real consideration about network traffic going through in terms of saying, I need to process three megabytes a second or a gigabyte per second or whatever. I need 50 consumers. No one really thinks like that. I can guarantee you, most people will only put 10 consumers up at the most and the partition count will be fairly low to start off with. Because obviously you can go up, but you can't go down. These all feel like old school things, but they're still really important to me.
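That megabytes-per-second way of thinking about consumers can be sketched directly. The per-consumer throughput figure here is an assumption you'd have to measure for your workload, not a Kafka constant:

```python
# Sizing a consumer group for a throughput target in MB/s, capped by the
# partition count, since each partition is read by at most one consumer
# in a group. Per-consumer throughput is a measured assumption.
import math

def consumers_needed(target_mb_s: float, per_consumer_mb_s: float,
                     partitions: int) -> int:
    needed = math.ceil(target_mb_s / per_consumer_mb_s)
    # More consumers than partitions just sit idle:
    return min(needed, partitions)

print(consumers_needed(50, 3, 30))  # 17: enough consumers to keep up
print(consumers_needed(50, 3, 10))  # 10: capped, add partitions instead
```

The cap is why the partition count matters so much up front: you can add partitions later, but you can't take them away.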
Because I think as programmers, we have a tendency to assume that it's an entirely logical level analysis until we get problems. The network is perfect until you find it isn't and then you start thinking about it.
Yes. So I always think about latency in terms of, you've got compression on the producer side. If the compression doesn't match on the consumer side, then that adds latency. It's also fun from a broker's point of view because it has to obviously decompress the message, or invite the metadata [inaudible 00:36:15] key, add a timestamp, compress it back down again, then send it through. Anna McDonald and I have spoken quite a lot about this. And then you obviously got schema registry stuff. And then we get to Kafka Connect.
The sink side is an interesting conversation completely.
And for those of you not watching, Jason has just completely broken down at the thought and fallen off his chair. And he's back.
Tell me about that?
I'm here. I'm okay. Source connectors are not a major problem. Never really had any problem with source connectors and things like the... the main connector I ended up working with obviously was the JDBC connector. People like reading stuff from databases and sending it to somewhere else. It seems to be a hobby.
It's an obvious use case.
It is, it is, absolutely. So Kafka Connect, sink side, there's a bunch of considerations to take into account. Not necessarily on the capacity side, just generally. If it's schema'd, then what do we do on message failure, what do we do on conversion failure? Because obviously if you're running DLQs, if you're running dead letter topics, [crosstalk 00:37:44] that's a different argument. What are we going to do with those messages if they end up in there? I know a lot of people that put messages in dead letter queues and then never read them again.
Find out six months later, what are these 200,000 messages doing?
And also they were quite important, because it was a process of open account, send money to account, close account. We didn't realize they closed the account. It's really interesting. So the two conditions, and I think the API may have changed since, I know there was a KIP for it, I'm fairly sure there was. Conversion and transformation were kind of the two stages where, if there was an exception thrown, then the message would go to the DLQ, fine. Question one: what are you going to do with the DLQ? Are you actually going to read it or are you just going to ignore it? We used to set up Prometheus alerts on the DLQ, so if more than 10 messages went in, it would actually alert us and say, there's a problem here. And the reason for that was if you have something side-effecting, like writing to a database table, your connector's obviously connected to the database via JDBC. Networks fail.
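The Prometheus alert described above amounts to a simple threshold rule on the dead letter topic. A trivial sketch, where the threshold of 10 is the example from the conversation, not a recommendation:

```python
# The "alert if more than 10 messages hit the DLQ" rule as a predicate,
# as you might encode it behind a Prometheus-style alerting check.
def dlq_alert(dlq_message_count: int, threshold: int = 10) -> bool:
    """Fire once the dead letter topic count passes the threshold."""
    return dlq_message_count > threshold

print(dlq_alert(3))    # False: a few failures, keep an eye on it
print(dlq_alert(200))  # True: something systemic, e.g. a dead JDBC sink
```

The value isn't the logic, it's that somebody actually gets paged instead of finding 200,000 unread messages six months later.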
They do, they very much do.
And they don't necessarily come back up the way you want them to. And therefore Kafka Connect is flailing around and going, "Yeah, yeah." Because errors.tolerance equals all within the Kafka Connect configuration is great, but everything just carries on like nothing had happened. And there was no way of capturing it. Now this may have changed, hence me mentioning a KIP, because I remember seeing it. That when you push your message to the sink, if it threw an exception there, you really didn't have much choice on what was going to happen next. If your error tolerance was all, it would just process the batch and keep going. So if the database was throwing an SQL exception back, it didn't care, the offset would get updated and the data wasn't written to the database. So anyway, retry logic, yes, that's fine. But you can go through 10 or 20 retries pretty quickly and the whole thing's collapsed.
If you let your retries carry on over a minute, then obviously you're starting to get back pressure from the brokers. Not a big deal, but what I found is monitoring Kafka Connect is not everyone's biggest priority. They tend to just focus on the brokers and not the connectors. So when a connector fails, which is actually another interesting point, the logging for Kafka Connect is central to the node, it's not at the worker level. If we could branch out worker-level metrics and worker-level logging, it would be far more effective. Because then I can look at the logs for that worker and go, it failed because of this. Because what happens, say you have 20 workers running on Kafka Connect, your log files are massive at that point. And then to go and root through. I know people that won't pay for Elasticsearch. So my superpower is to be able to read through raw log files with grep.
Everyone has that in their back pocket, sooner or later.
You need it. You need it. And my advice to anyone is to go with the tools that are available to you for free, without paying, and learn those first. Learn the CLI, learn to read the logs in a terminal window when you need to. Because when all the cloud services vanish and disappear and that's all you've got and you need to fix it, that's where you start. So I've always made those things my friend first. And me and Kafka Connect logs are really friendly now, we know where to look and we know where to go. We know where things are hidden. And bringing up connectors and just watching everything go through to make sure everything settles down is actually quite an important job. And it's one that not many people talk about that I'm aware of. Or if they do, they don't blog about it or they don't shout about it or anything like that. They're probably just weeping in the same way that I do.
Do you have any good tips for monitoring connectors?
Bash scripts do work. No, I'm not joking actually. I would write a bash script to go through the active connector names. So was it /connectors? Yeah, curl to just /connectors, and that'll tell you the connectors that are running. And then use sed to parse out those names, the connector names themselves. So it meant if connectors were being added and removed, then you weren't having to update files all the time. That was the reasoning behind it. And then do a for loop on each of those connector names, do a curl again. So /connectors/ the name /status, and that would tell you how many tasks are running and the state of those tasks and the state of the worker. And they should say, "Running." If they say, "Failed," something's happened. So we had a job that would do that every two minutes. And then if it failed, we got alerted and we could go and have a look. And that was the best time to go and have a look at the logs, because otherwise you're 30,000 lines down the chute at that point. Then figure out whether to halt the connector, tell the client, what do they want to do? Because interestingly we found in a couple of cases, once a year this would happen: database tables fill up.
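The script being described could be sketched roughly like this — the Connect REST address is an assumption, and the parsing (sed/grep, as in the spirit of the original) assumes the compact JSON the Connect REST API returns by default:

```shell
#!/usr/bin/env bash
# Sketch of the monitoring loop: list the connectors, then curl each
# one's status and flag anything not RUNNING. Run it every two minutes
# from cron and wire the output into your alerting of choice.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Turn the JSON array ["a","b"] from GET /connectors into one name per line.
parse_names() {
  tr ',' '\n' | sed 's/[][" ]//g'
}

# Count FAILED states in a GET /connectors/<name>/status response
# (covers both the connector itself and each of its tasks).
count_failed() {
  grep -o '"state":"FAILED"' | wc -l
}

for name in $(curl -s "$CONNECT_URL/connectors" | parse_names); do
  failed=$(curl -s "$CONNECT_URL/connectors/$name/status" | count_failed)
  if [ "$failed" -gt 0 ]; then
    echo "ALERT: connector $name has $failed FAILED component(s)"
  fi
done
```

Because the connector list is fetched fresh each run, connectors can be added and removed without the script ever needing updating — which was exactly the reasoning described.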
Yes, they do. I've had that a few times.
So your connectors are fine and writing data around, but the database isn't getting your data at all. So we used to get a request once a year about rolling back the offset to when the fault happened on the database side and then having to [crosstalk 00:43:58]. It does happen. And that's an art form.
And this is the thing, you can test Kafka, you can test the connector, but sometimes testing the external database having problems is really tricky.
Network failure, always test for network failure. Network failure on the brokers, network failure on ZooKeeper, network failure on any side-effecting databases, sinks, sources, whatever. This is it. And it goes back to what I said at the start. If we're talking about end to end, Kafka's the bit in the middle, as far as I'm concerned. It's a tool, it's a great tool. I love it. It's done me well as a career for the last six, seven years. But we have to think about the end to end and everything that affects it. And even coming back to capacity planning, now we've got the transaction API, which obviously creates the topic which then is replicated, and there's more partitions, therefore it's more data. If it's a high-volume topic with transactions, then you have to take that into account as well. RocksDB on stateful transformations in the Streams API and in KSQL. Global KTables copy across networks. Not like, yay. There's a lot not talked about that we learn the hard way. And I'm only scratching the surface based on this conversation, I think. And it is a case-by-case basis. There has to be a bunch of KPIs to say, we need to know that... most important metric to monitor: consumer lag.
Why is that the most important?
I want to know how far behind my consumers are. Because if they are far behind... and we're talking, not the difference between one or two messages, that I can tolerate. It's when it's 40,000 messages, that usually says to me there's a problem. So bash scripts for Kafka Connect, connectors, workers, but also run the consumer group list job and describe each consumer group, and that will give you the clients running, the offset positions and the lag. And I used to do that every day. That's a cron job to run every day and it would email me so I could go-
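That daily lag report could be sketched like this — the broker address and mail recipient are assumptions, and the column position of LAG in the describe output can vary between Kafka CLI versions:

```shell
#!/usr/bin/env bash
# Daily consumer-lag report: list every consumer group, describe each
# one (client ids, offset positions, lag per partition), email it out.
BROKER="${BROKER:-localhost:9092}"

# Pull the biggest LAG value (column 6 in recent CLI versions) out of
# kafka-consumer-groups.sh --describe output, skipping the header line.
max_lag() {
  awk 'NR > 1 { if ($6 + 0 > m) m = $6 + 0 } END { print m + 0 }'
}

if command -v kafka-consumer-groups.sh >/dev/null 2>&1; then
  REPORT=$(mktemp)
  for group in $(kafka-consumer-groups.sh --bootstrap-server "$BROKER" --list); do
    kafka-consumer-groups.sh --bootstrap-server "$BROKER" \
      --describe --group "$group" >> "$REPORT"
  done
  echo "Max lag seen: $(max_lag < "$REPORT")"
  mail -s "Daily consumer lag report" ops@example.com < "$REPORT"
fi

# Crontab entry to run it daily at 07:00:
#   0 7 * * * /usr/local/bin/consumer-lag-report.sh
```

Everything goes through the stock CLI tooling, in keeping with the "don't touch the cluster, just observe it" approach described next.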
You ran it in batch.
Well I'm not allowed to touch the cluster. My job is to make sure the cluster's up not write to it.
Black box testing of Kafka clusters.
My methods may seem archaic, but that's the only way we could do it.
I know there's someone who's mentally rewriting that script in Python right now. And there's someone else wondering why the API doesn't use GraphQL and you can get it all in one query. But in the meantime, a bash script will do the job.
I could say something really controversial, but I'm not even going to go there because I will never get left alone.
You've dangled it out now.
I have, haven't I. GraphQL, really? Sorry, said it. Python, really?
Oh, that's a much harder one to argue with when you're championing bash scripts. Let's not descend into a language flame war because that's a-
I learn whatever I need for the job. At Dataworks I'm learning Go. I won't knock Go now, I really like it. It took my brain a bit of a left at the lights because it's been so long doing functional programming with Clojure.
You've got to go back to imperative.
Yeah. It's like, oh this is a bit weird. I actually like it, really like it, anyway.
Just for balance, for journalistic balance, I'm going to say, Go, really?
Really. Rust, really?
I've had some fun with Rust. I haven't dabbled in Go yet, but I've had some good fun with Rust.
You're a big TypeScript guy aren't you?
Most of the talks I've seen you do recently they've involved that kind of thing haven't they?
Do you want to know a secret?
I installed Emacs on Windows yesterday.
Different listeners will find different parts of that to be sinful.
I don't care. Anyway, where were we?
We were about to talk about the problems with streams because that's another part of the picture you haven't talked about.
Streams are just as interesting as Kafka Connect really, because their error tolerance is defaulted to all. It's very difficult to stop a stream. You have to have a really, really good reason for doing so. So DLQs work in the same kind of way. Now there's an exception mechanism to decide what you're going to do with a message, but it's that long since I've looked at it. This is the interesting part of not doing a huge amount of development work for so long, is when we did Kafka Jeopardy, so that was Neil Buesing, Anna McDonald and myself. They knew all the development stuff and I knew nothing, weird. But they didn't know the port number for ZooKeeper. So I came last, but I don't care.
This is why it's an interesting perspective though because you're coming at it from a, it's got to stay up, but it's almost a black box to you.
It is a black box to me. It is a black box to me. And it's like, can we have a look at this message? And it's like, no, go and consume it. Some support tickets, people just don't like me. They think I'm being difficult. It's like, no, this is really hard to do. Running Kafka with a set of blinkers on basically. It's like that. Sorry, for those watch... not watching, I'm putting my hands up in blinker fashion.
He's miming a horse.
Neigh. That could have gone a bit left there, Kris, to be fair.
Oh, I'm not sure it could.
Go on, you can do it. I've totally lost my train of thought now. Streams, streams, that was it.
That was it. That was it. Streams has been difficult. I've found it difficult. Actually Neil Buesing's probably a really good person to ask, and so is Anna, on how they would handle it. So Anna, Neil, how would you handle it?
We'll splice their answer in later.
Leave a comment.
Let me ask you this then, do you think... have you found that streams or KSQL work better from an operational point of view?
I prefer streams.
Not the answer I was expecting.
I see the logic and the appeal of KSQL, do not get me wrong. I like streams in small container services run separately. Because if you have a stream that's hogging the resources of a node, the other applications tend to struggle. You may have a network-heavy streaming job, which may also be CPU-heavy depending on the transformations and things that are going on. And if you have another one that's just doing some fairly basic aggregations on a fairly low-level, low-velocity topic, it gets put at the back of the queue.
So you're saying that it's easier to manage individual streams processes if you split them out that way?
Yes, I think so.
Because with KSQL it's all sitting there in one lump.
And you have that same problem, that if you have a KSQL query that's really intensive, then it tends to hog the resources.
I had not thought of that, but there is that to be balanced with the convenience of KSQL?
It is there. I won't call it an issue, it's just a consideration. And it all comes back to design again. It's knowing the data that's passing through the system. And if you can start with that and you know the domain. And actually, case in point: if you bring domain knowledge in rather than assuming, you can have an easier ride with all this. If you're in retail, how many transactions come from a POS system per hour? Oh right. How many basket items is that? It's this many and it's this size. We can model that then, we can work it out. Even with a calculator you can work it out.
I would've thought a lot of customers don't actually know that data.
Some do, some don't. But there must be a way of finding out. The data's got to come from some point. There's a starting point somewhere. And it's just a case of knowing which person to ask. Not a case of just not knowing. Let's take an example. POS data is actually probably a good one. Within a basket there's N number of items. That could be 2K in size or 10K in size, but we've got a lower and upper bound that we can work with. I'm now dancing with my hands, sorry, but we've got a lower and upper bound of size. And even if you then say, let's double it so we've got some contingency. This is another thing where people go, it's not going to be any more than 10K. That's fine, and then we start seeing 30K messages being pushed through.
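That calculator exercise is small enough to do in a one-liner. A sketch with entirely illustrative numbers — the transaction rate is assumed, the upper bound and the doubling follow the example above:

```shell
# Back-of-envelope sizing: take the upper bound on message size,
# multiply by an assumed transaction rate, then double for contingency.
TX_PER_HOUR=50000     # assumed POS transaction rate, illustrative
MAX_MSG_KB=10         # upper bound on basket message size, from above
CONTINGENCY=2         # "let's double it so we've got some contingency"

worst_case_mb=$(( TX_PER_HOUR * MAX_MSG_KB * CONTINGENCY / 1024 ))
echo "worst case: ${worst_case_mb} MB/hour"
```

The point is less the arithmetic than having the lower and upper bounds written down at all, so a 30K message stands out as a violation of the model rather than a surprise.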
This is the old thing of order of magnitude planning.
Exactly. So I always put in a 50% minimum contingency for sizing, especially when it's high volume and across a large cluster. And the other thing to keep in mind is... I've probably said that a hundred times in this podcast so far: the other thing to keep in mind, the other thing to keep in mind.
There's a lot of things to keep in mind.
There are a lot. Something to write down then, and not keep in mind, because that's the worst place you can keep it. If it's a shared service, you have multiple departments writing to the same Kafka cluster. Some departments are going to write more data than others. And it's very difficult to know with the performance of the cluster when there's so many different tenants of that cluster. Not everyone has the luxury of having their own cluster, basically. Quotas are important, but that's a post-production thing. You measure it afterwards. Anna might actually argue with me on that one because she's a quota queen. Let's call her the quota queen.
I'll check with her how she feels about that nickname.
She's going to throttle me after that one.
That seems like a quota issue, throttling. I went there.
You went there. You went there. That was really good. That's good. This is turning into The Mighty Boosh, isn't it? It's just like, oh. So quotas are important, but I find them very difficult to monitor in the first instance. You won't know until the team's gone live into production. And then you're looking at the network traffic. Hence Grafana is really important.
For getting those messaging graphs in real time.
Bytes in and out. And seeing how things actually look, that will govern the quotas. So topic-level quotas don't exist anymore. Well, they're being deprecated, so it's really on the producer and the consumption side. Mainly on the producer side, to stop all that stress-testing volume going through. You do get bursts. You have to plan for bursts happening. At least with the quotas, it will throttle, it won't just shut down. You have to give it a fair amount of data in order for a quota to completely block you out and throw an exception. But that's where the performance producer tool comes in on the command line. You can at least say, I want a million messages every 10 seconds of a hundred K in size, go at it and see how it performs.
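Both halves of that — the quota and the burst test — can be sketched with stock CLI tools: `kafka-configs.sh` to set a client byte-rate quota, and `kafka-producer-perf-test.sh` to push a burst through and watch the throttle kick in. The broker address, client id, topic name, and every figure here are illustrative:

```shell
#!/usr/bin/env bash
# Sketch only: set a byte-rate quota for a client, then burst-test it.
BROKER="localhost:9092"

if command -v kafka-configs.sh >/dev/null 2>&1; then
  # Throttle a hypothetical "analytics-loader" client to ~10 MB/s in
  # and out. Quotas throttle rather than shut the producer down.
  kafka-configs.sh --bootstrap-server "$BROKER" --alter \
    --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760' \
    --entity-type clients --entity-name analytics-loader

  # "A million messages ... of a hundred K in size, go at it":
  # 1,000,000 records of 100 KB, capped at 100,000 records/sec.
  kafka-producer-perf-test.sh \
    --topic load-test \
    --num-records 1000000 \
    --record-size 102400 \
    --throughput 100000 \
    --producer-props bootstrap.servers="$BROKER" \
      client.id=analytics-loader acks=all
fi
```

The perf tool reports throughput and latency percentiles as it runs, so you can see directly whether the quota is throttling the burst the way you expect before a real tenant ever does it in production.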
To wrap this up into a TLDR, if we can, we're saying: as much as you can, get the picture of your data and your data model before we even begin?
Absolutely, absolutely. It's an end-to-end process. It is a process. I like to think of it as a graph. Point to point to point to point to point. And a directed graph is nice in the sense that things will split out. You might have a process where the data splits out in KSQL into a new stream, or a streaming API job into another topic, that kind of thing. And they all have considerations down the chain. So things like RocksDB on the streaming side. And if you're creating a new topic out of a streaming API job, how's that replicated, how's that partitioned, what's the throughput of that, what's the volume of that going to be? Is it going to be a 10th of what's going through the streaming API job? We built some streaming API jobs to look at source data, because it was say 10 million messages a day going in and we were only interested in seven of them. It does happen.
I can believe that.
There's two options at that point. You're either going to say, I want to filter it with a streaming API job or I'm going to let Kafka Connect do all the hard work. I wouldn't want Kafka Connect to be consuming like that. I'd let the streaming API job do the work for me and then only persist what I need to persist.
Until the day that you find that there's another type of message that you need to filter out as well.
But it's a new evolutionary process at that point. It always is. I don't think anything sits still for too long. Data [crosstalk 00:59:01].
The whole of software is reacting to changing requirements.
Precisely. Precisely. And these requirements can come... it might be a fortnightly sprint. I know some projects that take two years to turn around. It's the nature of the business really more than whether we do agile or anything like that.
So know your domain, plan it out as far as you can, load test it before you put it in production and write some bash scripts. Those are your four tips?
Now you've put it like that, I'm going to reconsider my career options. Thanks.
If the bash scripts don't work out, you can always play those Chapman sticks in the background.
I love playing them. They're great.
We'll get you back in to record the outro music, but for now Jason Bell, thank you very much for joining us.
Kris, it's been an absolute honor. It's good to see you.
Cheers to you.
And there we leave it. You know, I'm going to confess, I have this dream that one day, maybe at a future Kafka conference, Jason and I will be on stage playing a gig to close out the night and he'll be on the Chapman stick and I'll be on the Theremin. I don't actually play the Theremin yet, but I can buy one and I will learn it if there is that slot available, you got to have a dream. For now, we're going to have to leave Jason to his cluster maintenance. Before you go, if you've enjoyed this episode now is an excellent time to let us know. My Twitter handle's in the show notes or you can always leave a comment or a like or whatever. We always appreciate hearing from you.
Between now and the next episode, if you want to learn more about Kafka and how it can help you get your data moving Confluent Developer is here to help you. Head to developer.confluent.io, where you'll find everything from getting started guides to architectural patterns. It's all free and it's written by some truly great minds that I'm very happy to work with, so check it out. And if you feel ready to use Kafka in production, but you don't want to manage it yourself, head to Confluent Cloud, which is our managed service for Apache Kafka. You can sign up and get a cluster running in minutes. And if you add the code, PODCAST100, to your account, you'll get $100 of extra free credit to run with. And with that, it remains for me to thank Jason Bell for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
How do you plan Apache Kafka® capacity and Kafka Streams sizing for optimal performance?
When Jason Bell (Principal Engineer, Dataworks and founder of Synthetica Data), begins to plan a Kafka cluster, he starts with a deep inspection of the customer's data itself—determining its volume as well as its contents: Is it JSON, straight pieces of text, or images? He then determines if Kafka is a good fit for the project overall, a decision he bases on volume, the desired architecture, as well as potential cost.
Next, the cluster is conceived in terms of some rule-of-thumb numbers. For example, Jason's minimum number of brokers for a cluster is three or four. This means he has a leader, a follower and at least one backup. A ZooKeeper quorum is also a set of three. For other elements, he works with pairs, an active and a standby—this applies to Kafka Connect and Schema Registry. Finally, there's Prometheus monitoring and Grafana alerting to add. Jason points out that these numbers are different for multi-data-center architectures.
Jason never assumes that everyone knows how Kafka works, because some software teams include specialists working on a producer or a consumer, who don't work directly with Kafka itself. They may not know how to adequately measure their Kafka volume themselves, so he often begins the collaborative process of graphing message volumes. He considers, for example, how many messages there are daily, and whether there is a peak time. Each industry is different, with some focusing on daily batch data (banking), and others fielding incredible amounts of continuous data (IoT data streaming from cars).
Extensive testing is necessary to ensure that the data patterns are adequately accommodated. Jason sets up a short-lived system that is identical to the main system. He finds that teams usually have not adequately tested across domain boundaries or the network. Developers tend to think in terms of numbers of messages, but not in terms of overall network traffic, or in how many consumers they'll actually need, for example. Latency must also be considered: for example, if the compression on the producer's side doesn't match the compression on the consumer's side, latency will increase.
Kafka Connect sink connectors require special consideration when Jason is establishing a cluster. Failure strategies need to be well thought out, including retries and how to deal with the potentially large number of messages that can accumulate in a dead letter queue. He suggests that more attention should generally be paid to the Kafka Connect elements of a cluster, something that can actually be addressed with bash scripts.
Finally, Kris and Jason cover his preference for Kafka Streams over ksqlDB from a network perspective.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.