November 16, 2021 | Episode 186

Handling Message Errors and Dead Letter Queues in Apache Kafka ft. Jason Bell

Tim Berglund:

Delighted to welcome Jase Bell back to the show today to talk about dead letter queues in Kafka. Do you need them? How should you use them? What are the parameters around using them in Kafka? We cover this. Jase is a guy who knows. He sees Kafka running in production all the time because that's his job. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. Confluent Developer's website has, in my opinion, everything you need to get started learning Kafka and Confluent and everything in the related ecosystem. You've got great courses there to get you started, this podcast, you can get a list of episodes of this podcast, there's a library of streaming design patterns, all kinds of great things there. So check it out, developer.confluent.io. Now on to the show.

Tim Berglund:

Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund and I am joined again by returning guest, Jason "Jase" Bell. He is of the famous conceptual Twitter series Morning Morning (Yes, I Have Tea), and senior DevOps engineer at Digitalis. Jase, welcome to the show.

Jason Bell:

Thank you. That's an intro. In all fairness, you don't have to use Jason and Jase, you just pick one of the two.

Tim Berglund:

Yeah, I know. I like it. The middle one's in quotes; Jason "Jase". Oh, you can't see the quotes. I didn't want to do air finger quotes because it's...

Jason Bell:

So, if I walk down the street, people are going to go, "Hi, Jase."

Tim Berglund:

Exactly. So for anybody who's watching on the YouTube version, seeing me do the air quotes, it's just, it's a bad look.

Jason Bell:

Is it?

Tim Berglund:

[crosstalk 00:01:41] It's a bad look. [crosstalk 00:01:42]

Jason Bell:

Oh, L-O-O-K, not L-U-C-K?

Tim Berglund:

L-O-O-K, yes. I don't know what the other one means. Okay, that might be a British English thing.

Jason Bell:

I'm glad we've covered that now.

Tim Berglund:

Yeah. No, it's important. We're going to cover other things. So a few links in the show notes, I'd like to link to several Morning Morning (Yes, I Have Tea) tweets. I think people need to get a sense of the grammar of the series. And...

Jason Bell:

Look, there's not much to it in all fairness.

Tim Berglund:

Ah, I think it's a deeper well than you think.

Jason Bell:

I'm starting to think that myself, because some people have actually said, "We now go to bed when you have tea."

Tim Berglund:

Right. I go to bed a little early. I'm usually asleep before you have tea. But I look in the morning and most mornings I catch it.

Jason Bell:

Interestingly, someone mentioned that it was the most regular thing as part of the pandemic was knowing. It was really weird when someone put it that way. It's like we had that to hang onto. Really? Yeah, so it was quite nice.

Tim Berglund:

And you can't underestimate those things. I mean, we're recording this late July 2021, people are talking about the Delta variant, but life is becoming post-pandemic we think.

Jason Bell:

Yeah. I still got to get out of the car.

Tim Berglund:

But little consistencies like that. I have a younger brother who's a letter carrier and especially early on in the pandemic, just the, "Okay, mail is still happening." You got to hang on to these little things. Little parts of life that just don't seem to go away and sometimes they're a comfort, sometimes they're not. Like, say, segue approaching, dead letter queues.

Jason Bell:

That was pretty good.

Tim Berglund:

Yeah, that was a good segue. I thought it was so good I had to hang a lantern on it.

Jason Bell:

It was seamless. I'll give you that, Tim.

Tim Berglund:

Right? A month or so ago, we had a conversation on Twitter. We'll link to that conversation. I'm not sure that it was really all that deep, but we had a brief conversation about dead letter queues. And to be honest, I don't even remember what we were saying about them. I just remember that was the topic and I'm like, "You need to be here. You need to come back on the podcast. The triumph of hope over experience for you as a returning guest. And we need to talk about this."

Jason Bell:

Who wasn't free today? Just tell me.

Tim Berglund:

Fair enough. Fair enough, I guess we were both free today, weren't we?

Jason Bell:

There you go. Well, that's fair enough. Yeah, okay. Okay, let's go with that. Yeah, it was a... I can't even remember the tweets right now. I do know where it originated from: it came from my friend, Bruce Durling, who is the CEO of Mastodon C. I used to work for Mastodon C a long time ago. And Bruce is a dear friend and he's a great data nerd and he taught me Clojure programming, among a few other people, which is now my main language of choice. Not within Digitalis because I do DevOps so I'm not really doing much programming, but when I do any programming it's in Clojure. And I can't even remember, now, how that conversation came about.

Tim Berglund:

Well, I just looked up the tweet. Let me refresh your memory. You posted a picture that was what my people would call a trash can and it said "event litter bin."

Jason Bell:

Ah, yes. That was it, yeah.

Tim Berglund:

"Kafka dead letter queues, also known as..." and then someone else said, "Why does Kafka need a dead letter queue? It is a tape, not a queue." I'd say log, not a queue. "It's like reading a CSV." I would not say that. And Jase said, "Do you want a proper reason?" You said, "Happy to present a lecture in your town soon." And I said, "It's on. Get on my podcast."

Jason Bell:

Yeah, so I actually visited Bruce very briefly a couple of months ago when I was driving across to Scotland. And I did see him, but we didn't talk about that, we just hung out together which was very nice. I've not seen him in a while. Anyway. Yeah, dead letter queues. Where should we start with this one?

Tim Berglund:

Yeah, is it okay to call them queues or can I call them dead letter topics?

Jason Bell:

No. Yeah, so I'm trying to get my brain to say dead letter topics, but the thing is that gives me the abbreviation of DLT, which is basically the abbreviation of a DJ in the UK in the 1970s. So it's in my head where I just go DLQ. It's still dead letter queues to me. It's just language.

Tim Berglund:

It is, and Kafka topics aren't queues, but it's the standard term.

Jason Bell:

So in fact, what does Mr. Moffatt call them? Well actually, even in the Connect, in the Confluent Cloud documentation it says dead letter queue.

Tim Berglund:

Yes sir, it does.

Jason Bell:

So let's call them that.

Tim Berglund:

Yeah, okay. Dead letter queue.

Jason Bell:

Sorry, I've got [inaudible 00:07:18] big monitor.

Tim Berglund:

For those not on the YouTube feed, Jase is looking at a big monitor. I'm occasionally looking at mine. He's looking around what is apparently a wall screen in front of him.

Jason Bell:

No, it's not that big, but it's there. My show notes are here, my monitors are there, and you're here.

Tim Berglund:

It all falls into place.

Jason Bell:

Feels like I'm waffling a bit. I'm sorry. [crosstalk 00:07:45]

Tim Berglund:

That's okay. So, in case there's anybody, maybe back us up a step. What if there's somebody new and who hasn't done a lot of work in messaging and isn't even familiar with the original concept. What is the original concept?

Jason Bell:

So in an ideal world, especially in Kafka, we produce a message and then we consume a message. That's all fine. So it goes from here to here and everyone's happy. Occasionally, due to some reason or another, that is going to fail. And the message gets so far and goes, "No, not going to deal with that." And it goes and gets redirected to a dead letter queue, topic, and then we deal with it later or not, as the case may be, which we'll end up talking about as time goes on. So it's essentially another topic where we store messages that have caused errors. Caused by what? We'll get to discuss. But that's the basic premise. And then [crosstalk 00:08:42]

Tim Berglund:

Something about the computational process we were using for doing work on messages failed. Just, the program doesn't know what to do.

Jason Bell:

Yeah. So a good case in point: deserialization is a big thing. So if I was deserializing an Avro message and someone by accident, and it does happen, passed in either a [inaudible] value or a JSON message, the deserialization would fail and we would throw it into the dead letter queue. That's basically it. Within the Kafka realm, we have certain considerations to take whether it is just a normal producer, whether it's the Streams API job, whether it's Kafka Connect, or whether it's ksql.
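
The failure mode described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: JSON stands in for Avro so the block needs no Kafka or Avro libraries, and the names `deserialize` and `split_records` are hypothetical helpers, not part of any Kafka API.

```python
import json


def deserialize(raw: bytes) -> dict:
    """Stand-in deserializer (JSON instead of Avro, purely for illustration)."""
    value = json.loads(raw.decode("utf-8"))
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value


def split_records(raw_records):
    """Return (good, dead_letters); records that fail to deserialize are set aside."""
    good, dead_letters = [], []
    for raw in raw_records:
        try:
            good.append(deserialize(raw))
        except ValueError as exc:  # also covers JSONDecodeError and UnicodeDecodeError
            # Keep the original bytes plus the reason, so something can act on it later.
            dead_letters.append({"raw": raw, "error": type(exc).__name__})
    return good, dead_letters
```

In a real pipeline the `dead_letters` list would be produced to a separate topic rather than held in memory; the point is only that the bad record is taken out of the main flow together with the reason it failed.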

Tim Berglund:

Okay.

Jason Bell:

Does that answer your question, Tim?

Tim Berglund:

It does. Backing up farther though in the pre-Kafka world. We say queue because this comes from messaging systems. What was a dead letter... I think basically the same thing, but what were they there, and if I lived in the messaging world, why would they have a place of prominence in my mind?

Jason Bell:

You know what? That's a really good question. And one I can't answer.

Tim Berglund:

All right, there you go.

Jason Bell:

You got me.

Tim Berglund:

The same basic idea though, is you've got messages going into queues and inside the messaging system, some sort of logic about where they ought to go next and say routing or something like that, and that fails. And so anytime the computational process for figuring out what to do with a message next isn't successful, then you've got this place to put it, which is the dead letter queue where messages go to die. And this was an expected thing in your old-school enterprise messaging system. It was a feature of the system.

Jason Bell:

Yeah, which does really bring up the... You use a dead letter queue obviously to take those messages out because they're going to cause you problems downstream, but at the same time, are you ever going to use them? That's one of those debates and it really depends on the type of data it is. When I talk about this with teams and advising people and all the rest of it, it's like, would you miss the message? If it failed, would you miss it? And if they go, "Oh no, it's no big deal to us," then fine. You don't need the DLQ you just get rid of it. Just handle the exception and move on.

Tim Berglund:

Yep. Log the exception. Send the exception to the browser window like a good programmer and...

Jason Bell:

Send it to Elasticsearch, please.

Tim Berglund:

There you go. Put it somewhere where it can be found and drop the message. So in the spirit of not burying the lead, I think that's the, at least in my opinion, I think you just expressed the rule. Are you going to do anything with it? [crosstalk 00:11:56] process?

Jason Bell:

And what is surprising, and what I found out over the last few years is people will say, "Oh yeah, we'll just put this message in the dead letter queue." And it's like, well what are you going to do with it? "Oh, don't know." So you've just got all these messages building up. Are you going to look at them? Because you need to write something to consume the DLQ as well, and this seems to escape some people.

Tim Berglund:

The serialization thing, that's a really clean example. Can you walk us through it? Because you actually have an opportunity to take a second pass computationally at the messages in the dead letter queue. Can you walk us through that use case?

Jason Bell:

Well, you literally want me to walk. I mean, I sat down right now. You literally want me to walk it?

Tim Berglund:

No. I want you to sit and drink tea. I pace when I talk sometimes. It's a lot of work for me to sit down for this podcast.

Jason Bell:

I would do, normally, but I'm wired to this thing. So the laptop's going to go everywhere, it's going to make a mess. That's why [crosstalk 00:12:59].

Tim Berglund:

Metaphorically.

Jason Bell:

Metaphorically. So yeah, the deserialization is actually a really interesting one because, as we know, we should all be doing things in Avro or Protobuf or Parquet or... To make sure that our messages are formally accepted and work on both ends from producing to consuming. So we serialize at the producer side and we deserialize at the consumer side, that's fine. Whether that's Streams, Kafka Connect...

Tim Berglund:

And there's only one producer and it's always serializing the same way, perfectly, all the time, and everything is fine?

Jason Bell:

In an ideal world.

Tim Berglund:

That's a world that is like a Star Trek episode, yes.

Jason Bell:

Yeah, big data jokes parked, the number of people that send me that gif, thank you, I love you all dearly, but please don't send me that gif again. Yeah, so I'm going to get onto the guts of Kafka Connect later, but a lot of my work is around Kafka Connect. So the deserialization at that point, Kafka Connect would batch 500 messages by default from the topic, it would deserialize them, and then do the processing. If there are any transformation steps in there it will do that too. But what I'm talking about here is just the sink consuming the messages, deserialization, and then forwarding on to whatever that process is.

Jason Bell:

Whether it's writing to a file or an HTTP endpoint, or a database, or what have you. We have to deserialize from that Avro message, as an example. So if that Avro message fails during deserialization for whatever reason, it could be an incorrect schema, the check against Schema Registry is not working, Schema Registry could be down, but obviously, you should have two of those load balanced. I'm full of good information now. Yeah, so your deserialization would happen. If a message fails, that message would get routed to the dead letter queue, assuming you have the settings set in your Kafka Connect sink configuration.

Tim Berglund:

Right, which is actually a feature of Connect. If something fails, put it on this other topic. Yeah, and it's called dead letter queue in the documentation.
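
For readers who have not seen those settings, here is a hedged sketch of what a sink connector definition with a dead letter queue might look like, built as a Python dict and rendered to the JSON that the Kafka Connect REST API accepts. The `errors.*` keys are standard Kafka Connect sink options; the connector name, topic names, and connection URL are hypothetical placeholders.

```python
import json

# Hypothetical JDBC sink definition; only the errors.* keys matter for the DLQ.
dlq_sink_config = {
    "name": "orders-jdbc-sink",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "orders",  # hypothetical source topic
        "connection.url": "jdbc:postgresql://db.example.internal:5432/shop",  # hypothetical
        # Keep running when conversion or transformation of a record fails...
        "errors.tolerance": "all",
        # ...and route the failed record to this topic instead of dropping it.
        "errors.deadletterqueue.topic.name": "dlq-orders",  # hypothetical DLQ topic
        "errors.deadletterqueue.topic.replication.factor": "3",
        # Attach the failure reason and origin as record headers.
        "errors.deadletterqueue.context.headers.enable": "true",
        "errors.log.enable": "true",
    },
}

payload = json.dumps(dlq_sink_config, indent=2)
```

The resulting `payload` is the kind of body one would POST to a Connect worker's `/connectors` endpoint. With `errors.tolerance` left at its default of `none`, the task fails on the first bad record and nothing is written to the dead letter queue.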

Jason Bell:

It is, that's why I call it dead letter queue and not dead letter topic. It's in my head. It's just ingrained in...

Tim Berglund:

And I realize I am getting a little ahead of ourselves with the Connect.

Jason Bell:

But no, it's actually a good place to start in that explanation. It's actually the perfect use case of knowing when that DLQ is going to be used. Because officially we don't have DLQs in the Streams API. You are either handling a deserialization exception or a producer exception, which is configurable. It's either continue or fail at that point in the Streams API, yeah. But you ain't stopping unless you are failing and then you do stop. Fine, fair enough. But it's up to you to handle that message within the Streams API, whether you forward it to another topic to say that that's failed or you just ignore it. That goes back to my original argument of, do you need it? So anyway, we're tangenting off here and I'm trying to bring that... That's not your fault, my fault entirely.
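
The "continue or fail" choice maps onto a Kafka Streams configuration property. The sketch below renders that property as Java-style properties text from Python. The handler class names are the ones shipped with Kafka Streams; the property name is the one used by Kafka Streams releases around the time of this recording (later releases may name it differently), and the application id and bootstrap address are hypothetical.

```python
def streams_error_properties(on_bad_record: str) -> str:
    """Render a properties snippet choosing how Streams reacts to undeserializable records."""
    handlers = {
        "continue": "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler",
        "fail": "org.apache.kafka.streams.errors.LogAndFailExceptionHandler",
    }
    if on_bad_record not in handlers:
        raise ValueError("on_bad_record must be 'continue' or 'fail'")
    props = {
        "application.id": "payments-app",  # hypothetical
        "bootstrap.servers": "localhost:9092",  # hypothetical
        "default.deserialization.exception.handler": handlers[on_bad_record],
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())
```

Neither built-in handler writes the bad record anywhere. Forwarding it to a topic of your own, as described above, means writing a custom handler or catching the problem inside the topology.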

Tim Berglund:

But the planning question, beginning with the planning question on dead letter queues, the question is really, do you need it? Or to make that a little sharper, is there going to be a consumer? Is this just a waste bin or is there at least a plan for something to read these messages?

Jason Bell:

Well, let's take it back a step further. What's the nature of the message? If it's a payment message, if it's a payment command, you don't want to be losing that.

Tim Berglund:

No, you don't.

Jason Bell:

Okay. You don't want to be losing money. I don't want to be losing money. I'm English. I'm Yorkshire. I definitely don't want to be losing money. So yeah, it all depends on the nature of the topic. Depends on the nature of the message that you're processing. If it's a payment thing, then you definitely need it in a DLQ to handle it later.

Tim Berglund:

Yeah. And that might even be some sort of human in the loop process for that, right? If the computational process fails, somebody might need to take a look at what's wrong with the message, right?

Jason Bell:

Yes. And it doesn't have to be a DLQ. It's actually quite an interesting design pattern, and it's one that I've seen. Uber did a post on this a long while ago where it wasn't so much a DLQ, the DLQ actually came at the end. But what they said is, "You produce a message, the consumer takes it, tries against a service. If the service fails, what do you do with it? Let's write it to a retry topic first and then you have a consumer on the retry topic. Tries that same service again. If it fails, it goes to a second retry topic." And it goes down this little chain. Obviously, you have a timeout, stopping for a period of time, in case anything's happened with the service. It might need restarting. There might be an issue with that. And you go through that by N number of steps, and for each retry topic that you have, you have a specific consumer trying against that topic.

Jason Bell:

If it fails, you forward it onto the next retry topic. If by the last retry topic it's failed, then it goes to a DLQ and it gets handled [inaudible 00:18:37] a manual process, which is actually probably the perfect case for that. You know, microservices fail, they get restarted again. If you've got pods on Kubernetes and they fail, they just come back up again. So if your payment service on Kubernetes, for example, it dies, it restarts, it joins back into the cluster as available, and you can then start picking up where you left off. That timeout may only be 20, 30 seconds. And you don't want something like that going to a dead letter queue because you still have potential. If it's a website, you've still got a user waiting, potentially in 30 seconds. "Please wait." You can have a spinner going on. That's how I see things from a customer perspective. So I'm end to end.
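
The retry chain described here can be modeled in plain Python. This is a simplified, in-memory sketch and not the code from the post mentioned above: topics are lists in a dict, the function name `run_retry_chain` is hypothetical, and the delay between attempts is left out.

```python
from collections import defaultdict


def run_retry_chain(message, call_service, retry_topics, dlq_topic, topics):
    """Try the service once, then once more per retry topic; dead-letter on final failure."""
    attempts = 1 + len(retry_topics)  # the original attempt plus one per retry topic
    for attempt in range(attempts):
        try:
            call_service(message)
            return "processed"
        except Exception:
            if attempt < len(retry_topics):
                # Hand the message to the next retry topic; a real consumer on that
                # topic would wait for a timeout before calling the service again.
                topics[retry_topics[attempt]].append(message)
            else:
                # Every retry stage has failed: the message reaches the dead end.
                topics[dlq_topic].append(message)
    return "dead-lettered"
```

A message only reaches the DLQ after every retry stage has had its turn, which is why a service that comes back within 20 or 30 seconds never produces a dead letter in this model.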

Tim Berglund:

Yes, [crosstalk 00:19:26].

Jason Bell:

Oh, absolutely. And don't get me wrong, there's nothing wrong with narrow focus as well. I'm not wearing my t-shirt, I'm sorry. I'm not wearing my [inaudible 00:19:40] Kafka dog t-shirt.

Tim Berglund:

That's all right, also not on the brand today. This is my Star Wars-Magritte remix. [crosstalk 00:19:52]

Jason Bell:

That's excellent. I wore that t-shirt at Kafka [inaudible 00:19:59].

Tim Berglund:

Very good. Very good.

Jason Bell:

Anyway. So yeah, depending on what the customer focus is, 30 seconds of waiting for payment, a retry is actually the best thing. You don't want it in a dead letter queue because it just says, "Oh, there's a problem with payment. Please try later." Yeah, so from the customer experience, is that good or bad? And it was actually a really interesting post for me, but not necessarily about the customer side, but just say, "Retry, there's a timeout, retry again, there's a timeout, retry again, it's failed, DLQ." Then we manually handle it or we have a process to then handle it manually. It's not something I've tried myself, but I've been looking into it a lot and thinking actually that's a really good pattern. I like that because I don't want... And it's like that railroad track thing, isn't it? It's just there's a siding there and it's a dead end. There's a buffer at the end and that's it. And that's the dead letter queue as far as I'm concerned. It's like, "Off we go," and it stops.

Tim Berglund:

And you need to think of it as operationally expensive to end up there because somebody has to go look at it. The fundamental distinction I make here. So the Kafka Connect thing, if you have a topic with mixed serialization, there's JSON and Avro in there, and you're expecting it to be Avro on the sink connector. No. Well, I imagine it's happened, right?

Jason Bell:

Mixed serialization. I know it can happen.

Tim Berglund:

Yeah. This is the canonical Connect dead letter queue story because you have a messy topic just because of vagaries of life and Avro serialization fails. Okay, kick it to the dead letter queue. And now you've got a consumer on that that will reserialize in Avro and then you can... That's a nice computational, I know it's horrible to contemplate this, but [crosstalk 00:21:52]

Jason Bell:

It's just going round in my head. [crosstalk 00:21:54]

Tim Berglund:

Right. But the whole story is computational, okay. There's a program that knows what to do at every point. And that is using the Connect feature called dead letter queue and that's responsible and fine. If this is your life, this is how you deal with it and we're here for you. The real distinction, the fundamental distinction I make here is when computational plans have failed, the programs that you have written can't even, then it goes to a dead letter queue. And probably there's going to be, as I put it, a mental process and not a computational one that takes over. Is that too simple?

Jason Bell:

No, but you got me thinking about something. No, it's not too simple. It is a case that, from what you just explained to me, my brain's going, "No, that's just wrong." You send it to a dead letter queue because it deserializes incorrectly, even though you've got mixed deserialization going on. No, that actually sounds like a kludge at that point and not actually a thought-out process. And it's actually something that I go on about, not just in Kafka, I mean generally, is when thinking about data products. And let's be honest, we're all in data products. So it's thinking about end to end. What does that data look like at the start? Where's it going and how is it you're expecting it to look at the end? And at that point, if we bring Kafka into that equation, Kafka might just be the middle bit. Connect, Connect sink, database, website, all of that, bringing all these services in. But I'm just thinking, is there a KIP that you could do now in Kafka Connect to have comma-separated deserializations?

Jason Bell:

There you go. I don't know, it just come to the top of my head. Because if you are saying, "Right, I'm going to send this JSON serialized downstream, and then next there's an Avro." Because the producer can come from two different places. What if one department's serializing in Avro and another department is serializing in JSON, yet into the same topic? Yeah, that's interesting.

Tim Berglund:

Yeah, and I think that's in our Kafka Connect course. Maybe it's in our data pipelines course. It'd have been Kafka Connect. So that'll be on Confluent Developer by the time this episode airs. And that's a pattern that we explain for a completely computational use of a dead letter queue where you're doing it, but now you have this other program that knows what to do. And I think of the end of the line of the DLQ being where it's like panicking, in a kernel.

Jason Bell:

That actually, the panic there, what do we do at that point? Do we halt and catch fire? Where do we stop? Error tolerance equals none is my favorite thing in Kafka Connect sinks. And I've had this with... So this is an interesting one. We've gone to the end first. Anyway, we'll go [crosstalk 00:25:21]

Tim Berglund:

And here we are at the time.

Jason Bell:

Nice talking to you, see you later.

Tim Berglund:

I guess it has been.

Jason Bell:

No, so the error tolerance. There are two error tolerances, all and none. For those who may not know. So in Kafka Connect, in sinks especially, it's quite important. You need to make a decision. And listen, it is back down to discussing things again, on the design level of what the expectation is. And it's something I go on about quite a lot in my Kafka capacity planning talk. Which was actually the one that I did for the summit in Europe.

Tim Berglund:

Making a note.

Jason Bell:

Yeah. We know that a Kafka Connect sink is going to go to an external service. It's going, we're going to write to a file, could go to Elasticsearch, could go to... Now, I would say the majority of cases, the most popular use cases, given by using the JDBC plugin: to write to a database table. Yeah, so my question is, not everyone in the world has sharded replicated database clusters going on. I would probably know if there's one instance of MySQL. I like MySQL, I have no problems with it. In the same way, I have no problems with Postgres or SQLite. I really don't mind.

Jason Bell:

But the question remains: if there is a network failure between the Kafka Connect sink and the database, you get an exception back. If you have error tolerance equals all, it will continue to process but you will lose messages because they don't go to the dead letter queue. Remember the dead letter queues only handle deserialization and transformations. Now there was a KIP a while ago. What was it? It's in my notes. I actually wrote things down today. So the put, we have a convert method, which deserializes the message. We have a transform method, which transforms any data that you have within the configuration, okay. And then the put method is the actual sink sending data out. And it'll send the collection of records and put. If anything happens in that put, on that batch, and it fails, you get an SQL exception, and error tolerance is all, it just spirals out of control.

Jason Bell:

It will just carry on running, but records will never get written. They don't go to a dead letter queue because they can't. There's nowhere for them to go. Now, if you're just doing that in the general JDBC plugin, there are a number of retries, yes. But I don't know what the real... I haven't looked at the code now for the JDBC connector for a while. So the retry policies were in there to say, "Okay, it failed. I'll try again. It failed. I'll try again." N number of times, and then fail. But if error tolerance is all, it's still going to carry on. It's just going to try again.

Jason Bell:

If error tolerance is none, then obviously the connector will stop, it's failed. It hits the failed state. And obviously, you are monitoring your Kafka Connect workers. So therefore you will get an alert and paged and you'll have to wake up and go and fix it and see what's going on.

Tim Berglund:

Fix the network connectivity problem with the database, yeah.

Jason Bell:

Now some databases, like Oracle, for example, you might get IP switching over the network, reconnects, those kinds of things. Sometimes it drops. Sometimes it doesn't, it's quite seamless. I've come across that before. Those are sleepless nights. And the only thing you can do is just go in and reboot Kafka Connect and start it again. But those messages at that point where messages are put from the connector to the destination, if it's a failure at that point, it's too late. Everything to do with the DLQ is done and dusted.

Tim Berglund:

Got it, because you don't have error tolerance equals none.

Jason Bell:

Yeah, exactly. I like error tolerance equals none so I can go and investigate, go and check the logs and see what's going on. That's the kind of guy I am. And I will use a tail.

Tim Berglund:

I know you're not afraid to.

Jason Bell:

I still do everything via CLI.

Tim Berglund:

I mostly do too. Maybe that's a when you came of age thing.

Jason Bell:

It's an age thing, Tim. It's an age thing.

Tim Berglund:

Here we are where I feel comfortable.

Jason Bell:

Yeah, exactly. Just give me a terminal window and I'm happy. I must apologize to Robin, by the way, the kafkacat thing was just like, I can't use that actually. In some organizations, I'm not allowed to.

Tim Berglund:

Yes. Jase is referring to a Twitter conversation of a few days ago where I had tweeted, and I'll try to link from this if I find the tweet, but I tweeted some cool stuff Robin Moffatt had posted about kafkacat.

Jason Bell:

And it is really cool, [crosstalk 00:30:33]

Tim Berglund:

It is a great command-line utility. And Jase jumped in. He said, "Yeah, some places you're not allowed to use that." I'm like, "Okay, that's right."

Jason Bell:

And I'm just digressing from DLQs for a second. It does highlight an interesting thing that when we document these things and educate on these things, it's not always available to everybody. I have to use the CLI and Kafka console consumers and get things out the hard way. I'm not saying, "Feel sorry for me," it's just that's what we have to do. There's no way around it.

Tim Berglund:

And well, this is one of the roles you play in my life. Sometimes I'll tweet something about some kind of starry eyed, "Oh yeah, the cloud is perfect and everybody obviously uses the cloud."

Jason Bell:

Right, the cloud is not perfect because your error tolerance is all.

Tim Berglund:

Right. And you almost say, "Hey, you know there are people who run on-prem on account of they have to." Yes, you are right. So this is a good corrective to the shiny and the cool. The majority of the world is not there.

Jason Bell:

When people ask me how to get into Kafka and they go, "Oh, we use kafkacat and all that." So that's great but learn how to use a CLI definitely first, before you start diving into kafkacat. That's my thought, anyway.

Tim Berglund:

Now, we are, my friend, at time. So closing thoughts on DLQs. What's a takeaway? If you could distill your thoughts on the topic. One thing for people to remember, what would that be?

Jason Bell:

Okay, here's my checklist, one to N, because I don't know where this is going to end yet. First of all, do you actually need to deal with the message if it goes into the DLQ? If it's important to you then obviously yes, you do. Secondly, if you are not using any form of serialization, deserialization, why not? It's available, it's there for you. And to minimize things going into a DLQ, that's where you start. It starts at the source. It starts with the producer. Don't forget that if you are implementing a DLQ, you need something to consume those messages, you need to investigate what went wrong. Was it an application thing? Was it a Kafka thing? Kafka failures do happen and you have to handle them. So when you are using your DLQ, are you going to take in just the message or are you going to take the message and the exception, the message and the metadata? There are various controls you have, especially in Kafka Connect, about what information you can put into that DLQ. Okay.
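
On the "message and the exception, the message and the metadata" point: when a Kafka Connect sink has `errors.deadletterqueue.context.headers.enable` set to `true`, the failure context travels as record headers whose keys start with `__connect.errors.`. Below is a small sketch of how a DLQ consumer might pull that context out, assuming headers arrive as a list of `(key, bytes)` pairs, which is how common Python Kafka clients expose them. The function name `error_context` is hypothetical.

```python
CONNECT_ERROR_PREFIX = "__connect.errors."


def error_context(headers):
    """Collect Kafka Connect error headers into a plain dict of strings."""
    context = {}
    for key, value in headers:
        if key.startswith(CONNECT_ERROR_PREFIX) and value is not None:
            short_key = key[len(CONNECT_ERROR_PREFIX):]
            # Connect writes these header values as UTF-8 text.
            context[short_key] = value.decode("utf-8", errors="replace")
    return context
```

Fields such as the original topic, partition, offset, and the exception class are what make it possible to answer "was it an application thing or a Kafka thing?" without replaying the whole source topic.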

Jason Bell:

If it's important stuff like payments and things that involve customers, yeah, you need a DLQ. Okay. Bearing in mind things like the Streams API and ksql will continue to stream on failure by default. So if you do come across messages within those components that need handling it's up to you to handle them into some form of topic of your choosing. But the same rule applies. If you're routing things to an exception topic, for example, you still need to read from it and you need to know what's going on. Plus, my final point, 1, 2, 3, 4, 5, I think it's five. I think I'm not really taking count. And now I've completely lost my train of thought. The DLQ itself, when you're consuming, if ordering is important to you, how do you then get those messages back into order?

Jason Bell:

So if it's a social network, I make a post, the post gets published, the person replies to the post, I delete the post, but the delete never goes through and goes to a DLQ for some reason, how do I handle that? Or account creation goes to a DLQ, but then someone updates the account. What happens? You have to think about the end to end of every message that you're dealing with in those kinds of systems. There are tons of permutations where things can trip you up and catch you out. So, by rule, I would prefer to try and handle things the best I can within the application and not necessarily route to a DLQ. A DLQ is fine, but what I've noticed is people just go, "Oh, it goes to a DLQ. We'll deal with it whenever," when actually it needs to be dealt with in parallel to what's being processed at the same time.

Jason Bell:

But most people, I'd say most people... The temptation, I say the temptation, that's far better and not casting out to other people, that's not true. What I'm saying is, we shouldn't be leaving consuming DLQ messages as a second afterthought. It's actually really important that we still deal with that message as quick as possible.

Tim Berglund:

My guest today has been Jase Bell. Jase, thanks for being part of Streaming Audio.

Jason Bell:

Thank you very much. Nice to be here.

Tim Berglund:

And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer. That's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture and design patterns, executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage. Anyway, as always, I hope this podcast was helpful to you.

Tim Berglund:

If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening, or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support and we'll see you next time.

If you ever wondered what exactly dead letter queues (DLQs) are and how to use them, Jason Bell (Senior DataOps Engineer, Digitalis) has an answer for you. Dead letter queues are a feature of Kafka Connect that act as the destination for messages that fail because of errors such as improper message deserialization and improper message formatting. Much of Jason’s work is around Kafka Connect and the Kafka Streams API, and in this episode, he explains the fundamentals of dead letter queues, how to use them, and the parameters around them.

For example, when deserializing an Avro message, the deserialization could fail if the message passed through is not Avro or has a value that doesn’t match the expected wire format, at which point the message will be rerouted into the dead letter queue for reprocessing. The Apache Kafka® topic will reprocess the message with the appropriate converter and send it back onto the sink. For a JSON error message, you’ll need another JSON connector to process the message out of the dead letter queue before it can be sent back to the sink.
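
As a rough illustration of the wire-format check described above, the sketch below assumes the Confluent Schema Registry framing: a zero "magic" byte, then a 4-byte big-endian schema ID, then the Avro body. The function name `split_confluent_frame` is hypothetical, and the code only inspects the framing; it does not decode Avro.

```python
import struct
from typing import Optional, Tuple

# Assumption: Confluent Schema Registry framing, which starts with a zero byte.
MAGIC_BYTE = 0


def split_confluent_frame(payload: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (schema_id, avro_body) if the payload carries the framing, else None."""
    # Too short for magic byte + schema ID, or wrong magic byte: not framed Avro.
    if len(payload) < 5 or payload[0] != MAGIC_BYTE:
        return None
    # Bytes 1-4 hold the schema ID as an unsigned big-endian integer.
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, payload[5:]
```

A plain JSON payload such as `b'{"a": 1}'` returns `None` here, which is the kind of mismatch that would send a record toward the dead letter queue.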

A dead letter queue is configurable for handling a deserialization exception or a producer exception. When deciding if this topic is necessary, consider whether the messages are important and whether there’s a plan to read from it and investigate why the errors occur. In some scenarios, it’s important to handle the messages manually or have a manual process in place to handle error messages if reprocessing continues to fail. For example, payment messages should be dealt with in parallel for a better customer experience.
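
To make the configuration side concrete, here is a minimal sketch of a Kafka Connect sink connector definition with a dead letter queue enabled, written as a Python dict that could be serialized to JSON for the Connect REST API. The `errors.*` keys are Kafka Connect's error-handling settings for sink connectors; the connector name, connector class, and topic names are hypothetical placeholders.

```python
import json

# Hypothetical connector name, class, and topics; only the errors.* keys are
# Kafka Connect's own error-handling settings.
dlq_sink_config = {
    "name": "payments-sink",
    "config": {
        "connector.class": "com.example.PaymentsSinkConnector",  # hypothetical
        "topics": "payments",
        # Keep the task running when a record fails instead of stopping it.
        "errors.tolerance": "all",
        # Route failed records to this topic.
        "errors.deadletterqueue.topic.name": "payments-dlq",
        "errors.deadletterqueue.topic.replication.factor": "3",
        # Attach error context (source topic, exception, and so on) as headers.
        "errors.deadletterqueue.context.headers.enable": "true",
        # Also log failures, without writing record contents to the log.
        "errors.log.enable": "true",
        "errors.log.include.messages": "false",
    },
}

print(json.dumps(dlq_sink_config, indent=2))
```

Enabling the context headers is what decides whether the queue holds just the message or the message plus information about the failure.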

Jason also shares some key takeaways on the dead letter queue: 

  • If the message is important, such as a payment, you need to deal with the message if it goes into the dead letter queue 
  • To minimize message routing into the dead letter queue, it’s important to ensure successful data serialization at the source
  • When implementing a dead letter queue, you need a process to consume the message and investigate the errors 
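
The last takeaway, having a process that consumes the dead letter queue and investigates errors, might start with something like the sketch below. It assumes the records were written by Kafka Connect with context headers enabled, so each carries `__connect.errors.*` headers, and that headers arrive as a list of `(key, bytes)` tuples, as the common Python Kafka clients return them. The function names and sample data are hypothetical, and the consumer loop itself is left out.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Headers = List[Tuple[str, Optional[bytes]]]
PREFIX = "__connect.errors."


def error_context(headers: Headers) -> Dict[str, str]:
    """Collect Kafka Connect error-context headers into a plain dict."""
    context = {}
    for key, value in headers or []:
        if key.startswith(PREFIX) and value is not None:
            # Strip the prefix so keys read as "topic", "exception.class.name", ...
            context[key[len(PREFIX):]] = value.decode("utf-8", errors="replace")
    return context


def count_by_exception(records: Iterable[Headers]) -> Counter:
    """Tally DLQ records by exception class so recurring failures stand out."""
    return Counter(
        error_context(h).get("exception.class.name", "unknown") for h in records
    )


# Hypothetical headers from one dead-lettered record.
sample = [
    ("__connect.errors.topic", b"payments"),
    ("__connect.errors.exception.class.name",
     b"org.apache.kafka.connect.errors.DataException"),
]
print(error_context(sample))
```

Grouping by exception class is only a first triage step; deciding whether to replay, repair, or escalate a message still depends on the application.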

