Mitch Seymour is a staff engineer with MailChimp and the author of O'Reilly's Mastering Kafka Streams and ksqlDB, a recently released book about Kafka Streams and ksqlDB. And he's really just a great guy and the sort of person that we probably want to listen to on these topics. We get to do just that on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the Cloud. Hello, and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined today by my friend, Mitch Seymour. Mitch is a staff engineer at MailChimp and he's an author, we'll talk about his book a little bit. Mitch, welcome to the show.
Thank you. Appreciate it.
You got it. We were talking before we started rolling, I just really thought you'd been on the show before and you were one of those people I was going to say, returning guest. It just seemed like it because we've talked and you're a pretty active guy in the community, but welcome to this, your first and hopefully not last appearance. We'll see how I do. It'll be up to you whether you want to come back.
I'm sure you'll do great. You're pretty experienced at this.
I mean, we've done some episodes, I suppose. So I'll do my best to treat you well. [crosstalk 00:01:30] I want to talk about two things. Number one, your book, and number two, I'm looking over at your book. If you're watching the video, I just looked to the side because I'm literally looking at the book. And also stream processing at MailChimp and what you've done there. You've been there for a while and this is obviously a thing that's a big part of your life professionally. So I want to get into that. First, again, for those watching on YouTube ... Number one, if you're watching, I'll take the book away for a second. I'll bring it back in a minute.
If you're listening to this podcast, which seems like the normal thing to do, that's how I consume all my podcasts. Avid podcast listener never watched the video. But now in this world that we're in, it's somewhat expected to have a video version as well. And we're very glad if you're joining us by video. And you get to see the cover of this beautiful book. Let me not obscure it with my hands. This is Mastering Kafka Streams and ksqlDB by Sir Mitch Seymour, Esquire, is what it says on mine, for [inaudible 00:02:38]. And I don't want to make you jealous everybody, but I have an inscription. I can't read it. It's a private message from Mitch to me.
It's very sensitive.
It is, it really is. We're going to talk about this book. I just wanted everybody to know this book exists, probably a book you should own if this is what you do. This message was brought to you by the Mitch Seymour marketing council.
I did not ask Tim to do this, but I really appreciate it.
So completely did not. If you know Mitch, you didn't even need to hear that, because you know Mitch would not ask me to do that. But it's a good book and I want people to know about it. So do you want to talk about the book or do you want to ... Does that just end up coming out if we could just make reference to it as we talk about what's going on at MailChimp? Your call.
Yeah, I think it ends up coming out in my experience at MailChimp, but yeah, I will preface it by saying the book took over a year to write. I poured everything I had into it. So, yeah, I feel good with where it's at and I hope you give it a shot.
Again, I'm holding it up. For those of you who are not watching on video, I'm holding the book edge-wise so that you can see all of its, not counting index, 390 pages. And then a few pages of index. It is an imposing volume, though accessible, and practical. So yeah, there's a lot of work. I've written a much smaller book and it felt like that was a lot of work.
I did go over my page count. I had to ask O'Reilly for more because there's just so much to talk about in this space. And I grossly underestimated how much content there was that I wanted to communicate and-
Odd for a professional software developer to underestimate a thing like that. Shocker.
That's right, but at the same time, I tried to not be too verbose. I have a philosophy on code and writing in general, it should mimic a haiku. It should be very short, succinct, to the point. So I tried to accomplish that, but there's just so much material around Kafka Streams and ksqlDB that I think that's the right page count.
Yeah, no, you want to be more like Hemingway and less like Dickens, less like Hawthorne, in writing technical books. We're here to learn stuff and beautiful writing is always a good thing, but brevity is the soul of utility in the case of the tech book. So let's talk about what you've built at MailChimp. MailChimp, I think everybody knows, is an email service. If you want to send mass emails to people and know who reads what, clicks on what, and you don't want to be a spammer and you want to do this in a way that it's not misinterpreted as spam, that's MailChimp. Would you add anything to that?
Yeah, no, that's pretty accurate. And I would say we've evolved into a more powerful marketing platform in general. Email is definitely a core function, but we have a lot of smart features. We have a strong data science team and ML engineering team, and, yeah, I would just say a more general-purpose marketing platform is what we are.
Nice. So how did streaming get started? You've been there for a few years. So what was your introduction to streaming? Talk us through that story and how Kafka and Kafka Streams enter the picture.
Sure. So MailChimp started using Kafka when the ecosystem was still very young. Before Kafka Connect, Kafka Streams, and of course ksqlDB.
Before Connect, even. That's 2015, right?
Yeah. So I tried to dig up an exact date when we started using it, and I couldn't find an exact date, but it was definitely at least early 2014 when we started using Kafka. So our use cases were pretty standard in the beginning. So we were shipping application and server logs to Kafka and then writing those logs to Elasticsearch. And then we were also doing some change data capture. So MailChimp has a lot of relational databases that power the application. And so we would stream changes from those databases into Kafka. And then we had a lot of stream processing applications that would use that data for various things, including supporting some MailChimp Pro features.
So yeah, I would say it started off pretty simple, but because we were early on the Kafka side, I think we experienced a lot of the ... I'll just say maybe rough edges or complexities of building stream processing applications without some of these tools, like Connect and Kafka Streams. So I think it has basically given me a deeply rooted appreciation for these technologies. This is the big motivation behind my book, I wanted to share it with people because I've seen a world without these technologies and it's not as fun.
It's a dark, dystopian place. It's always raining and smoky and, yeah, it's not a world you want to live in. I don't remember who said it, but somebody said every sufficiently complex C# program has a buggy, partial implementation of Lisp buried within it. I mean, I've seen large C# programs and that's always been surprising to me. Maybe I just don't or didn't know Lisp well enough at the time because I didn't see it. But the phrase buggy, partial implementation, when you look at early Kafka adopters, especially prior to 2015 ... Yelp is another one that always comes to mind because they've spoken a lot about their experiences early on. They built their own. So if you don't have a Connect and you don't have a Kafka Streams, you're going to need one. Those things have emerged as a part of the ecosystem because they're pretty much always necessary.
And so if you have the foresight and energy and enthusiasm and good judgment to have been an early adopter, you end up with buggy, partial implementations of those pieces of infrastructure. Because you had no choice but to build them. And you're too busy being MailChimp and evolving into a full-fledged marketing platform with email as a strong component to make world-class, pluggable integration platforms and stream processing frameworks and things like that. So you never quite get there. So it's good that you've seen that. I tell this story a lot when I give my basic introduction to Kafka talk. And I love to hear it from people who have lived it. So, thank you.
Yeah, definitely. And we tried building some of this stuff ourselves. So before Kafka Streams, we recognized the need to reuse some abstractions and code when building these applications. And I don't think we ever got the abstractions quite right like Kafka Streams does, so our version didn't really last long. And we gladly adopted Kafka Streams when it came along. But yeah, I refer to those days as the wild west days where everybody was forging their own path without a lot of guidance or rules. And yeah, I mean, just in general, there's always been a lot of excitement around Kafka at MailChimp. And when you basically are developing applications that have a lot of individuality, because there's no patterns or library to use, it becomes hard to maintain, I would say.
So yeah, when Kafka Streams came about, we were very on board with using that and it solved a lot of our problems. But the hackathons were also really interesting, in the pre-Kafka Streams world, where we would basically have a company hackathon and everybody was so excited to use Kafka that they would come and develop, in a short period of time, just so many variations of a stream processing application. And it was kind of interesting to see, but yeah, I guess all of that is to say, I've come to appreciate Kafka Streams for that reason. To help build applications in a more uniform way, I'll say.
Yeah. Yeah. Yeah. And also, I always want to be careful when I talk about the buggy, partial implementation thing because it's not like MailChimp lacks engineering talent. I'm talking to some of it right now. But when you've got a community of people, all focused on contributing to a framework, and even just designing an API. I mean, you know how hard that is. If you're one person, or maybe three people, thinking about that, it's super hard to get right. If you've got 20, 25 people all debating the shape of that API, you're going to end up with something better. And then in terms of implementation, you've got more people, and more people running it against different kinds of use cases. It's just, the distributed development effort on a thing like Streams is always going to yield a better result. Even if you're a three sigma, four sigma, world-class engineering team. It's just, you get all that input, you get a better thing. I guess thus concludes my argument for open source.
Yeah, I totally agree. And yeah, I kind of talk about the pre-Kafka Streams world in my book a little. But yeah, there's definitely a huge value for Kafka Streams. And I think we were lucky that it came out when it did because, at the time I joined, a new business use case was cropping up. And basically, the core business function we're talking about, sending emails, we had built our own MTAs. Our delivery team has. Which is the software that-
MTAs, for those of you ... Yeah. [crosstalk 00:13:23].
Yeah, exactly. So the software that sends emails. So we needed a way to monitor those systems in real-time. And Kafka Streams came out right before we tackled that problem. And I will say I'm eternally grateful for that because solving that without Kafka Streams would have been pretty difficult.
Would have been. Let's just go with, not fun. You use this term, software speciation, and I love ... I've just had an affection recently for analogies to biology and analogies to psychology. I've been just doing some of that. But what do you mean by that? How does that relate to your Streams adoption journey at MailChimp?
Yeah, I think it comes down to uniformity. So when you have a large number of applications that are deviating in how they're implemented and how they're designed and, I don't know, they become almost different species of applications that you have to maintain. I think it goes along with, you want to treat your software more like cattle than pets. So software speciation, to me, is just having a bunch of pets, basically, that you have to basically maintain. And I think over time there's a big maintenance cost to basically having several species of a stream processing application or any sort of software. And so I think you can reduce that if you control complexity and individuality in a way that reduces those maintenance costs, I should say.
Got it. So it's a thing to be avoided. In the history of life on earth, the Cambrian explosion, I think most of us look at with a certain aesthetic appreciation, if nothing else. It's a really cool thing. And all of a sudden you've got all these neat animals, and it seems to have worked out well, but in software that's bad because now you've got to care and feed for all that stuff.
Yeah, exactly. It's like yin and yang, in some areas, it's great to have lots of species of something, like in the natural world. When it comes to the systems you maintain, you probably don't want that so much.
You don't, you don't. It's funny because we talk about evolutionary design and evolutionary architecture. And I use those phrases all the time and we all pretty much know what we mean by that. But the analogy breaks down because, yeah, those things don't actually run themselves, in any sense. They're actually very carefully designed and curated and maintained and you can get emergence in systems. That's a good thing. But it's not as if you've said here is this undirected, natural process and a bunch of good things is going to pop out. That doesn't work in what we do.
I like it. Yeah, so you started with the MTA, the mail transfer agent thing, but I think you also talked about an email, or we had talked about an email tracking application, as a case study.
Walk us through it.
Sure. So part of observing these mail agents basically requires us to ... I compare it to a package tracker. At any given point in time, MailChimp is sending tons of emails. We send over a billion a day. And so when we think of emails, we think of probably two primary states. An email can be delivered or it could be bounced. And if you expand a little, you could probably understand that emails might go through a pending state and they also go through queued. So we have all these states that an outgoing email can go through. And to be able to properly support our MTA systems, we need to have an understanding of what states our outgoing emails are in. And we need to be able to aggregate that in a way where we can answer questions like, how many messages are in a given state per domain?
Are we seeing a lot of deferred messages? For example, messages where we've got a soft bounce, a try-again-later type of error. How many of those are we seeing per remote MTA or maybe per server? So we have all these features that we want to aggregate this by. So we've built, with Kafka Streams, basically a package tracker where instead of tracking physical parcels, we track these outgoing emails and they can be in any number of these states. And given the data volume, if you think about over a billion emails per day, and then fan that out by X amount of states that it can go through, and [crosstalk 00:18:52]. Yes, yes.
Okay, that's a lot of state.
It is a lot of state. Yeah, so we basically, yeah, use a state machine model on each outgoing email. And it's possible with Kafka Streams, and the stateful operators that it comes with, and RocksDB, it's pretty quick because that's a local state store that you can use instead of hopping across the network. So we've been able to build a package tracker for all of our outgoing emails using Kafka Streams. And, yeah, a lot of that experience also informed what I write about in my book. Because we were so new to Kafka Streams that, even though we have this collection of operators that we can use ... That's right. Thanks, Tim. Tim is holding up the book for the podcast listener.
Let the record show. Go on.
That's right. Even though we now, adopting Kafka Streams, had a collection of operators we could use, I personally, and I think a lot of people as well, had trouble framing the problem in the correct way. I didn't know how to understand the application I was building in the correct light. And what I landed on in the book, and later, is that making a distinction between stateless applications and stateful applications is a pretty important thing to do early on. Because it tells you a lot about the operational costs and the nature of what you're deploying. So yeah, being able to distinguish between stateless and stateful is important for us to get, I think, this package tracker right. And to understand the more complex stateful operations we were doing.
Yeah, because stateful, again, my reaction when you say there are a billion emails a day, I guess a lot of those are going to have a short life cycle. So they're going to get to a state of, everything's fine, I can forget about this now. So that makes it less bad. In the happy path, 90-some percent of the time, that's a short period of time, right? That's seconds.
Mm-hmm (affirmative), right.
Okay. Okay, all right. [crosstalk 00:21:20]
Normally seconds, but it can extend into as much as three or four days, actually.
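The package tracker Mitch describes above can be sketched as a toy model. This is not Kafka Streams code, and the state names, transitions, and sample events are illustrative guesses rather than MailChimp's actual ones; it only shows the two ideas at play: a per-key state machine over each outgoing email, and an aggregation of counts per (domain, state).

```python
from collections import Counter

# Illustrative email life cycle: which new states are reachable from the
# current state. "delivered" and "bounced" are terminal (no outgoing edges).
VALID_TRANSITIONS = {
    None: {"queued"},
    "queued": {"pending", "bounced"},
    "pending": {"delivered", "deferred", "bounced"},
    "deferred": {"pending", "bounced"},
}

def track(events):
    """Apply state-change events; return counts per (domain, state)."""
    store = {}  # email_id -> (domain, current_state): the toy "state store"
    for email_id, domain, new_state in events:
        current = store.get(email_id, (domain, None))[1]
        if new_state not in VALID_TRANSITIONS.get(current, set()):
            continue  # ignore invalid transitions rather than corrupt state
        store[email_id] = (domain, new_state)
    return Counter(store.values())

events = [
    ("e1", "example.com", "queued"),
    ("e1", "example.com", "pending"),
    ("e1", "example.com", "delivered"),
    ("e2", "example.com", "queued"),
    ("e2", "example.com", "pending"),
    ("e2", "example.com", "deferred"),
    ("e3", "example.org", "queued"),
]
counts = track(events)
# counts: {("example.com", "delivered"): 1, ("example.com", "deferred"): 1,
#          ("example.org", "queued"): 1}
```

In the real application, the per-email state would live in a RocksDB-backed state store keyed by email, and the per-(domain, state) counts would be a downstream grouped aggregation, but the shape of the problem is the same.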
We've all gotten bounce messages three days later. You're like, "What? Did I send that? I can't remember things." Speaking of state, I mean, I'm stateless. But okay, so I'm just curious, can we talk about the scaling out of that Streams application? Was it many nodes? Was memory a problem in KTables, and what happened there?
Memory wasn't the biggest problem for us. I will say we did realize early on that we needed to leverage tombstones in order to keep the state small. Which, I'm not sure how common knowledge that is, but the idea of tombstones was definitely new to me when I picked up Kafka Streams.
Just describe that to us.
Yeah, so it's basically a message with a key and no value, which acts like a delete marker for a record that you're tracking. So if an email gets hard bounced, or if it gets delivered, you realize you don't need to track it anymore. So if we keep that around in RocksDB, then we've got a lot of state that we have to basically maintain, which isn't great. You don't want a big, stateful application when it can be very small. Because if the application experiences an outage, or needs to replay data in those topics, you want to minimize the number of records you're tracking to make that as quick as possible.
And so you tombstone things, say when an email is successfully delivered ...
You tombstone it. Oh okay, okay. All right, so it ends up being pretty reasonable. I was picturing scaled out to 10 nodes and all kinds of off-heap memory and some kind of craziness. But I guess that was not the limiting factor.
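The tombstone semantics described above can be modeled in a few lines. This is a toy dict-backed "state store", not real Kafka code; in Kafka itself the delete marker is a record with a key and a null value on a compacted topic or a Kafka Streams changelog, and compaction eventually removes both the old records and the tombstone.

```python
def apply_record(store, key, value):
    """Upsert into the store; a None value is a tombstone that deletes the key."""
    if value is None:
        store.pop(key, None)  # tombstone: stop tracking this email entirely
    else:
        store[key] = value
    return store

store = {}
apply_record(store, "email-1", "pending")
apply_record(store, "email-2", "pending")
apply_record(store, "email-1", "delivered")
apply_record(store, "email-1", None)  # delivered, so emit a tombstone
# Only the still-in-flight email remains in state:
# store == {"email-2": "pending"}
```

The payoff is exactly what Mitch describes: after an outage or a replay, the application only has to rebuild state for emails that are still in flight, not for every email ever sent.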
That was not. I think we've been pretty lucky in terms of, scaling has been ... We've added more partitions so we've been able to scale horizontally that way. And we do run several replicas of this application. But yeah, I will say early on, we did try using in-memory state stores for some of this, which probably was just me being unfamiliar with the ramifications of that. But yeah, I mean, we did certainly see high memory usage at that point. But yeah, I would say the trickiest thing, actually, for us, has been generating enough load to stress our application out. So I think that's probably a good sign when you have trouble stressing your application. And we had to do that because a lot of our email volume is seasonal. So as you can imagine, around the holidays, we have surges in outgoing emails. So we want to basically load test ahead of that.
How about ksqlDB? We've been talking about Streams, has there been any ksqlDB happening in your world?
Yeah. So we do use ksqlDB some. We use it for pre-processing some data that gets fed into some machine learning models. And we also have some social feeds that we collect and use ksqlDB for doing some data projections and sentiment analysis on some of the Twitter data and social data that we're capturing. But yeah, I love talking about abstractions with ksqlDB, because starting from the point where we didn't even have Kafka Streams, and then seeing Kafka Streams come onto the scene and make things easier, you can reduce complexity just by using this library.
And then with the introduction of ksqlDB, I think managing complexity almost becomes surgical at that point. Because basically, for most of what you would want to do in the stream processing application, you're now working with SQL, which I think is a language a lot of people are already familiar with. And if you're not, then it's easy to pick up. And the thing I love the most, probably, is the ability to create your own custom functions in ksqlDB. This is actually what my Kafka Summit talk is about if anybody wants to dig that up. And that's when I talk about [crosstalk 00:26:31].
Remind me, which Kafka Summit? That was 2020?
It was 2019, yeah.
2019, okay. Okay. It'll be in the show notes, of course.
But yeah, I think it's really interesting being able to drop in a custom piece of code without impacting the rest of the stream processing logic. It's so contained. I don't know. There's something, dare I say, beautiful about that.
It is beautiful. That is where I think a lot of the magic is. And is [inaudible 00:27:09]. That's not a thing you can do in the cloud right now. And I don't know why. These product managers are all like, "Well, arbitrary code execution in the cloud is hard," and I'm like, come on. It should be easy. So, I mean, kidding of course. That is an incredibly difficult thing to do in a secure way. And I imagine will be solved at some point, but in self-managed ksqlDB, it's frankly pretty simple. There are some annotations and you drop a jar in a directory and you've got this custom function with all this beautiful stuff going on expressed in SQL. And it really is great.
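Real ksqlDB UDFs are Java classes, annotated and packaged into a jar that you drop into a directory, as Tim describes; there is no Python equivalent. As a loose, stdlib-only analogue of the same idea, registering one contained piece of custom code and then calling it from ordinary SQL, here is the pattern with Python's sqlite3 module. The sentiment logic is a deliberately naive placeholder, not anything MailChimp actually runs.

```python
import sqlite3

NEGATIVE_WORDS = {"bounce", "spam", "unsubscribe"}

def naive_sentiment(text):
    """Return -1 if any 'negative' word appears in the text, else 1."""
    return -1 if set(text.lower().split()) & NEGATIVE_WORDS else 1

conn = sqlite3.connect(":memory:")
# Register the custom scalar function under the SQL name SENTIMENT,
# analogous to dropping a UDF jar into a ksqlDB extension directory.
conn.create_function("SENTIMENT", 1, naive_sentiment)
conn.execute("CREATE TABLE tweets (body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?)",
    [("love this newsletter",), ("unsubscribe me please",)],
)
# The custom function drops straight into ordinary SQL:
scores = [row[0] for row in
          conn.execute("SELECT SENTIMENT(body) FROM tweets ORDER BY rowid")]
# scores == [1, -1]
```

The appeal in both systems is the same: the custom code is contained in one function, and the rest of the query logic stays plain SQL.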
Yeah, I agree. And I think I would love to see the ecosystem evolve in that direction too. And yes, I'm saying this, and I should be contributing directly, but I have a new baby on the way.
[crosstalk 00:28:03] you got a second baby on the way, you have other things that you have to do because you're going to die someday and I think it's courteous of you to leave some replacements.
That's right. Yeah. But there's an opportunity there to basically share these units that we're creating and plugging into ksqlDB. [crosstalk 00:28:28]. Sorry, that was a sudden shift by me. But I think, especially in the machine learning domain, you can build some interesting things. I keep going back to sentiment analysis because that's the one I like. But building a reusable and highly functional UDF that you can just drop in your own ksqlDB deployment. Or that you can share with others via ... I think it's something like Connect Hub, except for ksqlDB UDFs. I personally think there's an opportunity there.
That day must come. Well, Mitch, half an hour with you is not long enough. And so maybe we can get back together on the show soon and deep dive into some UDF things that you've done. That would be a fun thing to talk about.
Definitely. For sure.
But, in conclusion, I do want to, again for our video audience, display the cover of this lovely book, Mastering Kafka Streams and ksqlDB, by Mitch Seymour. Link in the show notes. Get a copy. My guest today has been Mitch Seymour. Mitch, thanks for being a part of Streaming Audio.
Thanks so much.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer. That's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra $100 of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening, or reach out in our community Slack or Forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
Building a large, stateful Kafka Streams application that tracks the state of each outgoing email is crucial to marketing automation tools like Mailchimp. Joining us today in this episode, Mitch Seymour, staff engineer at Mailchimp, shares how ksqlDB and Kafka Streams handle the company’s largest source of streaming data.
Almost like a post office, except instead of sending physical parcels, Mailchimp sends billions of emails per day. Monitoring the state of each email can provide visibility into the core business function, and it also returns information about the health of both internal and remote message transfer agents (MTAs). Finding a way to track those MTA systems in real time is pivotal to the success of the business.
Mailchimp is an early Apache Kafka® adopter that started using the technology in 2014, a time before ksqlDB, Kafka Connect, and Kafka Streams came into the picture. The stream processing applications that they were building faced many complexities and rough edges. As their use case evolved and scaled over time at Mailchimp, a large number of applications deviated from the initial implementation and design, so that different applications emerged that they had to maintain. To reduce cost and complexity and to standardize stream processing applications, adopting ksqlDB and Kafka Streams became the solution to their problems. This is what Mitch calls "minimizing software speciation in our software."
It's the idea that applications evolve into multiple systems in response to failure-handling strategies, increased load, and the like. Using different scaling strategies and communication protocols creates system silos and can be challenging to maintain.
Replacing the existing architecture that supported point-to-point communication, the new Mailchimp architecture uses Kafka as its foundation with scalable custom functions, such as a reusable and highly functional user-defined function (UDF). The reporting capabilities have also evolved from Kafka Streams’ interactive queries into enhanced queries with Elasticsearch.
Turning experiences into books, Mitch is also an author of O’Reilly’s Mastering Kafka Streams and ksqlDB and the author and illustrator of Gently Down the Stream: A Gentle Introduction to Apache Kafka.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.