You know, it's not entirely straightforward to make Apache Kafka cloud native. That's why it's good that we have Gwen Shapira on the job. Gwen comes back to the show today to talk about what her team and other teams have done to help take open source Apache Kafka, which was hatched in a world before modern cloud native assumptions were really in place, and bring it to life in a service like Confluent Cloud. It's a great conversation.
Before we get there, a reminder that Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. Go there not right now, but as soon as you're done listening to this episode, and you'll find things like video courses, a library of event-driven design patterns, executable tutorials, it'll get you hands-on coding with Kafka and ksqlDB and Kafka Streams, really everything you could need to get started. When you do, you'll probably want to sign up for Confluent Cloud to do exercises. Use the code PODCAST100 to get an extra hundred dollars of free credit. Now let's get to the show.
Hello and welcome to another episode of Streaming Audio. I'm your host, Tim Berglund, and I have with me today Gwen Shapira, returning guest and host of-
-Ex host. Okay. I would like to say host, but once a featured host of content on this very channel. Gwen, welcome to the show.
It's great being here with your team.
In case anybody doesn't know, what are you up to these days? You're a fairly well-known person in the Kafka community, but what are you doing now?
Yeah, so I am managing the Cloud Native Kafka team, which is part of the Kafka org at Confluent. I think it'll not be a surprise to anyone. Confluent has quite a large organization focusing on Kafka engineering and investing in Kafka. And as part of it, we want to develop features specifically for making Kafka have a better experience in the cloud. And that's something that you really have to think about from the ground up. We always made a joke, right? The cloud is just someone else's computer. But if you just take Kafka and plop it on someone else's computer, well, you have Kafka on someone else's computer. But if you look at what cloud native data services give you, if you look at something like Aurora or S3, they give you a totally different experience. There is nothing you can plop on a machine that will give you the S3 experience, the Aurora experience.
So we took a step back and thought, what will allow Confluent Cloud to give the same kind of experience for Kafka? And we were thinking, okay, elasticity is a big part of it. You never have to think about capacity planning for S3. I never had to. Aurora has Aurora Serverless, which auto scales. So that's clearly really important. Things like being very API driven are very important. Kafka has APIs, but you kind of have to install REST Proxy on the side of it. Confluent Cloud aims to just be API native. Things like going very high scale. Again, you are never afraid that you will run out of S3, that you'll use so much S3 that the cloud will run out of it. We don't want people to ever worry about running out of Kafka.
And then also things like multi-tenancy is kind of important. I mean, Native cloud services allow the people who run them to be very efficient by actually putting a lot of logical customers on the same physical machines. And again, with S3, you don't know who else is running S3 on the same machines, but it's probably everyone. Maybe everyone except Netflix, something like that. So we're simply-
That concept of-
-we're looking at world class cloud services and how can we make Kafka and Confluent Cloud that way.
Work that way. The concept of a machine in S3. You're like, wait, what?
Yeah. I mean, we kind of-
You know they're there, but that's not a thing that you think about in the-
Exactly. Like if you think about it, if I'm asking you, are there machines in S3? You'll be like, yeah, probably, somewhere there have to be.
Yeah. Are there servers and serverless? There are, but you don't know it.
Yeah. And I think people kind of focus on, yeah, of course there are servers, of course there's no real serverless. Everything is ones and zeros running on actual hardware somewhere. But I think it is a kind of engineering tendency to not think about what it looks like to the users and to the customer, and the customer experience is vastly different between giving me five Kafka boxes with that much CPU and that much memory, versus I'm going to have certain topics, 400 partitions, 200 megabytes a second. I need something that will give me that. And tomorrow I actually need 800 megabytes per second. I want it to just give me that.
And when Kafka was born, let's say 2011, that's not pre-cloud, right? The cloud was very much a thing. And it's not pre-S3 either; S3 came about five years before that. But the cloud native experience that you're describing isn't really the context that Kafka was born in. Kafka was born-
Yeah, I think back then, even though cloud was popular, it was still like an SMB thing. If you had a startup, sure, but nobody thought that enterprises would run on the cloud. Nobody thought that one day the Goldman Sachses and Morgan Stanleys of the world would be on the cloud. And even two, three years ago, you'd listen to an interview with JP Morgan and they'd tell you, oh yeah, no, we don't have cloud services. Cloud is good for other people.
For those crazy people in Silicon Valley.
But it's not our thing, we're a bank. And if you talk to the exact same people at JP Morgan today, it's like, oh, of course we have a cloud strategy. I mean, what bank does not have a cloud strategy?
[crosstalk 00:06:49] I mean, that's just obvious of course, we've been doing that all along, right. That transition has been made.
Kafka in 2011, at its inception. Again, the assumption was there are boxes, there's sheet metal. You can cut your finger on the sheet metal, it's cold and noisy in the room, and it dries your skin out. And that's that world.
I do want to point out that the fact that Kafka was designed initially with this kind of mechanical sympathy, with a lot of affinity for the metal, is actually not working against us. It's actually working for us, because one of the things that everyone really wants in a cloud service is really, really consistent performance. Any slowdown is obviously a disruption. And the customer expectations have really changed: yeah, we are buying a service, we want it to be the exact same service, performance included, day in, day out. No nasty surprises. And the fact that Kafka was actually built for this highly performant experience from the ground up is still serving us extremely well.
Good, good. Now, your team recently wrote a series of blog posts on this work: how to take the Kafka that everybody knows and is familiar with, and what you have to do to turn it into this cloud native experience. Those blog posts are linked in the show notes, of course, and you should go read them, but I wanted to give everybody the Gwen Shapira summary. Why don't you just take it away? Let's just talk through-
-what you've been doing.
So first of all, it started by kind of just putting out, if anything, my three-year thesis of what Cloud Native Kafka is about. I don't know, maybe calling it strategy is a bit too much, but it's kind of my belief of where we should be heading and where my team and other teams are driving. Obviously, it's a very collaborative effort, and those are the things I just said, about the experience, about focusing on real users who no longer want to think about boxes. And then I basically had four team leads, two from my team, one from our cloud control plane and infrastructure team, and one from the Kafka storage team, each write a deep dive on the problems in a specific space and our learnings on how to approach them.
And the target audience is not really people who use the cloud, even though they may find it really interesting. It's really people who are writing their own cloud services. Like I have a lot of friends today, I think we both kind of even swapped messages about like [inaudible 00:10:00]. We are just seeing all those companies basically take an open source technology and take it to the cloud. So we really focused on, okay, we are trying to help our friends out here. What best advice can we give them? So in many ways it's more kind of behind the scenes on Confluent Cloud, and something that is a very useful takeaway as a user. But obviously, we're all engineers, and I enjoy reading behind the scenes of S3 as much as everyone else. I know, right.
You always want to peek at how stuff actually works, even though you know there's absolutely no reason for you to care about it.
Look, the first article in this space that I can remember reading, and this was a good 10 years ago or so, was about how Facebook engineers were rewriting Ethernet drivers on the servers that they had in their... You know, like, I mean, this is so great. Early in my career, I cut my teeth on code like that, like device code. And so that was super fun. And I'm like, oh man, yeah, I guess you would have to do that to make that work at that scale. Yeah, it's fun. Pointless, but you know we need-
I was inspired to a career-changing degree by reading the Dynamo paper from Amazon. That was like 2008.
Like this was the first time I was like, oh, actually, you know until now I had a successful career on doing relational database architectures, but I was a data architect, I can go a step beyond that. So if I hadn't read that paper, who knows if I would've ever known to appreciate Kafka.
That's, I didn't know that that paper also launched you in this direction. It launched a few open source distributed databases.
And even [crosstalk 00:11:47] an early Jay Kreps project. If you know, you know: Voldemort.
Yes, exactly. Because you read a paper and they make it look easy. And then two years and a lot of gray hairs later and you're like, no, they cheated. So I think in our papers, we didn't have the goal of making Kafka look easy. And in many ways, we actually had the goal of making Confluent Cloud look as hard as it was in practice because we gained nothing by making it look easy.
So we were fairly transparent, to a degree that made a lot of product managers and marketing people feel uncomfortable, but that was something that was important to us, because again, our target is people who may try to do it themselves, and they have to know that it's not easy at all. So, yeah. So we wrote about the four big areas in our space. We started with elasticity. And again, it sounds easy: you just add more machines, how hard can it possibly be?
It's elastic, right? Yeah, just add brokers.
Yeah. Yeah. So this was actually an excuse to do a really deep dive into our control plane.
And the main feedback I got is that we should do an even deeper dive into our control plane, because not many people know our control plane is Kafka driven. And you've had so many people on your show talk about microservice patterns and architectures. We used a lot of those, and we tried a bunch of those, and some worked and some were more difficult to implement than we expected. So we kind of just give the big architecture and show some of the decisions we made, but yeah, we may need to dive a bit deeper there, because it's almost like Kafka and microservices architectures in practice, on a use case that we all know and love.
Yeah. Wow. Okay. Not, I guess I knew, I knew the control plane was Kafka based and that all makes sense, but there you go. Yeah.
Exactly that. And then the second blog post is around our chase toward performance and scalability. And this is one place where I can say we are at day one, quite literally. Our goal is for Confluent Cloud to be more performant than any other Kafka cloud service-
So we took some steps, and the key insight around it is the reason we can do it. How can you even do it, right? I mean, Apache Kafka is out there. You're using the same Kafka as everyone else. But the key thing that we have is thousands of customers and their workloads to learn from, and we can actually make Kafka faster, tune it, and improve it for the specific workloads that we are seeing in production. And nobody else is seeing as much customer performance data as we do. It's above and beyond, and we obviously have the chops to actually say, oh, actually, nobody knew that enterprise companies use a lot of access control lists.
And it turns out that they do. So how can we make a customer with many access control lists perform better, essentially?
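Gwen doesn't spell out how the ACL speedup works, but one generic way to make authorization cheap for customers with many access control lists is to index the ACLs instead of scanning a flat list on every request. Here is a toy sketch; the class and method names are hypothetical, not Kafka's actual authorizer:

```python
from collections import defaultdict

class AclCache:
    """Toy ACL store: indexes rules by principal so an authorization
    check doesn't scan every ACL on the cluster. Illustrative only."""

    def __init__(self):
        # principal -> set of (operation, resource) pairs
        self._by_principal = defaultdict(set)

    def add(self, principal, operation, resource):
        self._by_principal[principal].add((operation, resource))

    def is_allowed(self, principal, operation, resource):
        # O(1) average lookup instead of O(total number of ACLs)
        return (operation, resource) in self._by_principal[principal]

acls = AclCache()
for i in range(10_000):
    acls.add(f"user-{i}", "READ", f"topic-{i}")

print(acls.is_allowed("user-42", "READ", "topic-42"))   # True
print(acls.is_allowed("user-42", "WRITE", "topic-42"))  # False
```

The point is only that with many tenants and many rules, the data structure behind the check matters as much as the check itself.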
So we're kind of targeting specific types of customers. It's not super general purpose, but we do believe that we can be maybe even 10x faster than any of our competitors for very meaningful, real-world production workloads, not artificial benchmarks. Obviously, as I said, we are not 10x faster now; if we're lucky, on some things we may be 50 percent faster, but it's still pretty cool. So we talk not so much about the numbers as about the process we're using to get there, and how we're using the fact that we have production workloads as a competitive advantage, essentially.
Yeah. Which makes sense, because we do. This is,-
I mean, you use what you've got, right?
"We're really good at running Kafka in the cloud." This is just a slightly more sophisticated way of saying that.
Exactly that. It's a very specific way of framing it, and it's really nice. But I think that's the main thing, right? I mean, we have some competitors. Let's be honest, we will never be as big as AWS, but we can definitely be a lot more focused on things that matter to our customers, because we are so much closer to them on things like Kafka.
Yeah. And it's all we do. So when you're the smaller, more focused business and it's funny, I don't feel like we're a very small business. Compared to AWS, we're very small.
Tiny and super focused, laser-focused on being good at this one thing and taking care of customers.
You can always out innovate-
-if you got that narrow focus.
Yeah. We're trying. That's exactly it. We're trying to out innovate.
Yeah. So you've got elasticity, you've got scalability and performance. What was next?
Yeah. So multi-tenancy-
-is the next one. And I think that's the one place, if you are on a multi-tenant cluster, like the Basic and Standard ones. Excuse me, the ones that are not called Dedicated are in fact multi-tenant.
So if you are on any of those, it may feel good to see the effort we put in to actually guarantee you really amazing isolation, on the security layer, on the data layer, and on the performance layer. And this is literally Ph.D.-level work. You could do multiple research papers on what we did there. And I think Anna Povzner's blog post really illustrates both the challenges and our very unique approach. Unlike anyone else, we are not using containers to drive multi-tenant Kafka, because Kafka is in Java, and containers are not always the best approach. Even if you do like [inaudible 00:18:09] serverless, you know that Java is not the best language.
Yeah. Warm starts and things. You pay startup time. Things that don't get-
Yeah, exactly. So it has to do with just-in-time compilation, and it takes a bunch of memory just for itself. There are ways around it. Like I can see that, hey, if you have like [inaudible 00:18:36] and all those things, but that's proprietary. So we really ended up building multi-tenancy into Kafka, and we share a lot of details on how, and it's all native Apache Kafka capabilities. So that's the cool thing. We could deliver a unique experience without holding back on the technical capabilities that we give the community. And I think that's kind of the whole approach to cloud-native Kafka in general: we are focusing on the experience that customers have, and we use Kafka capabilities to do it.
Right. And building Kafka out of Kafka has always been what Kafka does. So it's nice.
I love it. Yes. Yes. [crosstalk 00:19:34] The other people I recommend this blog post to are people who are on the fence: do I really need multi-tenant, or should I go with Dedicated? Because the cost difference is extreme. It's a hard decision.
Yeah. And it makes sense that it should be because in one case you've got the economics of the cloud and in the other case you've got dedicated, well, won't say dedicated hardware, dedicated instances.
Yeah, exactly. So in one scenario you actually pay for all of the underlying costs, and in the other case we actually get to amortize them. Just like going to a gym, you can imagine the difference between a private gym and a gym membership. So I think maybe reading about how multi-tenant works will make it easier for people to decide that, yes, actually, multi-tenant is secure enough and performant enough for my needs, and I can afford to pay less and get the experience I need.
And again, you said Basic, Standard, and Dedicated. So Basic and Standard are the two multi-tenant cluster types. Basic is really development only; Standard is the production one.
Yeah. [inaudible 00:20:50] is the one you can take to production.
Yeah. Dedicated, well, it's not multi-tenant, and there are other feature differences, right? So you need to look at what's going on in terms of the current feature set-
-of those things.
What you're saying is if you're just thinking, well, it's, multi-tenant, it's shared, that's not secure enough, I better pay for dedicated. Maybe you can get a better value and feel okay about the security by reading Anna's blog post.
Or maybe you'll read it and say, no, this is actually not good enough, I need Dedicated.
Right, right. [crosstalk 00:21:24]
You'll have a deeper understanding of how to make that decision.
And that's Anna Povzner, once and future guest on this podcast. She's been on, I think, twice before, and she's a great author. Just somebody that you should follow in general.
Yeah, yeah. And then the last blog in the series is the storage durability audit. And this is so unique, both in the fact that it exists, because there is no other thing like that for Kafka, and in the fact that we are telling you about it. Because if you think about it, you can imagine again that S3 probably does some things behind the scenes to make sure your data stays uncorrupted forever, but they don't really tell you how they do it. So you don't really know what kind of guarantees you should expect.
I was super impressed by the work our team did. So I basically went over and begged them to contribute it. And here is the thing, when you write data to Kafka, you kind of have some expectation of how long it will stay there.
But nobody actually checks. And you can imagine that mistakes have been made, either operator error or some bugs. It even happened in Confluent Cloud, where we thought that we created a compacted topic, and therefore the data should stay forever, and due to a race condition in a client that we used and hadn't tested super deeply, we actually, yeah, I know, it was embarrassing, we ended up with the seven-day retention default. And this can definitely lead to data loss if you don't catch it in time. And we're not the only ones it happened to. I get a lot of customer support cases. The race condition existed for a while, and people don't know about it. People make mistakes. People expect that if they create a topic and immediately change its configuration, they will end up with the new configuration. Because of the asynchronous nature of both operations, it may not end up exactly the way you expect. You actually need to check that the topic was created first. Lesson learned. But once-
The second admin API call to change the topic is operating on a topic that doesn't exist yet.
There you go.
Mistakes were made.
Mistakes were made.
Yeah, exactly that. Exactly that one. By us, but also by other people. It's a natural mistake to make.
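The create-then-alter race described above comes from both admin operations being asynchronous. Here is a minimal simulation of the pitfall and the fix, using a stand-in admin client rather than a real Kafka cluster; FakeAdminClient and its method names are hypothetical, but real admin clients similarly return futures you should wait on before the next operation:

```python
import threading
import time
from concurrent.futures import Future

class FakeAdminClient:
    """Toy async admin client: topic creation completes on a background
    thread, mimicking Kafka's asynchronous AdminClient behavior."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name, config):
        fut = Future()
        def work():
            time.sleep(0.05)  # creation is not instantaneous
            self.topics[name] = dict(config)
            fut.set_result(name)
        threading.Thread(target=work).start()
        return fut

    def alter_config(self, name, config):
        if name not in self.topics:
            raise RuntimeError(f"unknown topic {name}")  # the race!
        self.topics[name].update(config)

admin = FakeAdminClient()
fut = admin.create_topic("events", {"cleanup.policy": "compact"})
# Waiting on the future before altering avoids the race: without this
# line, alter_config would likely run before the topic exists.
fut.result(timeout=5)
admin.alter_config("events", {"retention.ms": "-1"})
print(admin.topics["events"]["cleanup.policy"])  # compact
```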
I would say, in general, async processing is hard. So we figured out that it's a natural mistake to make. And we also figured out that when you store data and you have expectations, not every point in time is equally vulnerable, right? Say we're checking every 14 days that we're reading valid data. Maybe 14 days is too often, or maybe it's not often enough. Who knows?
The unique insight from the really brilliant Kafka storage team was that specific points in time are more vulnerable. For example, when you change the configuration, you may be more vulnerable.
When we upgrade Kafka, it may be more vulnerable. When you expand or shrink the cluster, it may be slightly more vulnerable, because you start moving stuff around. So there are specific times where it's more important to check than others. So we built this auditing capability into Kafka: when an important event happens, we double-check the storage to make sure that your data is still there as expected.
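As a rough illustration of event-triggered auditing (a toy sketch of the idea, not Confluent Cloud's implementation): record a checksum at write time, then re-verify stored data only when a risky event such as an upgrade or a configuration change occurs.

```python
import hashlib

class AuditedStore:
    """Toy store that checksums writes and re-verifies them only on
    events where corruption is more likely. Event names are invented
    for illustration."""

    RISKY_EVENTS = {"upgrade", "config-change", "partition-move"}

    def __init__(self):
        self._data = {}
        self._checksums = {}

    def write(self, key, payload: bytes):
        self._data[key] = payload
        self._checksums[key] = hashlib.sha256(payload).hexdigest()

    def on_event(self, event):
        if event not in self.RISKY_EVENTS:
            return []  # routine event, no audit needed
        # Re-verify every stored payload against its recorded checksum.
        return [
            k for k, payload in self._data.items()
            if hashlib.sha256(payload).hexdigest() != self._checksums[k]
        ]

store = AuditedStore()
store.write("segment-0", b"hello")
store._data["segment-0"] = b"hell0"  # simulate silent corruption
print(store.on_event("heartbeat"))   # [] since no audit runs
print(store.on_event("upgrade"))     # ['segment-0'] caught by the audit
```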
Okay. I like it. It's like changing lanes or going through an intersection. That's probably when bad things are going to happen so-
Exactly that, exactly that. Imagine, you don't want safety systems in your car to be overly naggy, but you can imagine that at specific points you want them to be very, very proactive and naggy. Like my car has one of those automated braking systems, and it drives me nuts. It's overly sensitive. It has actually almost caused accidents by emergency braking in rush hour in the middle of a busy highway, and like, literally, yeah, a big truck moved to the side or something.
Yeah. And it's like, ah beep beep beep beep.
If it could know that I'm on a busy highway and the last thing I need is for it to emergency brake for me, versus I'm now going through an intersection where it's actually very, very meaningful, then yeah, life would be better. Luckily, we know that.
Now we know that. So that was elasticity, performance and scalability, multi-tenancy, and-
Durability. Yeah. Durability guarantees. 1, 2, 3, 4, and there are blog posts on each one of these. And the work's not done, I assume. Kafka is probably never cloud native enough. What's going on now that you can talk about?
Yeah. So can I give a plug?
We've planned for 2022, and we have very exciting projects. We desperately need more engineers to work on those exciting projects. If you go on LinkedIn or any of the other job sites, there is actually a specific job role for Cloud-Native Kafka Engineer. If any of this sounds like, oh yeah, this is amazing, applying research to hard problems to deliver a next-generation user experience is exactly what I want to do with the next four years of my career, do ping us.
I will make sure a link shows up in the show notes for that. And I can say, Gwen might not say this herself. Gwen is a fantastic person. I would work for her in a heartbeat. So whether it's on her team or not, if it's near her, it's a good idea so.
I feel like Confluent, in general, has a high number of fantastic people that-
Fair enough. Fair enough. Gwen says, deflecting the praise.
I mean, here you are and you have your own team. And they are all amazing so.
A lot of good people here.
Exactly, exactly that. So yeah, definitely worthwhile reaching out. And in terms of the projects, we're going to push everything forward across those axes in kind of its logical direction. So if you think about it, for example, in elasticity, after you can expand and after you can shrink, what is the next level? The next level is that it will happen automatically, that you will not even have to do it. So we're looking into it, and I cannot even promise features, because we're at a stage where we're doing customer research on what is the best way to scale Kafka automatically for you. And obviously, people are concerned about costs and timing and all this. So we're going to really invest in the next level in that direction. If you think about performance, that's easy, right? The next level is we'll make it faster. We'll make it two times faster, maybe 10 times faster.
This is kind of obvious. On multi-tenancy, we are really looking to improve the robustness of our guarantees. Right now we provide a solution, but pretty much everyone is guaranteed a specific slice of the cluster. If the cluster is very busy, you may see higher latency, again, because there are just more tenants around, and you don't always know in advance what to expect there. So we really want to be clear about the expectations. What is the latency you should expect if you have Basic? What is the latency you should expect if you have Standard? And under what conditions, like what workloads, will actually give you this latency? Because, as you know, and that's something I always tell our customers, you can make Kafka arbitrarily slow, and you can make Kafka arbitrarily fast. And a lot of it is on the client side and how you tune your workload. Taking the same workload and changing [inaudible 00:30:40] is kind of the classic situation. You can literally 10x the performance of a workload by changing that.
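The transcript doesn't capture which client setting Gwen means, but batching-related producer configs are classic examples of client-side knobs that can change throughput dramatically. These are real Apache Kafka producer config names; the values below are only illustrative, not recommendations for any particular workload:

```python
# Two hypothetical producer config profiles for the same workload.
# Keys are standard Apache Kafka producer settings; values are examples.
throughput_tuned = {
    "linger.ms": 50,            # wait up to 50 ms to fill larger batches
    "batch.size": 131072,       # 128 KiB batches instead of the 16 KiB default
    "compression.type": "lz4",  # fewer bytes on the wire per record
    "acks": "all",              # durability unchanged; tune batching, not safety
}

latency_tuned = {
    "linger.ms": 0,             # send each record as soon as possible
    "batch.size": 16384,        # default batch size
    "compression.type": "none",
    "acks": "all",
}

print(throughput_tuned["linger.ms"], latency_tuned["linger.ms"])  # 50 0
```

Larger batches plus compression trade a little per-record latency for much higher throughput, which is the kind of client-side change that can multiply a workload's performance without touching the cluster.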
So we want to give customers visibility around the performance they get, what got them to where they are, and what they could do, if they want to, to make it better. And we want to do it on Dedicated and on multi-tenant as well. So it's kind of the very next step, but we really see customers looking for those performance guarantees in a lot of situations. And then on the storage side, you can imagine that if we detect that something went wrong, obviously it makes sense that we will restore it automatically. And I believe that this is something we already do for you, but it means that we have backups and a way to restore your data automatically. But maybe you don't want us to detect that something is wrong. Maybe something is logically wrong for you. You did something wrong, an engineer deployed something and rolled out a bunch of crap. And if Confluent has the ability to restore, why-
You never know.
Yeah, 2022 is going to be a very exciting year for us.
It really sounds like a lot of good work is planned. My guest today has been Gwen Shapira. Gwen, thanks for being a part of Streaming Audio.
Thank you, Tim. Always a pleasure.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
What does cloud native mean, and what are some design considerations when implementing cloud-native data services? Gwen Shapira (Apache Kafka® Committer and Principal Engineer II, Confluent) addresses these questions in today’s episode. She shares her learnings by discussing a series of technical papers published by her team, which explains what they’ve done to expand Kafka’s cloud-native capabilities on Confluent Cloud.
Gwen leads the Cloud-Native Kafka team, which focuses on developing new features to evolve Kafka to its next stage as a fully managed cloud data platform. Turning Kafka into a self-service platform is not entirely straightforward. However, Kafka’s early-day investment in elasticity, scalability, and multi-tenancy to run at company-wide scale served as the North Star in taking Kafka to its next stage: a fully managed cloud service where users just send us their workloads and everything else magically works. By examining modern cloud-native data services, such as Aurora, Amazon S3, Snowflake, Amazon DynamoDB, and BigQuery, there are seven capabilities that you can expect to see in modern cloud data systems.
Building around these key requirements, a fully managed Kafka as a service provides an enhanced user experience that is scalable and flexible with reduced infrastructure management costs. Based on their experience building cloud-native Kafka, Gwen and her team published a four-part thesis that shares insights on user expectations for modern cloud data services as well as technical implementation considerations to help you develop your own cloud-native data system.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question, and we'll hope to answer it on the next episode of Ask Confluent.