Anna Povzener is an engineer who's a return guest to the show and she and her colleague, Anastasia Vela, both have been working on multi-tenancy in Kafka as a part of Confluent Cloud. It's part of our ongoing series on things we do to Kafka to make Confluent Cloud work. Anastasia also started as an intern here at Confluent and worked on some pretty important work that made it to production, so she talks a little bit about that experience too, which is super cool.
Streaming Audio is brought to you, as always, by Confluent Developer. That's developer.confluent.io, the website that's got pretty much everything you need, in my opinion, to get started learning Kafka and Confluent. If you do any of the labs or the examples that use Confluent Cloud, you want to sign up for Confluent Cloud, obviously, and you want to use the discount code PODCAST100 to get an extra $100 of free usage credit. Now, let's get to Anna and Anastasia.
Hello, and welcome to another episode of Streaming Audio. I'm your host, Tim Berglund, and I'm joined in the studio today by Anastasia Vela and Anna Povzener. Anastasia and Anna, welcome to the show.
Thank you. Thank you very much.
Now, you two are both engineers. And Anastasia, you have the kind of cool experience of having been an intern here and now being a full-time engineer. And Anna, you are a several-time repeat guest on the show. But just in case this is somebody's first episode and they somehow don't know who you are, I would love for both of you to just tell us a little bit about yourself. And Anastasia, if you want to even go into a little detail on your project and the experience of being an intern and converting to full-time and everything, you go first.
For sure, yeah. So, my name's Anastasia. I am a software engineer on the cloud-native Kafka team. Kafka Cloud Fundamentals is the subteam. And I started working with multi-tenant systems when I first interned back in the summer of 2019. And this was when I contributed to a really critical multi-tenant feature, which is automating tenant quotas, which would help provide additional broker protection. So since then, I've completed my degree and I returned as a full-time employee in August 2020 and I'm continuing working on creating awesome features to make Kafka cloud-native.
I love it. And Anna, returning guest. In fact, the most recent episode that I think we did was about multi-tenancy. But tell us about yourself just in case anybody's new to the show.
Yes, thank you for inviting me again. So yeah, I'm Anna. I've been in Kafka now actually six years, slightly more and I work on multi-tenancy for about three years now in Confluent. I did something that actually... I have about 10 years of experience in multi-tenancy before I did my Ph.D. around the topic. At that time, it was not a system like Kafka. It was mostly on disk performance, on storage performance, but it's something that I worked on for a long time. And then in Confluent, when we go into cloud and multi-tenancy is one of our kind of fundamental features in the cloud. I said I work in multi-tenancy. I think last time we mostly talked about multi-tenancy in Apache Kafka. Little bit.
Right. And this time, right? So, this is multi-tenancy mostly in cloud. I worked on it. I was a tech lead on it, and then because that kind of multi-tenancy becomes very important and we have few multi-tenant tiers that we kind of have available in Confluent Cloud. Now, we have the whole team that is owning the multi-tenancy, which is under Cloud-Native Kafka, which we call Kafka Cloud Fundamentals. And the reason for the Fundamentals is because it's basically fundamental to our offering. So, I think that's about me and basically... Yeah, and I think this time I was hoping to talk more about the actual cloud part and a little bit, of course, to compare it with Kafka multi-tenancy.
To start with though, tell us about Kafka's multi-tenancy. Out of the box, what is there? I mean, the difference between what you need for a cloud service and what AK has is big. But what's there to start with?
Yeah. So in Apache Kafka, so it does have multi-tenant capabilities, but the use case, I would say, is kind of different. On Apache Kafka, the reason it was built was mostly in the context of an on-prem data center. And so imagine, I would say very common use case would be in your company, you have one Kafka cluster, which serves all the workloads. And so the idea, you can share the data across the company, and so that different teams can collaborate. But obviously, because those are different teams, in a sense, those are different tenants, right? You can have different applications. Some of them are critical, some of them less.
And so to be able to actually host that, right? Even though you might share the data or you might share the cluster, you need to be able to isolate data in some cases, right? Because some data could be more sensitive, so you need to be able to control who can access the cluster. I mean, you don't want some, for example, randomly somebody just starts running an application, they are maybe not very well configured, they just take over. Lots of bad things can happen. So you need control on isolating data, on isolating your performance. You can actually also isolate user namespaces so that you don't have conflicts on topic names. So, that's what Kafka provides. And you basically would need to configure, set up your cluster and configure, for example, quotas. This is the way we control performance. So, you just configure and make sure that any application does not use more resources. And so that's kind of how you do it. You kind of set it up, and then you know what your workloads are, you configure your quota, you configure your isolation, and basically, here it is. You run the Apache Kafka cluster.
Yeah, those are all Apache Kafka. That's the unmodified, not Confluent Cloud. So to get to the cloud, and I guess you've kind of been talking through some of this, but just kind of name off for us, what are the things that have to change to make... Because I know those Kafka multi-tenancy things you listed off, they're there, but they're not everything you'd want. And if we've got now shared clusters in a Basic or Standard cluster in Confluent Cloud, there's a tremendous amount of sharing going on. So, what other things have to have happened? And by the way, these questions can go to either one of you. You guys tell me.
Yeah. Yeah, let me start because that kind of would lead to, I guess, Anastasia, more detail on the project. But actually, I think the first thing is actually use case is different. If we think about it, right? It's not that here, you sharing. You're kind of really thinking of sharing your application and it's in the company, you know them, you know your workloads most often.
You can go and ask someone. In [crosstalk 00:07:49]-
And ACLs could be the most. Like, "This team's not allowed to read this data," or something like that. That's [crosstalk 00:07:54].
Yeah, exactly. So it's like you kind of still have this, but you can also go to someone and ask-
... "Oh, how much data do you expect also for performance [crosstalk 00:08:05]?"
Your friends, maybe there are boundaries because boundaries are healthy, but you're still friends.
Exactly. But in the cloud, obviously in on-prem, the whole idea, okay, you have this one cluster, it's very easy to manage, right? It's obviously in the cloud the same thing, but the main thing that why you want multi-tenancy there is for cost efficiency, right? Because imagine when you're hosting service, it could be anything. Like Kafka, could be a database, right? Often when you buy service, you might have... It's very common to have pretty small workloads, right? You can maybe run a few megabytes per second, but you still, as a cloud, as a service, you do expect availability. And so in that case, we would advise you... Or we as a service, right?
We will still configure three replicas, meaning that you might have... It's a three-broker cluster at least. And so basically you kind of still have a certain size of the cluster for availability, which is not very much utilized. And so then, in that case, you actually pay... The whole point why you have multi-tenancy is that you kind of can do the opposite, right? All the small workloads can just go in the one big cluster and you utilize it very well and you have a cost-efficient service, obviously, it's cost efficient to host, and then you share that cost efficiency by providing a cheaper service to customers.
Yeah. But what do you think about, I think the main thing about this, right? So basically, now you think of cost efficiency. When you come back to that on-prem design of multi-tenant kind of setup, right? To set all this... Especially to set all these quotas, right? You actually need to think about capacity. It's kind of manual. Obviously, when you have just a normal one workload, you size the cluster. When you have multiple tenants, you still need to size your cluster. You need to think, "Okay, I have some critical workloads. I want to make sure they have capacity." You reserve it... Reserving means you make sure that there's capacity, then you maybe reserve some capacity for less important workloads, but then set all these quotas, limits on how much resources you can use.
And so all that, you basically, if you see, you need to make those decisions. How much capacity. And usually, you kind of can guess. You can either over-allocate that capacity or ask the other team, "How much do you really need?" and set it. But as you can see, cloud is very different. You're not running... Our customers of the hosted service, they don't tell us what it is. Actually opposite. They just want to give us a credit card: "I just want to get my bandwidth and I don't want to worry about anything else." And then we don't-
Or my bandwidth, or not use it all, or use it all, or you want to be free to just have your workload run. And the whole idea is you're paying somebody else to care about that. So, the customer doesn't want to think about it.
Exactly. And [inaudible 00:11:12] them to ask, "Oh, can you please tell us how much workload?" And it's not scalable at all. So basically as you see, it's actually very hard. Take the Apache Kafka and then like, "Okay, let's put it in the cloud." But we still did it in a way that we started, right? We wanted kind of deliver that feature, which is pretty complex. So we had to start with the mechanism that Kafka has, but that mechanism means that what we did actually very, very early on, and that was maybe first months when cloud was available, if anybody remembers, we ended up kind of addressing this by setting a very small limit that customer can get on bandwidths, which is five megabytes per second.
And we're like, "Okay." And we had still very few customers, so you cannot kind of reserve enough headroom and kind of see statistically how many of them can spike at the same time to the full capacity. So there are a few of those things, but obviously, it's kind of very fragile in terms of if you're getting more customers, those assumptions [inaudible 00:12:18] you need to, again, reserve more, which is going against your capacity or your cost efficiency. Or you potentially actually, right? If you are over that capacity, you can actually have an unhealthy cluster and availability loss, which is what we also don't want to provide.
And then that's what led us to actually one of the first problems we thought of, which interestingly was pretty early. Was like, "Okay, how do we..." We know that we can use the capabilities of the cloud, that you can always get more computing power, more storage when you need it. But obviously, it's about data, right? Even if you expand your cluster, you need to move data. It still takes time, right? It's not instantaneous because any data movement, it's a physical resource. And so obviously, it takes time, so you need to kind of solve... One of the first things we wanted to solve was, what do you do in those temporary times where there's not enough capacity, but everybody wants to use more?
Okay, because you would have... That's an outage.
Yes, for the customer. It could be maybe very degraded performance, which is also could be an outage because you don't know how application can [crosstalk 00:13:34]-
One person's degradation is another person's outage, right?
Exactly. And so that's how we come to the first, which is very interestingly... That was a project that when Anastasia joined, we gave her a problem, which was, what do we do? There's pretty much a problem. You do not have enough capacity. How do you address it? [crosstalk 00:13:57]-
So, back in the halcyon days of 2019, Anastasia, what was that problem statement? If you remember.
Yeah, so it was just we oversubscribe our brokers because we want to be as cost efficient as possible, but there's a problem because we don't have enough capacity for our tenants to use when they do want to use up to the max. So, an obvious solution to this is just to decrease everyone's quotas and what they can use, so that it's equal among all tenants. And we call this a fair limit, but determining what this fair limit is, is really difficult to do actually because, how do we define fair? It's super subjective. So in an ideal world, I guess fair would be a tenant can use however much they want to, right? But we can't do that obviously.
Because the whole thing is you're bumping your head. You're [crosstalk 00:14:58].
Yeah. Yeah, yeah, yeah.
So, an alternative was also to reduce everyone's quotas proportionally to their original quota. So, this means that a tenant who was originally allowed to use more than another tenant can end up getting more allotted to them in the end, so that they both don't see the same quota in the end. And yeah.
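The two options Anastasia describes can be sketched in a few lines. This is only an illustration of the idea as stated in the conversation, not Confluent's implementation: the function names, the tenant names, and the abstract "units" of capacity are all assumptions made for the example.

```python
# Illustrative sketch only: names and numbers are hypothetical, and this is
# not Confluent's autotuning code. Quotas and capacity share one abstract unit.

def equal_limit(quotas, capacity):
    """Option 1: cap every tenant at the same limit (capacity / tenant count)."""
    limit = capacity / len(quotas)
    return {tenant: min(quota, limit) for tenant, quota in quotas.items()}


def proportional_limit(quotas, capacity):
    """Option 2: scale every quota by the same factor so the sum fits capacity."""
    total = sum(quotas.values())
    if total <= capacity:
        # No overload, so nobody needs to be reduced.
        return dict(quotas)
    scale = capacity / total
    return {tenant: quota * scale for tenant, quota in quotas.items()}


# An oversubscribed broker: quotas add up to 200 units, capacity is 100.
original = {"tenant_a": 100.0, "tenant_b": 50.0, "tenant_c": 50.0}

print(equal_limit(original, 100.0))         # every tenant gets the same cap
print(proportional_limit(original, 100.0))  # tenant_a keeps twice tenant_b's share
```

With the equal limit, a tenant that originally had a larger quota loses the most. With proportional scaling, the ratio between tenants' quotas is preserved, which matches the point that a tenant originally allowed more ends up with more.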
Gotcha. So, did that go to production? I mean, tell us about the building out of the thing and how the summer went. And you're an intern back then.
Yeah, yeah, yeah.
This seems like a big problem for an intern.
Yeah. So there was no solution defined, which was what I really liked about the problem. I was just given a problem and I had to determine what solution we could do. And so in the end, we did end up going to production. And initially, we had targeted the problem towards request quota because we see more overload on CPU utilization rather than on bandwidth. So, we focused it more on making sure that the CPU wasn't overutilized.
Okay. And request quota is a good proxy for CPU load.
Yes. Yeah, definitely.
Okay. I guess that makes [crosstalk 00:16:29].
So afterward, I returned and continued more on this. Or I made another improvement basically.
Oh, nice. This is after you completed your senior year and went back-
Yeah, yeah. This is after my senior year. I came back as a full-time employee and I had another improvement. So the improvement was to also autotune on bandwidth. So as I said earlier, it was only on the request quota. So what we decided to do is also autotune on bandwidth, where overload is much rarer, but it still does happen. And rather than manually decreasing the quotas, we would autotune the quotas according to this algorithm that I described.
Nice. Okay. So, what was next after that? Is that all there is to multi-tenancy and cloud? And I want to back up a step and I like to restate things to make sure I've got them in my head.
Yep, for sure.
There are three cluster types in Confluent Cloud. There's Basic, Standard, and Dedicated. Dedicated, I think multi-tenancy is not a thing. There are dedicated instances and that's the whole theory of them being dedicated. Basic and Standard, they're cheaper to operate because the unit costs are lower because they're multi-tenant. So, that's the big multi-tenancy problem. And your topics are sharing a Kafka cluster with other people's topics, and your view in Confluent Cloud and the UI of there being clusters is an abstraction that we're putting on top of things. And so you're trying to make sure that everybody does the best job meeting the performance contract when you're bumping your head up against a load when the statistics of things just get a little upside down and there you go. Is that all there is to it? So, what else is there to cloud-native multi-tenancy?
And Anastasia, if I'm glossing over something in the original problem, just correct me. But yeah.
Yeah, yeah, yeah. Yeah, so there is actually another project which is similar to autotuning that we are working on. So, autotuning was strictly for a broker. Broker-wide. But what we want to do now is to make sure cluster-wide resources aren't overutilized.
So, we have cluster limits to make sure that clusters don't... Or so that tenants don't use more than what's available in the cluster.
Yeah, maybe I can a little bit... Yeah, I just maybe kind of step back for a second because I think what we kind of didn't mention, but it's in our blog post, there is much more, is that maybe it's actually good... Let me just come back for a second to the cloud-native part because I think there is one piece we didn't mention, which would actually help then to come back to Anastasia's current project, right? So far, we kind of touched on only this one problem: when you need more capacity and while you are expanding and getting more capacity, you're fairly making sure that if you somehow have a little bit of degradation, all the tenants kind of fairly get a very small part of it.
You kind of amortize the degradation, and that's the whole autotuning thing: you kind of temporarily reduce quotas, but because of so many tenants, right? It's kind of a small impact on each until you expand. And that's why that specific project that Anastasia did was pretty critical because we actually use the same mechanism for bandwidth. Like, protecting from bandwidth overload, from CPU overload, and then later for connections because we kind of need to also limit connections. So that's kind of one thing, but there is much more, and I would refer to the blog post. But one of the important things about cloud-native, which I want to bring up is that [inaudible 00:20:48] kind of mentioned three things already, but not directly, right? One is, yes, you have this expectation that you can instantaneously get your resources.
So, [crosstalk 00:20:58] instantaneously and you only pay for what you're using. So, the whole pay-as-you-go model. So that's what we need, kind of this idea that as soon as you need more, we give more. Then you have obviously, as I mentioned, right? Because I was talking about availability, expectation on a good performance. Those are your SLA, SLO, right? When you buy a service, you expect a certain quality of service. And then of course cost efficiency, the whole point of multi-tenancy. Otherwise, why do we do it? But the final thing is that abstraction, right? When somebody buys a service... In Apache Kafka, the way you set all the quotas and manage it, you actually set it up per broker. So you know your cluster size, you decide per each broker how much quota, then you add it up, when you expand or shrink it, you start calculating. It depends. What you already know how to do, you will do it.
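The per-broker bookkeeping Anna describes for self-managed Apache Kafka comes down to simple arithmetic. The sketch below is a hypothetical illustration; the function names and numbers are made up, and it assumes a client's traffic is spread evenly across all brokers.

```python
# Hypothetical illustration of per-broker quota arithmetic. Assumes the
# client's load is spread evenly over every broker in the cluster.

def per_broker_quota(cluster_target_mbps, num_brokers):
    """Quota to configure on each broker to hit a cluster-wide target."""
    return cluster_target_mbps / num_brokers


def cluster_ceiling(per_broker_mbps, num_brokers):
    """What a client can reach in aggregate with a given per-broker quota."""
    return per_broker_mbps * num_brokers


# Aim for 60 MB/s in total on a 3-broker cluster.
quota = per_broker_quota(60.0, 3)
print(quota)                      # 20.0 MB/s per broker

# Expand to 6 brokers without touching the quota: the ceiling silently doubles.
print(cluster_ceiling(quota, 6))  # 120.0 MB/s

# To keep the original 60 MB/s target, the quota has to be recalculated.
print(per_broker_quota(60.0, 6))  # 10.0 MB/s per broker
```

This is the recalculation on every expand or shrink that a cloud service wants to hide from its customers.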
But on the cloud, it's not something that you want to expose, and especially... Right? This is a cluster you can potentially share with some other... You don't really care and you really want to see that it's... As an experience, you want to have with your own. You don't really want any experience that, "Oh, I know about all these details." So, we don't want to have any brokers exposed or any of those details because it becomes just operationally probably as hard to manage as [crosstalk 00:22:23].
That's the whole abstraction of Confluent Cloud, is this is supposed to be serverless Kafka. You're not supposed to [crosstalk 00:22:27].
Yeah, exactly. And so when you buy something, in reality, when you buy, you think in your application terms. You think, "Okay, what I care, it's Kafka application is I get a bandwidth and I get a latency and I get some storage, some number of partitions, availability, just kind of application side." But in the cloud, that abstraction is actually not that trivial because we give you bandwidth. In the back, we have to think about the CPU because you need to have enough capacity. So we kind of manage it for you in the back, how much resources you are using and so forth. So, this is all in the blog post. And the final, big piece is that because we had to expose... We don't want to expose brokers, right? So the way we exposed bandwidth limits is because when you buy Confluent Cloud, you know what kind of limit you get on those non-dedicated clusters.
Right now you get 100 megabytes per second separately for produce and for consume, and those are cluster-wide. That's why we kind of built another layer, but it's in the broker. It's actually embedded in the broker, so the Kafka cluster has this knowledge that we set quotas per cluster and then brokers decide how to slice it, how much quota they actually enforce, so that aggregate [crosstalk 00:23:54]-
The brokers have multi-tenancy metadata.
Yeah, exactly. And so all that logic, we have the pipeline, how to get this metadata, algorithms, how to allocate those quotas. And originally, initially, again, we always started with something very simple, especially when we have small clusters, fewer tenants, we just have a very, very simple kind of distribution of quota, which kind of leads to what I think I... Now, I'm coming back to what Anastasia was describing with a new project because it's basically... Finally, we are doing something smarter. Because once we scale, now, we need to be more careful how we kind of divide that quota between brokers. So, I'll give it back to Anastasia.
Yeah. Yeah, so we're trying to be smarter in how we allocate bandwidth across for the cluster-wide [inaudible 00:24:45]. So currently it's just 100 megabytes per second for multi-tenant clusters, and it's just evenly distributed across all brokers. But this becomes an issue as we increase the number of tenants for this multi-tenant cluster because it is equally distributed.
Got it, got it. So, for very large... For whatever that means in the way that Confluent Cloud allocates tenants to clusters, and it's some number N, and that gets up there, that gets more difficult. Okay.
Yeah. Yeah, so it works well with smaller, medium clusters obviously, but definitely not larger ones.
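A small sketch can show why an even split may stop working as the cluster grows. The conversation does not spell out the exact mechanism, so the explanation here is an assumption: a tenant can only use the slices on brokers that actually serve its traffic. The function names and the broker counts are hypothetical; only the 100 MB/s cluster-wide figure comes from the discussion.

```python
# Hypothetical sketch. The 100 MB/s cluster-wide limit is from the
# conversation; everything else (names, broker counts, the assumption that a
# tenant only touches some brokers) is illustrative.

CLUSTER_QUOTA_MBPS = 100.0


def even_split(cluster_quota_mbps, num_brokers):
    """Each broker enforces an equal slice of the tenant's cluster-wide quota."""
    return cluster_quota_mbps / num_brokers


def achievable_bandwidth(cluster_quota_mbps, num_brokers, brokers_serving_tenant):
    """Upper bound for a tenant whose traffic only lands on some brokers."""
    slice_mbps = even_split(cluster_quota_mbps, num_brokers)
    return slice_mbps * brokers_serving_tenant


# Small cluster: the tenant's partitions touch all 4 brokers.
print(achievable_bandwidth(CLUSTER_QUOTA_MBPS, 4, 4))   # 100.0

# Large cluster: the same tenant only touches 5 of 50 brokers.
print(achievable_bandwidth(CLUSTER_QUOTA_MBPS, 50, 5))  # 10.0
```

Under this assumption, the tenant on the large cluster is throttled far below the bandwidth it was promised, which is one reason a smarter allocation would weight each broker's slice by where the tenant's load actually is.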
Got it. And you're trying to make it be smarter. Well, I think I can say you'll find in your career as an engineer, it is a process of building something that's not smart because it's the best we can do, and then making it a tiny bit smarter and that's just kind of how the software works.
Awesome. So you two, thank you for this. Briefly, we're up against time, but what... And Anastasia, you were kind of mapping out the future. But what's the other kind of thing you're pushing into in multi-tenancy? Where do you see the future going? And Anna, that can be you if you'd like.
Yeah. So, we definitely want to build... We are kind of right now with this project, we are focusing a lot to make sure that we deliver everything we promised with a bigger scale of basically clusters and tenants, right? So basically, we were able to do pretty simple approaches quickly, make this happen. Cluster-wide bandwidth, isolation with kind of a smaller size, a smaller number of tenants. But once we scale, those basic approaches become a little bit too simple. I mean, they do not maybe work as well because at scale, for example, in that case, why it's important to distribute this quota better with higher scale is because you want to be able to actually provide the capacity we promise or provide the bandwidth we promise.
So, we're really focusing a lot on kind of keeping up our promises while we are scaling and getting more customers so it still works, and it also includes providing more visibility and kind of more monitoring to that and how we operationalize it because so far, sometimes when we... Example that Anastasia was describing with autotune bandwidth, right? Even though it was rare cases, right? We didn't implement it, but if we have monitors, if somehow there is some overload, we get a notification, and then on-call will go and mitigate it. So all that, we are trying to basically make sure that we reduce on-call time. So, we operationalize it. How do we actually... One hard question, for example, right? Because you have so many tenants, we have thousands of tenants on a cluster.
So if any of the tenants have performance issues, how do you actually detect them? How do you go and observe what's happening and how do you resolve it? That's a very hard problem to solve because there are so many. And how do you know, right? Because there is so much else happening. So kind of resolving that, and then we are basically looking a lot... Even with scaling, right? We need to make sure we still, as we scale and having so many tenants, it's [inaudible 00:28:21] more important. We need to be more, I don't know, explicit how we manage availability, right? So if anything happens, because there could be bugs, right? You want to reduce the blast radius, make sure that nothing... We still maintain our nine... I think we have 99... Was it four nines for Standard? So, it's pretty high availability. So once you scale, you need to actually have much better software to handle that kind of... Provide this availability. So, that's pretty much all that we're solving. So, lots of hard problems to solve for us.
My guests today have been Anna Povzener and Anastasia Vela. Anna and Anastasia, thanks for being a part of Streaming Audio.
Thanks for having us.
Thank you, Tim. Was really fun. Hopefully, we do more, so we can have another broadcast. Thank you very much.
Yeah, thank you.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
In an effort to make Apache Kafka® cloud native, Anna Povzener (Principal Engineer, Confluent) and Anastasia Vela (Software Engineer I, Confluent) have been working to expand multi-tenancy to cloud-native systems with automated capacity planning and scaling in Confluent Cloud. They explain how cloud-native data systems are different from legacy databases and share the technical requirements needed to create multi-tenancy for managed Kafka as a service.
As a distributed system, Kafka is designed to support multi-tenant systems by:
Traditionally, Kafka’s multi-tenant capabilities are used in on-premises data centers to make data available and accessible across the company—a single company would run a multi-tenant Kafka cluster with all its workloads to stream data across organizations.
Some processes behind setting up multi-tenant Kafka clusters are manual with the requirement to over-provision resources and capacity in order to protect the cluster from unplanned demand increases. When Kafka is on cloud instances, you have the ability to scale cloud resources on the fly for any unplanned workloads to meet expectations instantaneously.
To shift multi-tenancy to the cloud, Anna and Anastasia identify the following as essential for the architectural design:
You can also read more about the shift from on-premises to cloud-native, multi-tenant services in Anna and Anastasia’s publication on the Confluent blog.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.