June 30, 2022 | Episode 222

Automating Multi-Cloud Apache Kafka Cluster Rollouts


Kris Jenkins: (00:00)

How do you deploy software? That's easy, right? You log onto the machine, you bring the service down, you FTP in a new binary and you bring it all back up. Okay, maybe that would work if there were just one developer working on one machine, but what happens when you're dealing with a cluster of machines? Then you're probably going to automate it away these days. You're probably going to use something like Kubernetes, maybe Nomad, maybe NixOps, depending on what spiciness level you're at.

Kris Jenkins: (00:30)

But what happens when you get beyond the scale of those kinds of tools, when you're managing a whole host of services, some of which are at different versions across not just one cluster, but hundreds of them? That's when you get into a whole new level of problems and in our continuing informal series of things that I would like to understand, but I want someone else to worry about.

Kris Jenkins: (00:53)

We're joined today by Rashmi Prabhu, who leads a team at Confluent, dealing with exactly these kinds of super-scale deployment problems. The kinds of problems that maybe a decade ago, only Google and AWS were thinking about. But now they're becoming more and more part of our life in an online cloud-providing world. Before we get into the gory details of that, I'll remind you that if you don't want to deal with managing your own Apache Kafka clusters, then you should check out Confluent Cloud, where you can easily spin up a development instance or a cluster fit for a major enterprise.

Kris Jenkins: (01:28)

Sign up with a code PODCAST100, and we'll give you $100 of extra free credit. And once you've stopped worrying about managing your own Kafka, check out, which will teach you everything we know about how to use Kafka and how to get the best from it. And with that, I'm your host, Kris Jenkins, this is Streaming Audio. Let's get into it.

Kris Jenkins: (01:56)

My guest today is Rashmi Prabhu. Rashmi, welcome to the show.

Rashmi Prabhu: (01:59)

Hey, thanks. Nice to meet you.

Kris Jenkins: (02:02)

Good to have you here. You've got a background in interesting places like Yahoo, and I know you worked at Box, which I don't know much about. I'm sure you'll fill me in offline, but right now you are head of fleet management for Confluent, which I confess, I don't know what fleet management really is. So that's what we're going to talk about. How long have you been at Confluent?

Rashmi Prabhu: (02:28)

I've been in Confluent close to three years now.

Kris Jenkins: (02:30)

Three years. Okay. So you really know your way around?

Rashmi Prabhu: (02:34)

It looks like it.

Kris Jenkins: (02:37)

Cool. Just at a high level, tell us what fleet management is, for those who don't know?

Rashmi Prabhu: (02:43)

Yeah, yeah, sure. Fleet management, when I first started trying to research this whole topic, there was not a whole lot on the internet. It was like, what is fleet management? It popped up cars and it popped up something about transport and all of that. And I wasn't quite sure whether it was just the terminology or why that wasn't really coming up. And I think for us, fleet management basically means how we run our Confluent software in the cloud. And it's about being able to run all the different applications, services, infrastructure components, the products themselves like Kafka, ksqlDB, Connect, Schema Registry, all of them at scale on the cloud for our customers in a very safe and reliable manner. That is our primary mission at Confluent.

Kris Jenkins: (03:45)

Is the analogy there, then you're treating these different software products as a fleet of cars that you are managing out in the field?

Rashmi Prabhu: (03:55)

Yeah, that's how I would draw parallels.

Kris Jenkins: (04:02)

That kind of scale, you're going to know all the horrific things that you need to deal with out in that world. Let's start from my point of ignorance and you can teach me what I'm missing here. You've got these different pieces of software, you've got lots of machines. I would think you just sign off a release and you roll them out and that's it till the next release date. What's the first thing in that idea that goes wrong?

Rashmi Prabhu: (04:32)

In an ideal world, that's exactly how it should be, right? Just like how you described it, you finish signing off and then you ship it. You're done, and then you go off working on the next one and you know that things will be handled magically behind the scenes. Except that's not quite how it works today and that's where we want to get to with the stuff that we're building. I think taking a step back into why is this so different and why is this interesting? We can start with what is so unique about Confluent Cloud?

Rashmi Prabhu: (05:06)

I think the unique thing about Confluent Cloud is the way we have structured the entire landscape of our products and services. We have what is called the data plane and then we have the control plane. The data plane is the one which basically is used to deploy all the Confluent products. We have thousands of instances of Kafka clusters. We have similarly ksqlDB clusters, Connect, Schema Registry, all of these deployed in hundreds and thousands of instances across all three cloud providers across various regions. And we have our customers, our Confluent customers, directly interacting with them to egress/ingress data.

Rashmi Prabhu: (05:51)

Now that is one part of it. And in order to make all of this even possible, right? We use Kubernetes as our layer of abstraction, but then to even allow all of this to run in Confluent, we have what we call as some of the satellite services, which are again, deployed in thousands of Kubernetes clusters, where all of these Kafkas and all of these products are deployed. And so we are basically looking at a large scale of instances and applications that we run.

Rashmi Prabhu: (06:30)

And here you see that running this manually or running this through script is not a safe approach. It's not a sane approach anymore. You need some safety mechanism, some protocols to be able to manage all of these in a better manner, right? Having mature operations is very core to any technology company. And I think with fleet management as a platform, the systems that we are building are going to aid just that. And so basically the systems that we are building are going to provide better mechanisms to roll out things in an automated fashion across this fleet.

Rashmi Prabhu: (07:12)

And to be able to have safe deployments, to be able to know what is going wrong and proactively figure out what might happen and give a chance to the human operator to go and figure things out and fix things so that the external customer is not being affected. There is like always this focus on how our external customers are going to be able to continue running their operations using Confluent hosted instances, and how do we aid that and how do we make that happen?

Kris Jenkins: (07:50)

To summarize that then maybe you've got actually getting the job done at scale, monitoring that you did the job, and making sure you do the job transparently?

Rashmi Prabhu: (08:03)


Kris Jenkins: (08:03)

Maybe we should break those three down individually. Actually doing the job, there'd be some people thinking, is that not just a Kubernetes command once you've updated the configuration? Why is it harder than that?

Rashmi Prabhu: (08:18)

I think the whole operation concept can be broken down into select, apply and monitor and then rinse and repeat, right? Basically when you have thousands of instances and you're looking at 2X, 3X growth for Confluent customers, which means that we are possibly going to see a 10X growth in just the cloud footprint. To be able to target the right instances or right parts of the fleet becomes critical. We cannot just go and start to upgrade something that is belonging to the customer. And then the customer goes down and that's not the situation we want to be in. Essentially be able to select the right targets in the right order and then apply that operation. Sorry.

Kris Jenkins: (09:13)

Is this targeting a certain kind of service or targeting certain kinds of customer you're talking about?

Rashmi Prabhu: (09:20)

It could be either. The selection process is pretty involved. Today, some of it is done manually, but we want to automate all of that. But essentially in a futuristic world, we are basically wanting to have a way to basically describe that fleet and say, "These clusters are good for Canary," right? These are some of our Confluent internal clusters.

Kris Jenkins: (09:48)

Good for Canary?

Rashmi Prabhu: (09:50)

Yeah, Canary them, to allow the initial-

Kris Jenkins: (09:55)

These are customers we can shove down the mine to see if they-

Rashmi Prabhu: (09:58)

These will not be the external customers, these will be internal. Oh yes, no external ones. Be able to target some of our internal test clusters, maybe sales is something, we are using something internally and then go and target maybe customers who have their Kafka or other Confluent product clusters running in their non-prod environments. We are not really touching anything that is in their production. And then maybe there's a metrics based thing. Oh, if the CPU of this is running high, then maybe go target those first and fix those. That could be one kind of selection strategy.

Rashmi Prabhu: (10:42)

You'll see it's a huge combination of things, where it could be cloud-based, or it could be region-based, it could be customer-based, it could be metrics-based. In order to describe all of this in a coherent manner, in a way that can be handled by a human operator and not have them run around looking up databases and metrics and creating Excel sheets and manually doing all of this, we obviously want to get away from that world and basically provide that developer experience during rollouts of any software.
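The selection criteria described above (internal canaries first, then customer non-prod, with cloud, region, and metrics as further inputs) can be sketched in Python. This is only an illustration under stated assumptions, not Confluent's implementation: the `Cluster` fields, the wave numbering, and the "hottest CPU first" tie-break are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # All fields are hypothetical; a real fleet inventory would differ.
    cluster_id: str
    cloud: str           # e.g. "aws", "gcp", "azure"
    region: str
    internal: bool       # Confluent-internal cluster (canary candidate)
    production: bool     # customer production environment
    cpu_percent: float   # latest CPU utilization metric


def rollout_wave(cluster: Cluster) -> int:
    """Lower wave numbers are targeted first."""
    if cluster.internal:
        return 0   # canary on internal clusters
    if not cluster.production:
        return 1   # then customer non-prod environments
    return 2       # customer production goes last


def select_targets(fleet, cloud=None, region=None):
    """Filter by cloud/region, then order by wave and by CPU (highest first)."""
    chosen = [
        c for c in fleet
        if (cloud is None or c.cloud == cloud)
        and (region is None or c.region == region)
    ]
    return sorted(chosen, key=lambda c: (rollout_wave(c), -c.cpu_percent))
```

Keeping the ordering in a single sort key means a new criterion (customer tier, for instance) only changes `rollout_wave` or the key tuple, not the rollout code that consumes the list.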

Kris Jenkins: (11:24)

Sounds almost like you want to create like an SQL front end to figuring out which parts you want to update.

Rashmi Prabhu: (11:31)

Yeah, sort of, kind of, but maybe we also want to take it one step further and say, "You describe it to me in the form of a JSON or use a gRPC endpoint, just write out the strategy and then we'll convert that into SQL queries, or whatever is required, generate that plan for you, and then we'll go and execute it once you approve it." Which means it's like, if you have some predefined strategies, the developers basically feed in the strategy, make sure the plan looks okay and then give a thumbs up. And then off you go on your next project and the system is going to handle everything else.
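A minimal sketch of that "describe the strategy, review the plan, approve, execute" flow might look like the following. The JSON keys (`image_version`, `match`, `batch_size`), the `RolloutPlan` type, and the `apply` callback are assumptions made for illustration; the episode does not describe the real schema or API.

```python
import json
from dataclasses import dataclass


@dataclass
class RolloutPlan:
    image_version: str
    batches: list            # list of lists of cluster ids
    approved: bool = False   # a human flips this after reviewing the plan


def build_plan(strategy_json: str, fleet: list) -> RolloutPlan:
    """Turn a declarative strategy into a concrete, reviewable plan."""
    strategy = json.loads(strategy_json)
    wanted = strategy.get("match", {})
    # Keep clusters whose attributes equal every key/value in "match".
    targets = [
        c["id"] for c in fleet
        if all(c.get(k) == v for k, v in wanted.items())
    ]
    size = strategy["batch_size"]
    batches = [targets[i:i + size] for i in range(0, len(targets), size)]
    return RolloutPlan(strategy["image_version"], batches)


def execute(plan: RolloutPlan, apply) -> None:
    """Run the plan batch by batch, but only once it has been approved."""
    if not plan.approved:
        raise PermissionError("plan must be approved before execution")
    for batch in plan.batches:
        for cluster_id in batch:
            apply(cluster_id, plan.image_version)
```

Separating `build_plan` from `execute` is what gives the operator a chance to inspect the generated batches before anything touches a cluster.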

Kris Jenkins: (12:13)

And that does actually start to seem analogous to a query with a query planner, right? It's the same idea transferred into a completely different world of running machines?

Rashmi Prabhu: (12:26)


Kris Jenkins: (12:26)

Interesting. And then once that selection process has happened, the operate process is Kubernetes. Is it other things as well?

Rashmi Prabhu: (12:36)

Yeah, there are a couple of steps before we reach the Kubernetes side of it. That's where the control plane comes into the picture, and I think we have this way of describing our clusters, as we call them, the Kafka clusters, in the form of a spec. And we need to build up that spec to include maybe the image version that we want to deploy, maybe the feature flags that have been turned on for a particular customer, maybe certain special configurations that are there. We build up this spec and then the control plane sends it down to the data plane. And we use Kafka in this whole thing. We use our own products, of course. And that's what makes all of this happen on a multi-cloud environment. And we push that down to the data plane and essentially the magic happens after that with the operator and the Kubernetes API.
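To make the idea of a cluster spec concrete, here is a small hypothetical sketch of building one and encoding it as a keyed message. The field names and the JSON encoding are assumptions; the episode only says the spec carries things like the image version, feature flags, and special configurations, and that Kafka carries it from the control plane to the data plane. The actual producer call is deliberately left out.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ClusterSpec:
    # Hypothetical shape of a spec; the real one is not described in detail.
    cluster_id: str
    image_version: str
    feature_flags: dict = field(default_factory=dict)
    config_overrides: dict = field(default_factory=dict)


def encode_spec(spec: ClusterSpec):
    """Return (key, value) bytes suitable for a keyed message on a topic."""
    key = spec.cluster_id.encode("utf-8")
    value = json.dumps(asdict(spec), sort_keys=True).encode("utf-8")
    return key, value


def decode_spec(value: bytes) -> ClusterSpec:
    """What a data-plane consumer might do before handing off to an operator."""
    return ClusterSpec(**json.loads(value.decode("utf-8")))
```

Keying the message by cluster id is a design choice in this sketch: on a partitioned topic it would keep all spec updates for one cluster in order.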

Kris Jenkins: (13:43)

We had a couple of guests telling us about the control plane and data plane separation that happened, right? So you are figuring out what the new spec is, that goes onto a topic, that gets transferred down to a data plane that runs the actual Kubernetes job that resynchronizes everything?

Rashmi Prabhu: (14:01)

Right. And some of these are described as StatefulSets, some of these are Deployment objects. We let the operator and the Kubernetes API handle that bit of it. That's like the apply part of it.

Kris Jenkins: (14:18)

Does that form part of your fleet management remit or is that cloudOps-ish? Is that already there that you're building on or is it your job?

Rashmi Prabhu: (14:31)

Basically it's already there, the core of it is already there. Of course, there are usually some enhancements and stuff that always have to happen, but that core of okay, one unit of operation needs to be defined and we plug in the operation into fleet management and then we can scale that out across the cloud.

Kris Jenkins: (14:53)

Right. So you are building up the specific description of these particular types of batch job that perform the transformation, that's what we're saying?

Rashmi Prabhu: (15:04)

Right, yeah.

Kris Jenkins: (15:07)

And then we come into observing it all?

Rashmi Prabhu: (15:09)

Yeah, that's-

Kris Jenkins: (15:12)

Is that also the biggest job, right?

Rashmi Prabhu: (15:13)

Yeah, that's like a very interesting piece. And that's one of the, I would say, the highlights of having fleet management platform and using us to essentially drive all of these operations on the cloud. And that is like having active monitoring. At least from what I've observed in the past, and even to some extent even today, I think that a bunch of time goes away in people trying to make sure that things are fine. Of course, we use Datadog, so we have metrics, we have alerts, we have all of those things.

Rashmi Prabhu: (15:51)

But yet there is some amount of manual intervention that happens. "Hey, is this rollout going on fine? Is something not looking right here?" We've automated some of this for a subset of the fleet and we plan to extend that out further, but essentially our platform is going to listen in to the alerts and make sure that there is nothing bad happening on a particular target that we are operating on. But then if it detects that something bad is happening, we are able to make that decision to either pause it, or error out the whole thing so that we don't spread potential bad changes across the fleet. And we contain that within the blast radius.

Kris Jenkins: (16:43)

If you automatically noticed that something was going wrong with one particular recipe, let's say it is, you would hold it from being applied to other future clusters that are in the queue.

Rashmi Prabhu: (16:55)

Yeah, that's right. Basically we are cutting out that entire turnaround time, where a person would be alarmed about certain problems and they would have to log in and figure out what's happening and then go figure out what is the operation to go pause it and then go and debug. We are cutting all of that out and saying, "The system can react, we just need to know what to listen for, and we want to know what to do to pause a certain operation and that's it." Within a few seconds, we should be halting that process until someone comes in and tells us, "Okay, you can proceed or it's good you stop, don't proceed."

Kris Jenkins: (17:38)

Give me a concrete example of that? Take me through a scenario where actually it finds a useful metric on how it processes it?

Rashmi Prabhu: (17:51)

Let's say that there is a fleet wide upgrade going on for Kafka clusters. Today we have close to 2000 clusters running, and let's say that we have a rollout for image version 1.X going up. Someone has prepared all the clusters and they've fed that in. And they've called a fleet management API to say, "Go apply this across the fleet and let me know once things are done." Fleet basically, the system basically is going to chunk up this entire list according to certain criteria, once we have the deployment strategy based deployments.

Rashmi Prabhu: (18:35)

We'll chunk that up, we'll do some amount of parallelization. We of course, will not try to go one cluster at a time, because then it'll finish only after six months. We'll try to roll that out batch by batch. And then since we have Datadog agents and metrics and all of those being produced from the Kafka brokers, they are getting streamed out into our metrics API. They are also getting into Datadog, where certain alerts may have been configured.

Rashmi Prabhu: (19:08)

Now for each Confluent product, we have asked that each product owner defines what the critical alerts are, right? They say maybe the CPU should not go too high, maybe the throughput should not suddenly drop to zero, maybe the latency shouldn't start to spike. Those are some of the simple ones and there'll be a few more complex ones, especially around maybe storage sizes and all of those. Usually we have on the order of 10 alerts that are configured during rollouts for each product type.

Rashmi Prabhu: (19:49)

And once we have this, the fleet system basically constantly monitors and checks, are any of these alerts firing? And let's say it detects that maybe one of the clusters is showing a drop in throughput. Maybe it's dropping requests. So then we kind of know that, okay, this should not have happened. Then we immediately send in a pause signal from the fleet system into the control plane APIs, on all the currently running upgrades. Maybe we have 25 in parallel, we go and pause all of them. And we also block any further new ones from getting picked up and getting processed.
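The batch-by-batch rollout with alert-driven pausing described above can be sketched roughly as follows. This is a simplified, sequential stand-in for what would really be parallel upgrades; `upgrade`, `firing_alerts`, and `notify` are hypothetical callbacks standing in for the control plane API, the alerting system, and a chat notification.

```python
from enum import Enum


class State(Enum):
    COMPLETED = "completed"
    PAUSED = "paused"


def run_rollout(targets, batch_size, upgrade, firing_alerts, notify):
    """Upgrade targets batch by batch; pause as soon as a critical alert fires.

    Returns (state, processed, pending). When paused, `pending` holds the
    clusters that were never picked up, so they stay untouched until a human
    resumes or aborts the rollout.
    """
    processed = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        for cluster_id in batch:   # in a real system these run in parallel
            upgrade(cluster_id)
        processed.extend(batch)
        # Collect any critical alerts firing on the clusters just touched.
        alerts = [a for c in batch for a in firing_alerts(c)]
        if alerts:
            notify(f"rollout paused after batch {batch}: {alerts}")
            return State.PAUSED, processed, targets[start + batch_size:]
    return State.COMPLETED, processed, []
```

The key property is that a firing alert blocks every later batch, which is what keeps a bad change inside a small blast radius.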

Kris Jenkins: (20:35)

Sorry, can you pause them cleanly?

Rashmi Prabhu: (20:37)

Yeah, we should be able to pause it cleanly, yes, right in the middle of a broker.

Kris Jenkins: (20:42)

Even if it seems like it's halfway through?

Rashmi Prabhu: (20:44)

Yeah, let's say a Kafka cluster is 12 brokers large and maybe the sixth broker was just getting bounced and it was just coming up and we started to detect it. We basically send a pause signal and the seventh one will not be touched. It'll just hold. The operator is geared up to-
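The broker-level behavior in that example (pause arrives while the sixth of twelve brokers is being bounced, so the seventh is never touched, and nothing is rolled back) can be illustrated with a short sketch. The `restart` callback and the use of a `threading.Event` as the pause signal are assumptions for illustration, not a description of the actual operator.

```python
import threading


def roll_brokers(broker_ids, restart, pause_requested: threading.Event) -> int:
    """Restart brokers one at a time, checking for a pause before each one.

    Returns the index of the next broker to restart, so a later resume can
    continue from there. Brokers already restarted are left as they are.
    """
    for index, broker_id in enumerate(broker_ids):
        if pause_requested.is_set():
            return index   # hold here; the remaining brokers are untouched
        restart(broker_id)
    return len(broker_ids)
```

Resuming is then just calling `roll_brokers` again on the remaining slice once the pause flag has been cleared.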

Kris Jenkins: (21:08)

It doesn't try to roll back all six?

Rashmi Prabhu: (21:11)

No, no, it'll not roll back just yet.

Kris Jenkins: (21:14)


Rashmi Prabhu: (21:15)

That'll probably be V2 version of fleet.

Kris Jenkins: (21:17)

Let's not get ahead of ourselves. It's pausing that apply queue at the same time, right? And then presumably someone comes in to manually diagnose it before you continue?

Rashmi Prabhu: (21:31)

Correct, yeah. So we can go and pause it and that'll basically send in a notification. We've hooked up automated Slack notifications as well. So people don't have to come and keep polling the APIs to ask, "Is it done?" They just get the notifications on their Slack while they're working. And they'll notice that, oh, something went wrong and it had to be paused. And then they know that none of the other clusters are affected. They know exactly where to look and they can take their time and do it, right? And once it's fixed, then they can hit resume and we can keep going.

Kris Jenkins: (22:11)

And can you replace the unapplied ones with a new specification potentially?

Rashmi Prabhu: (22:22)

Resume basically continues rolling up the rest of the brokers. The entire Kafka cluster is now at the right image and the new Kafka clusters will continue to get the right version.

Kris Jenkins: (22:34)

But if you discovered it was a problem with the particular software you're rolling out and you patched that software, can you then go and say, "Everything that I was halfway through upgrading now needs to upgrade to this version instead?"

Rashmi Prabhu: (22:53)

Because when there is a fix that needs to go in, maybe it's a hot fix, we need to patch that in. We'll then have to restart the whole workflow. Basically that workflow is errored out, it has to be killed or aborted, and then we rerun the workflow and then the whole thing starts again.

Kris Jenkins: (23:16)

Out of interest, what does the tool set look like for someone running this process?

Rashmi Prabhu: (23:24)

I'm sorry, what was your question?

Kris Jenkins: (23:25)

I mean, what does the tool set look like? If I'm the person trying to do this job, am I running command line APIs? Have you got a user interface for the backend management? What's it look like?

Rashmi Prabhu: (23:37)

This opens up some interesting things around the evolution of what we've built so far. A few years ago, it was a handful of clusters, it was done manually. And then we went into tens to around 100 clusters. We started using scripts and then we started to look at 200, 300, 400 clusters. And then we discovered that scripts are not going to work out. And it takes a lot of time for people to manually monitor things, because there was a time when I was essentially opening 20 tabs on my browser and monitoring them.

Rashmi Prabhu: (24:15)

And then we moved on and said, "Okay, let's build an automated way of doing this." And now, with on the order of 1,000 clusters, we are able to do it through a CLI and basically call the fleet management API. And we have some of the APIs that are required to roll things out, check the progress and all of that. But now interestingly, what we are seeing is, that doesn't scale either, even the CLI becomes a bit of a manual overhead, because there is some extra learning that every new developer has to do.

Rashmi Prabhu: (24:52)

The engineering group is growing really, really fast, and for each person to now come and learn the CLI, learn that, okay, you have to log in, now you have to check this, read runbooks, it consumes time, everyone's busy. And we are trying to do a lot of different things. We cannot have just one part of the operation take up a lot of time from each developer. What we noticed, and after talking to different teams, we figured that having a UI might be a better way to do it.

Rashmi Prabhu: (25:24)

And so what we've done is, we basically just launched the new rollout service, the fleet-wide rollout service, and the fleet UI a couple of weeks ago. I think the team, everyone really came together and we shipped that over the last few months. Now we are basically looking at a UI-based solution, where a person can go in and create their rollout plans, they can watch the progress, they'll get all the data that they're looking for without having to read any runbook and run any CLI, it'll just be there on the UI page for them to go and take a look. We are making it much easier for developers to go and operate their fleet-wide rollouts.

Kris Jenkins: (26:14)

Oh, wow. So the whole backend developer experience for this has been a real evolutionary journey?

Rashmi Prabhu: (26:20)

Yeah, I think from the backend perspective, first of all, our team has no UI expertise. It's really amazing how everybody ramped up on that and started to build all of that and how grossly we had underestimated it initially on how much effort it takes.

Kris Jenkins: (26:40)

Frontend work is harder than it looks.

Rashmi Prabhu: (26:43)

Yeah, it is, lesson learned. But I think from the backend perspective, we've opened up, we've gone with a contract-first, API-driven backend. And it's a Java-based service, so we are kind of opening up the gRPC APIs for people to plug in their rollout plans eventually, and use the whole active monitoring, notification, all of that built in and go fleet-wide, run at scale and build out that system. And with additional features, let's say, it's paused, you should be able to resume a workflow without having to abort and rerun the whole thing all over again. And features like error budgeting, maybe one error-

Kris Jenkins: (27:38)

Error budgeting.

Rashmi Prabhu: (27:40)

Yeah, maybe there are one or two clusters where there was a blip, and we go and pause the whole workflow, but maybe that was not required. Maybe it was just a one-off thing and someone has looked at it and it seems fine. And we saw that this started happening quite a bit, where sometimes some of the alerts may be a little too sensitive. And so we want to allow putting error budgets where we say, "Tolerate a little bit of error, it's okay," right? And just keep going. And maybe 10 clusters are erroring out across various places and then maybe it's a problem. And then we stop. Essentially it'll make it a little bit smoother.
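An error budget of this kind can be sketched in a few lines. The class name, the per-rollout scope, and counting distinct clusters rather than individual alerts are all assumptions for illustration; the episode describes the idea, not its implementation.

```python
class ErrorBudget:
    """Tolerate up to `max_errors` distinct errored clusters per rollout."""

    def __init__(self, max_errors: int):
        self.max_errors = max_errors
        self.errored = set()

    def record(self, cluster_id: str) -> bool:
        """Record an errored cluster; return True if the rollout should stop."""
        # A set means a flapping alert on one cluster spends the budget once.
        self.errored.add(cluster_id)
        return len(self.errored) > self.max_errors
```

With `ErrorBudget(max_errors=2)`, the first two errored clusters are tolerated and the rollout keeps going; a third distinct cluster exhausts the budget and triggers the pause.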

Kris Jenkins: (28:26)

So you're programmatically allowing for the fact that reality is never smooth?

Rashmi Prabhu: (28:28)


Kris Jenkins: (28:29)

Fair enough. That seems very sensible. One thing, this is probably more getting right into the nuts and bolts, but is life made significantly harder by the fact that Confluent Cloud's rolling out over things like AWS and GCP? You've got a lot of different APIs under your APIs that you're dealing with. Does that make life much different?

Rashmi Prabhu: (28:56)

From the fleet management perspective? Not that much. I think because some of the APIs that we've written and the operator that has been written, they are built in such a way that... And because of course, Kubernetes that we use, I think they've been built in such a way that the higher levels of abstractions need not absolutely worry about cloud specific features and cloud specific APIs. They get handled at a lower level.

Rashmi Prabhu: (29:28)

Of course, there is a little bit of special casing sometimes based on the cloud provider, but that gets hidden away. And we've consciously made that separation of concerns so that we don't have to deal with, "Oh, a particular cloud is such and such, and hence it has to be done this way." Not to say that eventually that will not happen. If there is something special that needs to be done for a certain cloud at the fleet rollout level, then we will do that. But otherwise, we would expect that it gets handled in the core API itself.

Kris Jenkins: (30:06)

We really are dealing with a level of abstraction above something like Kubernetes, where we're saying, "Okay, you've got this wonderful rollout tool for a single spot, but what are you rolling out when and what are you not rolling out? And what happens if something goes wrong?" It's that management of a vast number of different possible Kubernetes recipes that is really where you come into?

Rashmi Prabhu: (30:29)


Kris Jenkins: (30:30)

I think I'm clear on what fleet management means now.

Rashmi Prabhu: (30:35)


Kris Jenkins: (30:36)

There's one big thing that we touched on at the start that we haven't really talked about, which is, how do you do all this without the customer noticing, as far as possible?

Rashmi Prabhu: (30:51)

I think it's a combination of different things. One part is knowing what the fleet is looking like, right? We want to have visibility on everything that's running across the fleet. Basically we are currently kicking off building out the view API, which is a beast in itself. I could probably talk for another two hours about it, but in a nutshell, it's about getting a wide angle view of the fleet. Knowing what the control plane sees, knowing what is actually happening on the data plane, being able to go back in history and understanding how certain rollouts have gone in the past on specific targets.

Rashmi Prabhu: (31:38)

Maybe there is a particular cluster of a particular customer whose cluster is having constant problems, right? We should be able to go back in history, go back six months and see how many times during rollout this particular cluster kept alerting. And then like you can derive a lot of new insights on specific customers. Today a bunch of this is a little bit of tribal knowledge, and I think having this historical view of things will capture all of that knowledge into the system itself.

Rashmi Prabhu: (32:12)

We can then make things better for those specific customers and make some adjustments and see how to not have these kinds of problems keep popping up for them. Visibility is one aspect of it, right? The other aspect is the active monitoring that we spoke about, where before things go too bad, we will notice it and we will take the right steps. That basically helps in not having complete downtime for a customer and maybe they're degraded for a little bit, but then other customers are not getting affected either, because now it is contained to just a very small subset during that rollout.

Kris Jenkins: (33:01)

You might be in the situation where, because you've caught it quickly, two out of their 10 brokers went down and so they are affected, but not nearly as much as they would have been had it been a manual process to step in?

Rashmi Prabhu: (33:14)

Yeah, yeah. Because the time to react is that much faster by the system, whereas manually missteps can happen. People make mistakes, it happens all the time. So we kind of cut out all of that. And I think the Kafka system itself and the way the operator and all of these are written, we have certain checks at some of the low levels as well to make sure that not everything goes down, at max, it gets degraded, but it doesn't really go down as such.

Kris Jenkins: (33:55)

I can see that. So there's one other question I wanted to ask you about that, which is, how much is it relying on the fault tolerance of Kafka itself? Is that a big part of how much transparency you can provide?

Rashmi Prabhu: (34:05)

Yeah, I think fault tolerance within Kafka definitely helps a lot. It helps at the basic cluster level for a specific customer, right? Basically, like we said, it might be degraded, but it won't completely blow up. But I think getting those metrics and getting all of that out of different Kafka clusters and learning from it and making sure that the other customers are not getting affected who have not yet started to get upgraded, I think that is one of the key values that we bring, to be able to apply that and make sure that we are not destroying others in this process.

Kris Jenkins: (34:59)

Because I guess, with this kind of scale, you can step into one customer and manually massage things to make sure they're up quickly. But if you've taken down 50 or 100, there's no possible way you can react to that?

Rashmi Prabhu: (35:14)

Yeah, exactly.

Kris Jenkins: (35:17)

Gosh, these problems only come up at this kind of scale that you have to start automating for, it's quite an astonishing... I get the feeling, you get a kick out of doing that kind of work?

Rashmi Prabhu: (35:30)

Yes. It's been super exciting, I think some of the things that we are trying to build, I think they're there in terms of concepts in some of the larger cloud companies. It's there in Google, we know it's there in AWS. I think it's also there to some extent in Facebook and Uber, but not much has been spoken about it, there's not much information as such about it and it doesn't come together as, "Hey, in order to run at this scale in such a heterogeneous environment, you need to have all of these blocks in place," right?

Rashmi Prabhu: (36:14)

And I think that's what we are trying to do. Bring all of these things together. How do you do rollouts? What should the experience be? Because engineers are important too, we don't want to have crappy experiences. We want people to do all of these operations and then go off and work on other important things. Having visibility at various levels and the amount of data that we can get and the amount we can learn to make this better in the future. Having safety, right? Safety is part of our mission itself. And a lot of this that we are talking about is to bring that safety to our external customers while we are performing cloud operations on them.

Rashmi Prabhu: (36:58)

I think being able to, first of all, limit blast radius, all of that is one part of it. But eventually we are also looking at being able to apply maintenance blocks, cordon off a part of the fleet. Maybe describe it as a policy, where no matter what happens, that part is fully protected and there's no for power there. And we are upgrading everything else in the fleet and this particular part is completely safe. Maybe it's a Thanksgiving weekend and certain customers don't want to be disturbed during this time.
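As a rough illustration of the maintenance-block idea described here, the following sketch filters a fleet down to the clusters that a rollout may touch at a given moment. The `MaintenanceBlock` class, the `eligible_clusters` function, and the cluster and customer names are all hypothetical; nothing in the conversation says how Confluent actually models these policies.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceBlock:
    """Hypothetical policy: leave one customer's clusters alone during a window."""
    customer_id: str
    start: datetime
    end: datetime

    def protects(self, customer_id: str, when: datetime) -> bool:
        # The window is half-open: start is included, end is excluded.
        return customer_id == self.customer_id and self.start <= when < self.end


def eligible_clusters(clusters, blocks, when):
    """Return the IDs of clusters that no maintenance block protects at `when`.

    `clusters` is a list of (cluster_id, customer_id) pairs.
    """
    return [
        cluster_id
        for cluster_id, customer_id in clusters
        if not any(block.protects(customer_id, when) for block in blocks)
    ]


# A retailer asks not to be disturbed over the Thanksgiving 2022 weekend.
blocks = [MaintenanceBlock("retailer", datetime(2022, 11, 24), datetime(2022, 11, 28))]
clusters = [("cluster-a", "retailer"), ("cluster-b", "bank")]

print(eligible_clusters(clusters, blocks, datetime(2022, 11, 25, 9, 0)))
print(eligible_clusters(clusters, blocks, datetime(2022, 11, 29, 9, 0)))
```

During the blocked window only `cluster-b` is eligible; once the window has passed, both clusters are. Expressing the rule as data rather than as a manual exclusion list is what makes it a policy that the rest of the fleet upgrade can simply consult.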

Kris Jenkins: (37:36)

Maybe you block out all your florists the day before Valentine's Day, right?

Rashmi Prabhu: (37:44)

Exactly. And make sure that our customer is successful, which means Confluent becomes successful, right? Once we have all of this in place, that makes it a more well-rounded system or a well-rounded platform. And then we can use it in different ways. Having things described with policies, making it as flexible as possible, allowing overrides wherever necessary. And these are literally some of the projects that I think it's been around two years since I've been forming all of these ideas.

Rashmi Prabhu: (38:23)

And I'm so excited that over the past one year, we formed this team, the fleet management platform team, and all the engineers here, everyone's coming together, working on all of these designs, building out all of these systems, productionizing them and killing off one after the other and making this a complete platform. In fact this year is like the year where we've worked on rollouts, which is continuing, we are kicking off visibility, which is going to take off and we'll be taking off on safety as well to do the maintenance windows a bit of it.

Kris Jenkins: (39:03)

Plus you are presumably hiring a frontend engineer, you are writing a book because no one else has written this yet?

Rashmi Prabhu: (39:11)

I think you'll see some blog posts coming out of us very soon.

Kris Jenkins: (39:15)

That'd be a good start. Let's not write the book this week. Where do you think you'll be on this in a couple of years? It's always hard to estimate, but give me a guess?

Rashmi Prabhu: (39:24)

I would say in a couple of years, we'll have a perfectly automated rollout system and we'll have complete visibility, I think this is definitely a reality. We will have deployment-strategy-based rollouts because we are already working on it. We will have complete visibility across the fleet, and work on how this data can be consumed, how it can be ingested. And also have better-described safety rules and safety mechanisms so that all of these operations become seamless and very well automated.

Rashmi Prabhu: (40:07)

And eventually also have the system now tell you, "Hey, I detect some drift here, these certain clusters are running way behind. Looks like the fleet upgrades never reached these for whatever reason. Maybe they were aborted, maybe they were killed." Being able to roll out security patches very efficiently, right? You might have fleet running in two modes where one is an emergency mode, maybe there's a security vulnerability. We want to roll things out to all the ksqlDB clusters, right? How do we do it in the smoothest way possible?
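A minimal sketch of the drift check described here: compare each cluster's running version against the fleet target and report the ones that are behind, oldest first. The function names, cluster names, and version numbers are invented for illustration, and the sketch assumes plain dotted numeric versions.

```python
def parse_version(text):
    """Turn '7.0.2' into (7, 0, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def find_drifted(cluster_versions, target):
    """Return the IDs of clusters running below `target`, furthest behind first."""
    target_version = parse_version(target)
    behind = {
        cluster_id: parse_version(version)
        for cluster_id, version in cluster_versions.items()
        if parse_version(version) < target_version
    }
    return sorted(behind, key=lambda cluster_id: behind[cluster_id])


# Hypothetical fleet snapshot: one cluster is current, two were never reached.
fleet = {"ksql-a": "7.1.0", "ksql-b": "7.0.2", "ksql-c": "6.9.10"}
print(find_drifted(fleet, "7.1.0"))  # ['ksql-c', 'ksql-b']
```

Parsing versions into tuples matters: a plain string comparison would rank "6.9.10" above "6.10.0", which is exactly the kind of mistake that would hide a lagging cluster.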

Rashmi Prabhu: (40:42)

Risk versus pace is always a balance. I think in regular operations, it would be make sure that it's not too risky, but also not too slow. But maybe in security patches, you want to be really quick, because you do have some people monitoring it. So then you can make some workarounds. I think there are multiple modes in which fleet can run and fleet should support going forward. These are all some of the open questions that we are looking at, seeing how to address them.

Kris Jenkins: (41:18)

Cool. Presumably some of those will feed into each other, right? When you've got visibility of three years' worth of customer data, you can start writing new safety heuristic rules?

Rashmi Prabhu: (41:32)

Yeah, absolutely.

Kris Jenkins: (41:32)

As of now. It's an interesting future. If there's one thing I should take away from learning more about this world, what do you think the headline is?

Rashmi Prabhu: (41:45)

Can I give you four?

Kris Jenkins: (41:47)

Four. Okay, I'll let you have four. I'm counting though.

Rashmi Prabhu: (41:52)

I think one thing is making sure that we do whatever it takes to protect our external customers. I think that's the main core focus that needs to be there. Because we don't want to have or provide that crappy experience where a customer finally feels I could have managed this myself, right? Why am I paying Confluent? We don't want to be into that at all. Making things as seamless as possible, as safe as possible and keeping them notified, right? Notifications is important to a good extent, because they want to know what just happened. Maybe there's a problem.

Kris Jenkins: (42:36)

Sorry, not to interrupt your list of four, but do you think this view API that you're building will eventually feed back into something customer facing?

Rashmi Prabhu: (42:44)

Potentially, I think the safety part of that would actually be a little bit more customer facing thing, I would say, because then eventually we could get into, once we have all of this built up, we could potentially have customers define those safety rules for us, right? They might put in a policy saying, "Do not do this for me." And then it gets delegated to our customers to tell us how they want to be managed. And that's I would say it's a good way to be in.

Rashmi Prabhu: (43:22)

And the visibility part, I think for a particular customer showing some of their historical data might make sense, but since we run a lot of multi-tenant setups and all of that, maybe they may not be as interested in everything else that's happening across the fleet. But yes, for that customer, maybe there'll be value in bringing up some of that data and showing it to them.

Kris Jenkins: (43:50)

How often you've been updated without realizing it might be an interesting statistic.

Rashmi Prabhu: (43:55)


Kris Jenkins: (43:58)

Is that one or two that we've done? Safety and visibility. One and a half?

Rashmi Prabhu: (44:04)

Yeah, safety and visibility, definitely. And the other is, I think developer tools, we are trying to give a new meaning to it. It shouldn't just be limited to some quick scripts or just some CLIs. I think it's important to build a good product even for developers, because this is what developers do day in and day out, maintaining production, operations on production.

Rashmi Prabhu: (44:35)

And it needs to be intuitive, it needs to be easy to use, and it needs to be safe for them, where they don't have to worry about, "Oh, I just issued a V5 upgrade and I'm not going to sleep the next five nights until this finishes." It shouldn't be that way for them. It should be more like, I'm confident of what I'm doing because I know the system is there to support me and I can rest.

Kris Jenkins: (45:05)

I'm going to risk sounding a bit like an advert here, but I'll do it anyway. One of the things I quite like about working for Confluent is, we're building things for developers and it's a very different kind of business when you're building stuff for developers, right? Yeah. It's like you're talking to your own people in a way as a developer. But you're another layer on top of that, you're developers building for developers who build for developers.

Rashmi Prabhu: (45:31)

Yeah, a bit of-

Kris Jenkins: (45:32)

It's a very curious form.

Rashmi Prabhu: (45:36)

I think currently all of engineering is our direct customer, I would say, right? We are basically working with our developers, we get feedback from them right then and there. And we know what to work on next and what should be the priority. And I think we've learned a lot, speaking to other teams, speaking to other engineering teams and it has helped shape the product quite a bit and our roadmap and everything. I'm really thankful for that.

Kris Jenkins: (46:08)

Wow. Cool. Are you hiring at the moment?

Rashmi Prabhu: (46:12)

I want to say I'm always hiring.

Kris Jenkins: (46:13)

Okay, fair enough.

Rashmi Prabhu: (46:16)

We definitely need developers, engineers coming in with this kind of mindset, like be customer focused, be developer focused and build the right systems.

Kris Jenkins: (46:28)

Cool, maybe we should put a link to your specific job board in the show notes?

Rashmi Prabhu: (46:34)

Mm-hmm, yeah.

Kris Jenkins: (46:36)

Actually this has been really interesting. It's a bigger world than I expected. So thank you for teaching me about problems I don't actually want to manage myself, but I'm very, very glad you are thinking about.

Rashmi Prabhu: (46:48)

Yeah, it is nice talking with you too, Kris. Thank you.

Kris Jenkins: (46:51)

Thanks for joining us. Bye.

Rashmi Prabhu: (46:52)

Thanks, bye.

Kris Jenkins: (46:54)

And there we leave Rashmi, who I think has fairly claimed the title of the developer's developer. She's definitely hiring, so if you are as enthusiastic as she is about these kinds of problems at this kind of scale, check for a link in the show notes, and we'll link you to the job board. I will remind you before we go that if you want to learn more about Kafka, we've got documentation, tutorials, courses and more over at And when you want to apply what you've learned, you can easily get a Kafka instance running at Confluent Cloud.

Kris Jenkins: (47:28)

Sign up with the code PODCAST100 and we'll give you $100 of extra free credit. In the meantime, if you've got questions, or comments, or just warm fuzzy feelings about this week's episode, you know what to do. Click the thumbs up, type in the comment box, leave us a review, drop me a message on Twitter. All the links are in the show notes. It's not just Rashmi that needs to monitor the health of her services, Streaming Audio does too, so get in touch. And with that, it remains for me to thank Rashmi Prabhu for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.

To ensure safe and efficient deployment of Apache Kafka® clusters across multiple cloud providers, Confluent rolled out a large scale cluster management solution.

Rashmi Prabhu (Staff Software Engineer & Eng Manager, Fleet Management Platform, Confluent) and her team have been building the Fleet Management Platform for Confluent Cloud. In this episode, she delves into what Fleet Management is, and how the cluster management service streamlines Kafka operations in the cloud while providing a seamless developer experience. 

When it comes to performing operations at large scale on the cloud, manual processes work well if the scenario involves only a handful of clusters. However, as a business grows, a cloud footprint may potentially scale 10x, and will require upgrades to a significantly larger cluster fleet. Additionally, the process should be automated, in order to accelerate feature releases while ensuring safe and mature operations.

Fleet Management lets you manage and automate software rollouts and relevant cloud operations within the Kafka ecosystem at scale—including cloud-native Kafka, ksqlDB, Kafka Connect, Schema Registry, and other cloud-native microservices. The automation service can consistently operate applications across multiple teams, and can also manage Kubernetes infrastructure at scale. The existing Fleet Management stack can successfully handle thousands of concurrent upgrades in the Confluent ecosystem.

When building out the Fleet Management Platform, Rashmi and the team kept these key considerations in mind: 

  • Rollout Controls and DevX: Wide deployment and distribution of changes across the fleet of target assets; improved developer experience for ease of use, with rollout strategy support, deployment policies, a dynamic control workflow, and manual approval support on an as-needed basis. 
  • Safety: Built-in features where security and safety of the fleet are the priority with access control, and audits on operations: There is active monitoring and paced rollouts, as well as automated pauses and resumes to reduce the time to react upon failure. There’s also an error threshold, and controls to allow a healthy balance of risk vs. pace. 
  • Visibility: A close to real time, wide-angle view of the fleet state, along with insights into workflow progress, historical operations on the clusters, live notification on workflows, drift detection across assets, and so much more.
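The safety controls listed above (paced rollouts, an error threshold, and automated pauses) can be sketched as a small batch loop. This is only an illustration under stated assumptions: `run_rollout`, its parameters, and the `upgrade` callback are hypothetical names, and the real Fleet Management Platform is not described at this level of detail in the episode.

```python
def run_rollout(clusters, upgrade, batch_size=2, error_threshold=0.25):
    """Upgrade clusters batch by batch, pausing if too many upgrades fail.

    `upgrade` is a callable that takes a cluster ID and returns True on success.
    Returns a tuple: (status, upgraded, failed, remaining).
    """
    upgraded, failed = [], []
    for start in range(0, len(clusters), batch_size):
        batch = clusters[start:start + batch_size]
        for cluster in batch:
            (upgraded if upgrade(cluster) else failed).append(cluster)
        # After each batch, compare the cumulative failure rate to the threshold.
        attempted = len(upgraded) + len(failed)
        if len(failed) / attempted > error_threshold:
            # Automated pause: untouched clusters stay on the old version.
            return "paused", upgraded, failed, clusters[start + batch_size:]
    return "completed", upgraded, failed, []


# Hypothetical fleet in which two clusters fail to upgrade.
fleet = ["c1", "c2", "c3", "c4", "c5", "c6"]
broken = {"c3", "c4"}
status, upgraded, failed, remaining = run_rollout(fleet, lambda c: c not in broken)
print(status, upgraded, failed, remaining)
```

Here the rollout pauses after the second batch, leaving `c5` and `c6` untouched. The two knobs are one way to express the risk-versus-pace balance: a smaller `batch_size` and lower `error_threshold` favor safety for routine upgrades, while an emergency security patch might accept larger batches to move faster.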
