They say the cloud is somebody else's computer, and that's true and useful, but it's no good if something isn't automating those computers, particularly when you've got a complex managed service like Confluent Cloud, which has Kafka and Kafka Connect and Schema Registry and ksqlDB and things like that. There's a sophisticated control plane behind all that, pulling the levers and twisting the knobs to make it a pleasant service. Today I talk to one of the engineers who builds that, Rashmi Prabhu, about some work she's done on rolling upgrades. So there are some neat things to uncover about how a cloud control plane works, all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello and welcome to another episode of Streaming Audio. I am, as per usual, your host, Tim Berglund, and I'm very glad to be joined in the virtual studio today by my coworker Rashmi Prabhu. Rashmi is an engineer on the control plane team, part of the group of people who build Confluent Cloud. Rashmi, welcome to Streaming Audio.
Thanks Tim. It's really nice to be here with you.
Cool.
And thanks for this opportunity. I'm looking forward to discussing a little bit more about the work that we've done.
And I'm looking forward to learning a little bit more about it. Before we get started, tell me a little bit about your background. How did you come to be doing the work that you're doing now?
So I joined Confluent, I think it's not quite been two years, but almost two years ago, and I joined the control plane team. It sounded like something new and exciting, I hadn't done anything like this before, so I thought it was a great opportunity to learn, and that's how I'm here. I thought I'd give it six months, but it's been two years.
I like that. It's a very, like, tentative investment: "Maybe six months, we'll see."
Six months.
"I don't know if I'll even like it." So yeah, on the control plane, that's the set of services and software and systems that make everybody else's clusters run in Confluent Cloud, right. Is that a good way to define that?
Yeah. So in Confluent we have a whole suite of products, we've got Kafka, ksqlDB, Connect so all of these Schema Registry, all of these run as separate instances in the cluster, in a particular cluster. So we basically have separate clusters for different customers sometimes multi-tenant and it's all of the cloud on all the three cloud providers and that's sort of like the data plane. And so to govern all of these we need a slightly next level of services which we call this the control plane. And that kind of determines how we operate on these customer clusters, what are the APIs that we provide our customers and various such functionalities.
Yeah. And it's a good point about there being lots of different kinds of clusters, to some degree. Like, I sign into Confluent Cloud and I go into an environment, and if you're not a Confluent Cloud user you don't know what I'm talking about, but also if you're not a Confluent Cloud user, like, "What do I have to do?" Start: sign up, become one. There's a promo code, if you listen all the way through this episode, there should be a promo code in the outro stuff that I say that gets you more free cloud stuff. Anyway, it just looks like one thing, but there are some little tells, like, "Okay, I'll set up a connector. Oh, I need a key and secret, or there's a Schema Registry that needs a key and secret."
So anything that talks to anything else you're building, and you don't have to do much to manage it, but there are these keys everywhere, which is just a tell that, yeah, there are actually separate things happening that are talking to each other in a secure way. At least from a user's perspective. I don't know a lot about how the control plane works, but I can see that, and I can see how, of course, there are these separate clusters. And it's a good thing you do what you do, because one of the good things about Confluent Cloud being a cloud thing is that somebody else is managing the computers, obviously, right? That's the cloud, that's why we have the cloud, but it's more than that. There's a lot to do to operate all these pieces together, and we have this nice control plane that does all that stuff. So talk to us about, I guess we probably know historically what the pieces of operating a Kafka cluster were like, but specifically, what are some of the things that the control plane has to manage for us?
Yeah. So I think some of the operations of interest here would be provisioning, the provisioning of clusters. When you sign up, you want to create a cluster, we provision it for you, it goes through the control plane. We have rolling restarts and general day-to-day management operations: we perform upgrades of those clusters, make sure they're at the latest image, that they've got all the latest fixes, the latest configurations and features. All of those are managed through upgrades and rolling restarts of those clusters. We also have elasticity operations. So we have a set of different things that we do on Kafka and the other clusters that we run in the data plane.
You said elasticity, and that reminds me, I'm going to make sure there's a link in the show notes. I guess it was last summer, I don't know, it all runs together, but sometime during the pandemic we did some fun things with scaling in Confluent Cloud. One of them was a Basic or a Standard cluster, it was just zero to 100 megabits auto-scaling, but another one was a Dedicated cluster. And the difference, if you don't know this, is that with a Dedicated cluster you're actually specifying capacity, not the number of brokers, because what's a broker? We have this compute unit thing, and you're asking for a certain amount of that capacity. And so we scaled the cluster from like 50 megabits per second to 11 gigabytes or something like that, some crazy thing. So why did that work? Well, because Rashmi does what she does.
Oh, well the control plane org which has a lot of people.
The control plane org. I know there's more than one. Out of curiosity, so, everybody, this isn't a scripted question. I did not interact with Rashmi last summer when we were doing this. Did any of that come across your radar when we were doing that craziness? Were people talking about it?
About the?
The 11-gigabyte cluster.
I don't particularly remember it but-
Okay.
Yeah but then.
Fine. You know what? That's good.
Now, everybody is freaked out.
You know what? That's good, because if everybody had talked about it, then it would have been this big controversial thing, but it was one test engineer and I who collaborated, because I didn't know how to drive the test tool and he did. And I was like, "Okay, well, make sure you put it in a region where it's not going to kill us cost-wise," and everything was fine, because it was not a cheap test to operate, that's a lot of data. So, yeah, good that it didn't cross your radar, right? Everybody's like, "No, we're going home at five o'clock."
Yep.
Yeah. So what are, in your view, some of the more interesting things that you automate? You mentioned making sure everything's on the same image, and scaling, but what is difficult that you spend your time on? Without, obviously, we're talking about the internals of a cloud service and you don't get to talk about every proprietary this and that. But from your standpoint, as an engineer who builds this, what's the fun and hard stuff?
Yeah. So within the control plane we have a bunch of different systems through the stack, right? From the control plane and even at the data plane level, which make all of this happen. We have various APIs to realize these functionalities at the data plane, on the cluster. So a lot of different changes, improvements, scalability aspects, all of this has been done. A lot of work has gone in behind these APIs to make it even work. So for example, at the operator level we had a lot of changes to build out our roll controller to make sure that a rolling restart of brokers was seamless.
And you're referring to the Kubernetes controller.
The Kubernetes.
Yeah.
And at the control plane level, we had a lot of improvements made in the roll API to make sure that it scales when we try to roll a lot of different clusters. Now, all of these worked well, except there was one aspect which was costing our engineers a lot of time, and that was the problem we were trying to solve with this automation. So the way a particular cluster roll, as we call it, took place was-
That's a rolling upgrade.
Yeah, a rolling upgrade.
Okay.
Yeah. So a person would sit in front of the system, choose which clusters were the next target, and call the roll API on them. And then for hours and hours they would have to sit and actively monitor every different metric for these clusters.
So this is a Confluent SRE. This is in the old days of Confluent Cloud. This is a person, you're not talking about some poor soul who doesn't get to use Confluent Cloud, you're talking about how we did this.
How we did this, yes.
Okay.
Internally. This is how our team was.
Okay. All right. A lot of transparency here today, folks, we're revealing the embarrassing underbelly of the early days, but, hey, we were trying to get it started.
Yep.
Not this way anymore.
That's way.
This is how it was.
Yes, that's how it was. And it took a lot of time. Although it looks like a background job, it really wasn't, because it takes away a lot of energy from engineers, and engineers want to build. It's a little hard to just keep watching dashboards and immediately try to react to them.
Wow.
Make sure that customer impact is minimal.
Yeah.
And so we recognized this and decided that it needed automation. It just works better that way, it's less error-prone, and it gives a lot of time back to the engineers.
As I like to say, "This is one of those times when you realize it's too bad we're not using computers for this."
That's right, exactly, well said. And that's when we built a service, a microservice, that would basically replace the human operations that were involved in this whole process and be able to scale as we increase the number of clusters and customers, and as more and more productive engineers ship features out, to be able to scale along all of these dimensions that were changing. I think we had a good fit for an automation solution, and that's what we built.
Cool. And yeah, that sounds like automation. None of this is easy, if you're going to make it robust and handle edge cases and handle scale and everything, but again, when a person is watching a console and sort of clicking, yeah, we can build a service for that. You're not going to get a conference talk out of the cool machine learning classifier that you had to build to make this work; it's bread-and-butter automation.
Yeah.
Which, again, is not trivial, but you know what to do and you do it. And that's what makes a cloud service a more efficient thing.
Yeah. So basically we built functionality into the system which replaced all of the manual operations that a person, an engineer, had to do: validating those clusters, making sure that we are not touching something that we shouldn't really be touching, firing off those upgrades in batches, running concurrent rolls, and then actively monitoring them. And today we have features where we monitor clusters before an upgrade, make sure that it's okay to touch this, and then go ahead and apply the operation. And then while the operation is in flight, we keep actively monitoring it. If there is any problem, we automatically pause it, pause that operation midway, wait for things to stabilize, and we have the option of automatically resuming that operation or not. And then once the upgrade is done, we again have a series of iterations where we monitor these clusters, make sure things are stable before moving on to the next set. And that's how we spread out the changes across all the clusters.
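To make the shape of that loop concrete, here is a minimal sketch of a validate/apply/monitor cycle with pause and resume. All of the interface and class names here are hypothetical; the actual control plane services aren't described in the episode, so this only illustrates the pattern Rashmi describes, not the real implementation.

```java
import java.util.List;

// Assumption: some source of cluster health signals (metrics, alerts, etc.)
interface ClusterHealth {
    boolean isStable(String clusterId);
}

// Assumption: a thin client for the control plane's roll API
interface RollApi {
    void startRoll(String clusterId);
    boolean isRollComplete(String clusterId);
    void pauseRoll(String clusterId);
    void resumeRoll(String clusterId);
}

class BatchUpgrader {
    private final RollApi rollApi;
    private final ClusterHealth health;

    BatchUpgrader(RollApi rollApi, ClusterHealth health) {
        this.rollApi = rollApi;
        this.health = health;
    }

    /** Upgrade one batch of clusters: validate, apply, monitor, pausing if anything degrades. */
    void upgradeBatch(List<String> batch) throws InterruptedException {
        for (String cluster : batch) {
            if (!health.isStable(cluster)) {
                continue;                        // pre-check failed: don't touch this cluster
            }
            rollApi.startRoll(cluster);
            while (!rollApi.isRollComplete(cluster)) {
                if (!health.isStable(cluster)) {
                    rollApi.pauseRoll(cluster);  // problem in flight: pause midway
                    awaitStability(cluster);     // wait for things to stabilize
                    rollApi.resumeRoll(cluster); // then (optionally) resume
                }
                Thread.sleep(30_000);
            }
            awaitStability(cluster);             // post-upgrade soak before the next cluster
        }
    }

    private void awaitStability(String cluster) throws InterruptedException {
        while (!health.isStable(cluster)) {
            Thread.sleep(30_000);
        }
    }
}
```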
Got it. So for a trivial example, you wouldn't want to do a rolling restart when there's an under replicated partition or something like that.
Yeah.
So you have to look for a condition like that.
Yes.
Cool. And I imagine there are some nontrivial things that you discover along the way that you also monitor. So you make sure you're not going to break it because a rolling restart really does require taking brokers offline.
Yeah.
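As one illustration of the kind of pre-roll health check being discussed, here is a small sketch that uses the public Kafka Admin API to look for under-replicated partitions (partitions whose in-sync replica set is smaller than their replica set). This is just a plausible example of such a check, not the actual validation Confluent Cloud performs, and it assumes a recent Kafka clients dependency.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class UnderReplicatedCheck {

    /** Returns true if any partition's ISR is smaller than its replica set. */
    public static boolean hasUnderReplicatedPartitions(String bootstrapServers) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(props)) {
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions =
                admin.describeTopics(topicNames).allTopicNames().get();
            return descriptions.values().stream()
                .flatMap(d -> d.partitions().stream())
                .anyMatch(p -> p.isr().size() < p.replicas().size());
        }
    }
}
```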
Right, that's what taking brokers offline means. So I wonder, and this would probably be good advice for people who are automating things, whatever they are, because that's kind of a subspecialty of our vocation, right? People who automate stuff. Obviously there's the control plane engineer, where that's your whole job, building a cloud service, but on your good old enterprise software development team, of the kind I would have been a part of 10 years ago or so, there's the person who likes to automate stuff and is drawn to that. And the discipline is kind of getting a name, you call that developer productivity these days, and sort of a community is forming around that. I'm sure there's going to be a conference for that if there isn't one already; there should be. So, speaking to that person who builds automation-
I wonder what the social dynamics of that are, because you had this whole cloud SRE group who had a way of doing it that was manual: I have my warm fuzzies, I know what to look for, I've got dashboards, I know how to do this, I've learned how to do it in ways that minimize the probability of downtime. And I'm this human in the loop, which is the definition of unsustainable and obviously not going to work, nobody's ever going to have an efficient business running that way, but we know how, and we feel good about it. And now you're going to replace it with software. And yeah, we're all software developers, we like to replace things with software, but what was that like, getting this adopted once you built it?
Yeah, great question. I think there were two parts to this whole system. One was collecting requirements, designing, building, and then came part two: getting it adopted. We trust ourselves when we're doing operations, and we can't do this switch overnight and say, "Okay, now I let some machine do this." We want to be sure that things are running reliably, smoothly, and that it's actually going to do what we want it to do in a safe manner without causing customer escalations and-
Yeah.
Stuff like that. And I think in that whole journey we basically went phase by phase. We approached every team. We helped them configure and plug into our system. We made it as smooth as possible, hopefully. We basically showed all the advantages of using this automation, like dynamic configurations, being able to pause as needed, all of the safety mechanisms that were there, and then demonstrated how it would work.
And then we had people who recognized that it was good to start putting faith in this and shifting that responsibility over to the system so they could get their time back. And very slowly we saw ksqlDB clusters adopting it, then Schema Registry, Connect, all of them adopting this automation mechanism and this whole new process and taking up the responsibility, because you kind of feel empowered now, you're in control of your whole deployment process: whom to deploy to, when to deploy. You don't have to rely on some other team to do it. You have the API, you can go for it and do it as needed. And then came Kafka, which was a whole different adoption phase, where there were a lot of new features that were requested, mainly the extra safety features that are required for Kafka, but those are all built in and now they are fully using the API to do their own deployments.
Love it, okay. So it's adopted.
Yes.
What's next for this? It sounds like you've solved a lot of the problem. Are you done? Is there a next step for this part of the control plane?
Yeah. I think this is probably the first, I would call it a lengthy POC sort of thing, but I think we have a lot of new things that we want to build, especially... So we have the select, apply, monitor phases for an operation. We think we have the APIs to do most of the operational work. We have the system that can automate all of these operational aspects, and we're able to monitor operations as they go. But we have this whole selection phase that is still a little bit of manual work, and we want to automate that as well and basically enable a more controlled, phased deployment where the engineer can give a deployment strategy in the form of configuration and say, "Hey, this is how I want my deployment to go ahead. This is how I want my feature to be released across all the clusters." They might say some parts of GCP, maybe parts of Azure, but not everywhere. Being able to configure it that way in the form of a policy and then roll that out to the whole fleet.
Okay, yeah.
Yeah, I can't.
Like all Dedicated clusters in AWS, or all Basic and Standard in GCP.
Yeah.
Or something, whatever. You only have so much metadata about the cluster, but for all clusters that have connectors configured, or just things like that, because certain changes might affect those first.
Yeah.
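As a purely hypothetical illustration of "deployment strategy as configuration": none of the types or field names below come from a Confluent API, they just sketch how a phased rollout policy over cloud provider, cluster type, and connector usage (the selectors mentioned above) might be expressed.

```java
import java.util.List;

// Which clusters a phase targets (all fields are illustrative assumptions).
record ClusterSelector(String cloudProvider,   // e.g. "aws", "gcp", "azure"
                       String clusterType,     // e.g. "dedicated", "basic", "standard"
                       boolean hasConnectors) {}

// One wave of the rollout, with a cap on concurrent rolls.
record RolloutPhase(String name, ClusterSelector selector, int maxConcurrentRolls) {}

// The whole policy: what is being released and in what order.
record DeploymentPolicy(String releaseTag, List<RolloutPhase> phases) {}

class ExamplePolicy {
    static DeploymentPolicy example() {
        return new DeploymentPolicy("example-release-tag", List.of(
            new RolloutPhase("canary", new ClusterSelector("gcp", "basic", false), 2),
            new RolloutPhase("wave-1", new ClusterSelector("azure", "standard", false), 5),
            new RolloutPhase("wave-2", new ClusterSelector("aws", "dedicated", true), 3)
        ));
    }
}
```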
Okay, cool. So being able to add that automation is the next step for you?
Yeah.
Cool.
That's going to be.
The following could sound like a Confluent Cloud commercial, and I want to be clear that it's not, but a cool thing about this is what it lets us do with just the efficiency of the thing, right? And this is another, subtler economics of the cloud. It's been compared to public utility economics, where there's constant pressure to make more efficient utilization of resources, and there always is. I think I remember who it was, I think it was Nick Carr, like 10 or 15 years ago, in the early cloud days. He was writing some provocative things about the cloud. I'll try to find an article and link it in the show notes if I can, it's been a while. But basically there is a particularly relentless drive to push costs down if you're a cloud provider.
And that includes the raw infrastructure providers, you know, if you're one of the big three, or if you're a Confluent, who's a customer of theirs and providing a cloud service to somebody else. This kind of work is that dynamic in action, and so it allows cloud providers over time to be more price competitive, right? As the broader economic and product management forces and all that stuff circle around the product, prices go up, prices go down, things happen in the marketplace, but this efficiency engine, which is really directly what you're driving, gives the business the flexibility to make those decisions. And that can result in prices that go down, or services that do more with prices that don't go up, which is pretty cool stuff.
Yeah.
Beyond that, what's the most exciting thing you're looking forward to, that you get to work on in the near future and can talk about?
Yeah. I think one aspect would be what we just spoke about, and also adopting more and more of these kinds of operations, automating all of them, and then slowly growing out into... So basically we are building a team to also address larger efforts around the whole fleet management aspect, managing all of our clusters. How do we ensure maintenance windows? How do we do automated inspection, health monitoring, capturing events around all of this? So, yeah, I think those are some of the other goals that we have, and I'm really looking forward to it.
My guest today has been Rashmi Prabhu. Rashmi, thanks for being a part of Streaming Audio.
Cool. Thanks so much Tim, I really enjoyed this.
Hey, you know what you get for listening to the end, some free Confluent Cloud, use the promo code 60PDCAST that's, 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter at @tlberglund that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes, if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review and we think that's a good thing. So thanks for your support and we'll see you next time.
If you’ve heard the term “clusters,” then you might know it refers to Confluent components and features that we run in all three major cloud providers today, including an event streaming platform based on Apache Kafka®, ksqlDB, Kafka Connect, the Kafka API, data balancers, and Kafka API services. Rashmi Prabhu, a software engineer on the Control Plane team at Confluent, has the opportunity to help govern the data plane that comprises all these clusters and enables API-driven operations on them.
But running operations in the cloud in a scaling organization can be time consuming, error prone, and tedious. This episode addresses manual upgrades and rolling restarts of Confluent Cloud clusters during releases, fixes, experiments, and the like, and more importantly, the progress that’s been made to switch from manual operations to an almost fully automated process. You’ll get a sneak peek into the upcoming plans to make cluster operations a fully automated process using the Cluster Upgrader, a new microservice in Java built with Vert.x. This service runs as part of the control plane and exposes an API to the user to submit their workflows and target a set of clusters. It performs state management on the workflow in the backend using Postgres.
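For a rough sense of what such a Vert.x service could look like, here is a minimal sketch of an HTTP endpoint that accepts a workflow submission. The route path, JSON fields, and in-memory handling are assumptions for illustration only; the real Cluster Upgrader's API and its Postgres-backed state management are not public.

```java
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.handler.BodyHandler;

public class UpgradeWorkflowApi {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        Router router = Router.router(vertx);
        router.route().handler(BodyHandler.create());

        // Hypothetical endpoint: accept a workflow describing target clusters and operation.
        router.post("/v1/workflows").handler(ctx -> {
            JsonObject workflow = ctx.body().asJsonObject();  // Vert.x 4 request body API
            // In the real service, the workflow state would be persisted (e.g., in Postgres)
            // and a background process would drive the select/apply/monitor phases.
            String id = java.util.UUID.randomUUID().toString();
            System.out.println("Received workflow targeting " + workflow.getJsonArray("targets"));
            ctx.response()
               .setStatusCode(202)
               .putHeader("content-type", "application/json")
               .end(new JsonObject().put("workflowId", id).put("status", "SUBMITTED").encode());
        });

        vertx.createHttpServer().requestHandler(router).listen(8080);
    }
}
```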
So what’s next? Looking forward, the selection phase will be improved to support policy-based deployment strategies that enable you to plan ahead and choose how you want to phase your deployments (e.g., first Azure, followed by part of Amazon Web Services, and then Google Cloud; or maybe Confluent internal clusters on all cloud providers, followed by customer clusters on Google Cloud, Azure, and finally AWS). The possibilities are endless!
The process will become more flexible, more configurable, and more error tolerant so that you can take measured risks and experience a standardized way of operating Confluent Cloud. In addition, expanding operation automation to internal application deployments and other kinds of fleet management operations that fit the “Select/Apply/Monitor” paradigm is in the works.
EPISODE LINKS
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.
Email Us