There are a lot of specialties within the very broad vocation of software engineering and all of them are hard to do. Distributed Systems Engineering is one corner of the discipline that poses a particular set of challenges. What's it like to build a distributed system? What special problems arise? How do you land a job doing it? And that's the conversation on today's episode of Streaming Audio, a podcast about Kafka, Confluent and the Cloud.
Hello and welcome to another episode of Streaming Audio. I am as ever your host, Tim Berglund, and I'm joined in the virtual studio today by my colleague, Roger Hoover. Roger, welcome to the show.
Thanks Tim. Great to be here.
Great to have you. Now you work on the Confluent Cloud team here, and what I'd like to talk to you a little bit about today is really just this ... You're a part of this continuing series on what it is to be a Distributed Systems Engineer. How you got into distributed systems? How your career prepared you for this? What you look for in colleagues? And also because I think it's cool, I want to spend as much time talking about the Confluent Cloud control plane, as possible as we dig through those other questions. How's that sound?
That sounds great. And since the Confluent Cloud control plane is itself, a distributed system, it fits nicely with the theme.
Kind of seems like it works, huh, it didn't seem too controversial to me. But what is your actual role? You're an engineer at Confluent, right?
Yes. And I'm working now on what we call the Confluent Cloud control plane team, although there are actually sub-teams inside of that. But I actually started on the very first where we had a single cloud team building all of Confluent Cloud. I was kind of the first three engineers working on Cloud before when it was just a little baby. I've kind of been there from the beginning.
Oh, wow. What made it interesting to you? Like you said, it is fundamentally a distributed system, but was that it, why did you get into it?
Yeah, so I guess maybe backing up a little bit more. I was a Confluent user back on 1.0, I was a Confluent user before I was a Confluent employee. And the reason I got started in it was that previously I worked out of school at etrade.com. And that was also a distributed system, but it was a little bit different, it mainly got ... We had databases as the kind of root state, and everything else tended to be stateless, but nonetheless the distributed system and interesting problems there.
But at E-Trade, I actually left and did a couple of startups and came back, but at one point on the way back, E-Trade was very interested in collecting real-time events, and making them available for things like customer service teams to give better service when someone calls in, and for the Ops teams to have a much better sense of what's going on in their giant distributed infrastructure.
E-Trade was actually microservices before that was the word, they were microservices in 1995 or something. But back then it was ... They were built in C, and they were built on a product called BEA Tuxedo. But from the very beginning though, they had a services architecture where a request might come in from the website, and span out to sometimes hundreds of different services to actually gather all the information to show the customer.
Making sense of latency in that kind of environment, you really do need a lot of information gathered from different sources, and that's where we put Kafka to work. And when I started that project, Kafka and Confluent didn't exist. Kafka company did, but not Confluent platform. I was kind of implementing my own version of schema registry and things that ended up being part of Confluent platform.
As soon as Confluent platform came out, I was like, "Oh, thank God, I don't have to maintain this anymore." I became maybe one of Confluent's, very first users. I don't know for sure if I was the first, but I was definitely one of the first. And as you can imagine, Confluent 1.0 had some rough edges, I guess.
Sure, sure. I'm willing to believe that.
I actually submitted some fixes for things and submitted issues to get up and so on. I think that's how I got on the radar, eventually, they reached out to me, and said that "Why don't you work on Confluent? Why don't you come to work for Confluent?" And that became interesting because at the same time, E-Trade was also looking to go from an on-prem deployment to at least explore cloud and more dynamic environments, including say Kubernetes on-prem or Kubernetes on the cloud.
They were very interested in that space, and so a colleague of mine, and I had spent time also at E-Trade figuring out how to automate Kafka deployment on Kubernetes, and essentially build somewhat of a control plane for Kafka. When they have reached out, my thought was, "Are you going to do a cloud product?" Because if so, I feel like I have the background to be able to contribute. And so she basically said, "Yes," it was in their future, it wasn't right away. But basically if you join, you'll get to work on it. That's my long journey to landing at Confluent.
Nice, nice. So that Kafka control plane was a key part of that vision?
And by the way, I didn't know that, that you were nearly customer zero. That's kind of cool.
Sorry, if you want to side track real quick. I even have a contribution to Apache Kafka before that. I might've been the first person to run it on OpenStack, I introduced the feature to separate advertised hostnames from the internal hostnames. [crosstalk 00:06:36]. I don't know if you're aware of that feature, but that was the ... I added that to Apache Kafka before it was even-
Strikes me one might need it.
Yeah, exactly. So anyway, I have somewhat of a history, I guess, with Kafka prior to Confluent.
That is so cool. You were a long time at E-Trade, too, right? You said you stay right out of school back in the when days, and then a couple of startups, and then back at E-Trade?
Yes, that's right. I was there, I guess for-
Maybe five years right out of school. And I don't know, do you want me to talk a little bit about what I did there?
Yeah, again, from the perspective of how it formed you as a Distributed Systems Engineer, that prepared you for the work you do now. Walk us through that, with that lens on. E-Trade the early days.
Yes. I was on a team at E-Trade that was in charge of the common infrastructure that the whole company relied on. What that meant was, we would maintain, say the database access layer and any features that people needed we kept it up to date, and meeting their needs. The other one was the microservices infrastructure, as I mentioned, it started off as a BEA Tuxedo deployment.
During the time I was there, we converted to a soap-based services. And then later to Jason, and the team I was on built tooling, because at the time it didn't exist to essentially auto-generate all of the mind-numbing XML parser code that you need to pass messages around and see. That was kind of our job, but where it's, I think, particularly relevant to what I do now is that, that team really sat between developers and production, right?
I would say E-Trade, didn't have the DevOps model, they had a little bit maybe older model, where you kind of had development teams. They didn't necessarily have access to production. Then they handed off their apps to an Ops team. And the Ops team took the pager duty pretty much, right? I wouldn't say that's the best model we've probably learned from then, how better to run software, but at the time, at least our team was right in the middle of those two worlds, right?
We were kind of on the line to make sure that all the core components of the system were working properly, and had to work in production. And likewise our customers were E-Trade's developers, right? Anything we could do to essentially automate the dull work out of their lives we did it. I think that at least prepared me to have somewhat of an SRE mentality.
I would say, I guess, I don't call myself an SRE on paper, but from the very beginning of Confluent Cloud, we had this idea that you run your own code, and everybody needs to understand that any change you make, you have to think about it all the way through to production, including metrics, and alerting, and runbooks, and everything that you need to go to production.
I think that was kind of a big lesson in my career, I think I also learned a lot of maybe practical firefighting just experience, when you're on call at ... E-Trade always had this thing, because the market opens at 9:00 AM Eastern, so anytime you made a change, you had to be on it, like at least by 5:30 AM Pacific. I remember a lot of mornings where you're super high stressed, it's like 5:30 in the morning, and trying to make sure everything ... Or trying to fix the fire, or make sure it's not going to happen. Definitely-
Right, [crosstalk 00:10:23] about to happen.
I had a lot of painful experience that taught me lessons about what not to do.
Totally. Which sounds like good preparation for building the control plane for a cloud service that provides a data infrastructure layer. Because not only ... There's nobody who wants to get paged, right? You don't want your phone to wiggle at an off hour, nobody wants that. But we, I mean, speaking as a Confluent employee, we have this responsibility to provide that infrastructure layer to people that also does not provide a cause of them getting a phone wiggling at the wrong hour. It's not just you, but for the product to succeed, for the service to succeed, you can't be doing that.
We're essentially taking the Ops burden off of the customer, right? And we're doing it many, many, many times over, so we can't do it via human manual processes. I think that's the big difference, I guess, between say Confluent Cloud as a problem versus E-Trade as a problem. If you look at E-Trade as a ... Or any typical maybe SaaS service, they may be large scale, but they don't change that frequently in the sense that you typically don't deploy in some new region all the time, you don't change your architecture very often.
You know what I mean? It tends to be small tweaks, you scale 100 servers to 110. It's not like you start from scratch and build the whole thing up from nothing. In that sense, I think Confluent Cloud is another level of degree of difficulty, because we are essentially spinning up infrastructure on demand all the time, and scaling it for customers all over the place. That's not typical, I think of most SaaS software products.
Right, because most SaaS products are applications and not infrastructure and uptime. I'm not going to trivialize anybody's uptime, because there are applications that are critical to business getting executed. Well, there just isn't anybody in the world who, like I said, there's nobody who wants to get paid. There's no business that wants crappy uptime. Everybody's uptime matters, but infrastructure just feels a little more urgent, because it's definitionally, everything's going to break if the infrastructure is broken.
Yeah, that's a great one. Especially a data service.
You're not going to like, "Well, we have a way of doing orders on paper, let me just ... Give me a call, and I'll fill this thing out." You can't.
You can't do that.
Or even think about analytics say, if your data warehouse is down, you can probably live with that for an hour.
You can, you can. Its bad but can live with it.
It's bad, but if your online event system is down, you might not be able to do what you need to do.
Literally anything, it matters. With that in mind, and you're not the first person I've asked this question to, and I give the same qualification every time, because I want to be careful that this isn't ... That nobody hears this in an elitist way, because there's lots of different kinds of engineers to be. And each one has its own special skills that you need to be successful, special frustrations.
But thinking specifically of Distributed Systems Engineering and one of the purposes of this series is, for people to be able ... For engineers who aspire to do this kind of work, to get an idea what it's like to do it. With that in mind, there's somebody out there who's a software developer, They think, wow, distributed systems sound like a lot of fun, that's what I see myself doing in the next few years. What do you think makes a Distributed Systems Engineer different from say, the full stack Java developer that I was 10 years ago.
That's a fun question. I guess, I have a ... Here's my perspective. There's three things that I tend to look for and wanting my colleagues. One of them I think is, it's kind of a design sense, or a feeling for trade offs, because there's always trade offs to be made. And usually it boils down to ... You can usually think of some really awesome design that would the ideal world you want. And then you have the system that you currently have to live with, and you have to trade off some path to iterating toward that ideal world.
There's definitely, I think, a skill and a practice that comes around being able to balance in your mind where you're headed. So kind of a North star plus how you're going to get there in small increments and constantly deliver value. I think what I tend to find is that there are engineers who are lean toward one side or the other. Some people just really want to get stuff done, and sometimes don't think hard enough about where are we going.
And there's definitely people who maybe think a lot about where we're going, but might get an analysis paralysis, and not be able to get some feature out the door. There's kind of this subtle art, I guess, of making trade-offs that I think is important. I think the second thing that maybe isn't super obvious but becomes really important as a company scales is your communication skills, particularly writing.
And for me that became really clear when I spent a little bit of time in the Apache Kafka, and SAMHSA communities, and watched people like Jay Crips, explaining their ideas to others, because in some sense to scale yourself, you have to be able to ... You can't do all the work yourself, right? In a large complex system, you can't be the one who's writing all the code. At some point you have to get your ideas in someone else's head, right? And have them do it for you effectively.
Ultimately being able to articulate your ideas very clearly and what trade-offs you're making, I think, in writing in particular becomes really important. And then the final thing for me that I want to see in everyone here's a sympathy for operations. Basically, if you've never run an app before in production and taken major duty, you probably just can't really know what that's like, and what things are really important for that.
I mean, not saying, you can't learn that, but it's very important to develop, I guess, that muscle or whatever, that as you're building a feature, you need to be thinking about Ops from the very beginning. How will someone know if this is working? How will your colleague, when they get woke up in the middle of the night, how will they be able to triage this? That's kind of what I mean by the sympathy for Ops.
Right. Now that makes all the sense in the world. I could see that being a reasonably hard hiring requirement, right? You have to have operated things for us to trust that you've got that empathy.
Yeah, that's a great point. And that comes back to kind of the state of the industry. If in some sense, like DevOps is a relatively new phenomenon, you don't always find people who have done both.
Right. Well, it is a new phenomenon, relatively speaking. Has a term, it's less than 10 years old in the popular lexicon, and a lot less than 10 years old in the people actually putting it into practice. I mean, it's controversial terms, just in general a great way to get a group of software developers to fight is just get them together and ask them what DevOps means. And if you have seven engineers, you get at least eight views, maybe nine. And again, then they'll be preoccupied fighting with one another and then you can go do something else.
It's not just new, well, it's something like developer relations, right? It's like what I do not, not a terribly well-established term or discipline. It's like, "Okay, that's a thing that people have been doing," but did people say that 10 years ago. No, they didn't, they were definitely saying at five years ago, I'm not sure when they started. You still have that same sort of, when did this become a thing? And there's just no consensus around it as a professional concentration. Everybody thinks they know what it means, but there's still a lot of disagreement.
Yeah, that's a good point. I think with maybe one other thing with Confluent, at least, we also need to add on two more things that people haven't always done, which is running a data service. Stateful services are particularly hard to run compared to say stateless services. A lot of the previous iterations of Ops, people would just say, "Oh, when in doubt restart it, right." But that's generally not a good idea for stateful applications. And so that's kind of one of those Confluent specific things that we also really need as people who understand that.
Yeah, absolutely. Tell us ... We've got a little bit of time here, and I'd love to dig in just since I've got you, just into the details of the Confluent Cloud control plane a little bit. You gave us a brief description of what it does, but just kind of give us, I mean, you're one of the people who knows it better than anyone, so help us fall in love with it. What's it do, how's it made?
If you think about Confluent Cloud we are managing, now, we've grown to over 10,000 Confluent platform customers, as clusters, sorry, that we run on behalf of our customers. That's actually a pretty large number, I'm not saying there aren't people out there doing more than that, but that's a sizable number, and we have to manage different types. We have Kafka platform consists of Kafka, connect, ksql and registry.
We run those on behalf of our customers, we run wherever the customer wants to be, or wants us to be. That ends up meaning, pretty much every all three major cloud providers, and pretty much every region that they're in. Right now we're in more than 40 regions, where we're running this stuff. The other thing is that, our customers want a variety of network access models, right?
Some of them have public IPS that can access it that way, some people want private networking, so that can mean VPC peering, that can mean AWS private link, things like that. The reason I bring that up is just to say, there's quite a variety of things we have to run and they come in a bunch of different shapes and sizes. So we have customers from who might be paying us a few bucks a month to get started. Maybe they're planning around with Kafka.
We have customers who are running in production with very demanding workloads, that creates a wide variety of conditions, and we have to try to keep the service up with high availability under all those conditions. Underneath of that, powering 10,000 customer facing clusters under the covers, we're managing tens of thousands of resources at the cloud provider layer, right? That means creating VPCs, creating VMs, creating volumes, creating DNS records, we're managing software artifacts, like what version of Kafka is running, user management. The list goes on and on and on. I bring all this up to say that it's a pretty complex application, and it's a distributed system spread across the entire world.
Yeah, it is. Can you talk about how that spreading is done? What the constituent pieces of the control plane are, and where they live? I realize at some point this is all intellectual property, and there's a little bits of secret sauce, so whatever is cool to talk about get into that globally distributed thing.
Yes, I will. So we've logged a little bit about it publicly, so I can talk about that stuff, I think. At the top level, there are two layers of control plane, you have your very top layer. We call it the mothership, because it's kind of the central brain. And then there are a bunch of satellites, and so satellites are where we run the customer workloads. Those are those 40 regions right across cloud providers, that are doing the heavy lifting of the customer data.
But the mothership is where certain global information lives, like users, identities, environments, organizations, credentials, authorization, rules, those kinds of things live in the mothership. Also because it's convenient, we centralize cluster, some cluster metadata in the mothership. That way when you log in to Confluent Cloud, we can show you all your clusters. We can show you metrics about them, how much you've used them, things like that, right?
We have nice visualizations called data flow for showing even data flowing between your producers, consumers, and your clusters. The mothership kind of has this top level information, but the reason that we have satellites, and you might ask yourself, why don't we just have a mothership manage everything, right? The reason we don't do that is, we don't want the mothership to be the central point of availability, such that if it was down, nothing would work in the entire world.
Essentially we want each satellite to be as independent as it can be. I use a term I borrowed from aircraft's terminology called, static stability to describe this, what that means for aircraft's if they're statically stable is, let's just say the planes flying along perfectly level, and it gets a wind gust that pushes the nose down, right? A statically stable plane will naturally come back up to level. That's what it means to have positive stability.
If you're on unstable plane, that gust pushes your nose down and you either keep flying down or you even do worse than like keep pointing straight to the ground. Static stability is a good thing, right? You don't basically-
You want that.
Yeah, you want that, right? You don't want the pilot to have to intervene to deal with every little gust of wind that hits the plane, right? That's what I mean in this case where you don't want the mothership to intervene, if something goes wrong in the satellite. So I'll give you an example, let's just say we're running AWS, we're running a Kafka Cluster for you, and one of the AWS instances is terminated, because it's bad or something. AWS terminates it.
You want it to get replaced, right? You want your cluster to come back to the same state it was previously in. That's an example where we don't want the mothership to have to be available for that to happen. The satellite should just be able to take care of it on its own. It has enough information locally to make that happen. So does that make sense so far?
It does, and I'm assuming since there's one mothership, as suggested by the name at least, and that's going to be hosted in some cloud. It doesn't matter which, but you'll pick a provider and it'll live in that cloud provider. Maybe we are particularly ambitious people, and we run it in on-prem hardware, whatever, it's somewhere. And when that somewhere is down you want say, it's hosted in AWS, but somebody's got a cluster running in GCP, you want that GCP satellite still to be as functional as possible so that you've got isolation between cloud provider availability.
Yeah, that's exactly right.
That strikes me as another good thing that you'd want there.
Yes, exactly right.
And I am right about that. That's good. All right. [crosstalk 00:27:15], my next question.
Tim is right.
I think I know how the mothership works, okay, good. I mean, I know a little bit about this, but all right, so you got mothership, you got the satellites. And so I could still log in potentially, but maybe not create a new environment, or something like that.
Yeah, that's right. Well, you may not be able to log in, because authentication is centralized, but-
Your application will keep running. The mothership could be completely down, you can't say, create a new user, make a new key, or change your ankles, or those kinds of things. But your application just continue running. You can't change anything about your setup, but it should just stay the way it is.
I think that kind of describes ... Now, with that in mind, you can think, "Well, okay, there's two control planes effectively, right? We have satellite level control planes, and we have other ship control planes." Now inside of each of those, those aren't monoliths either because they have a lot of work to do. Even in within those two things, we have multiple services. Just to give you an example even within the mothership itself, think about what it has to do. There's account management functionality, right?
When a customer signs up, they have to go through a flow, they can invite their colleagues, they manage access, there's a bunch of account management to do there. There's as I mentioned before, we have to aggregate all ... There's a metric service, we aggregate stuff together there. There's a service for collecting all that data and rendering it. There's fleet management, which the customers don't see, but internally to Confluent, we have to have a way of upgrading Kafka all the time.
Let's just say we implemented a new feature, or we need to get a security patch out. As I mentioned, a few moments ago, if we're running thousands of clusters, how do we get a new version of software out in some sane fashion over that many clusters, right? That's a big problem unto itself, and then you can imagine all these resources that we're managing, they have their own limitations on the cloud provider. So we need to basically keep track of what say, AWS accounts we're using, Google projects, whether there're limits, do we need to increase the limits, and so on and so forth.
That's just kind of a sense of what's ... Even within the mothership, there's a ton of services. We have to manage certificates and so on, right? And then in the satellite, that's the top level. In the satellite, there's still sub services, most of them are very similar to a product that we offer on-prem called, Confluent Operator. Confluent Operator came out of the cloud, it was first day cloud thing, where we used it to essentially automate the tasks, like the common tasks in the satellite, like upgrading versions of software, increasing the scale, you're doing the configuration.
All those things we build on Kubernetes, and they're largely the similar to the Confluent Operator product. I also considered Kafka's Controller part of the control plane because it's not part of the data plane and it controls stuff. The functionality that the Kafka Controller does in the case of like self-balancing Kafka, we've built that into the controller or tiered storage and Confluent Cloud is essentially an extension to the controller.
If you think about those problems, like generating plans to balance Kafka, or when do you with tiered storage deciding when to copy a segment to blob storage, and then you can delete it locally. That's also part of the control plane, but as you can see that's generally sort of built into Kafka, and then there's other parts of it that are more Kubernetes based, which would be the operator bits. They all kind of have to work together to make it happen.
And I actually didn't know tiered storage took place in the controller to any substantial degree. I thought that was just at the broker level and not a special broker.
That's fair. I think you're right. Probably the leaders are deciding on any given segment. I suspect the controller is involved to decide ... Actually, I'm not really totally sure what the controller itself would do, but that's a good point. I would consider the leaders centrally as manning controllers as well, right? They're making control decisions.
No, they absolutely are a part of the control plan. And I mean, I hadn't even thought just broadly about the Kafka Controller as being a part of the control plane, but that's one of those things that when you say it, my internal monologue was like, "Oh, yeah, that's really true, huh." It definitely is.
I find that's kind of something that you might not ... It's not super obvious, but when somebody says it, you're like, "Yeah, okay, it does." Yeah, of course it, if it's not part of the data path, it's control, right?
Control, yeah. But it's just when you say control plane, like even ... So listeners, if you don't know anything about the structure of Confluent Cloud, I mean, I've talked to engineers who work on it before, I know like four things about it. Not a lot, but that there was this mothership satellite distinction, and when you're listening to Roger talk about what the mothership does, you're like, "Oh, yeah, okay, of course you'd need to do that, somebody is going to do certificates, somebody's got to do auth, cluster metadata, yeah, of course, there's a place where you do all that."
You get that, that sounds like control plane. You were talking about operator, which is the comfort of product that's the Kubernetes operator, that is extracted from satellite functionality that was built originally for Confluent Cloud. And again, that's one of those things like, "Oh, yeah, of course, you'd have to do that." Obviously these are all giant Kubernetes clusters, everybody suspects that. And nobody's surprised to hear that. And we have to build that custom controller to do all that.
Sure, productize it, like all of that makes sense as classical control plane stuff. And there isn't that kind of smack yourself in the forehead realization there. But there is a lot more that is just sort of inherent to Kafka, that are these little tendrils of control plane that even go out to the broker level.
That's a great point, Tim. And they interact with each other in maybe non-obvious ways. Let's just say from ... Kafka can't expand itself, because it doesn't have power over bringing up new VMs, right? That's something Kafka can't do, it as a controller can't necessarily do, unless we taught it about how to do that. But let's just say we're doing that, obviously these control plants have to work together. So the "Outside control plane," which might be the operator could add new nodes, but then the nodes are not useful unless you start putting data on them.
Then now you're back to the Kafka Controller changing where any given leader or partition needs to live, right?. They're very intimate. I think with each, anytime you add functionality, you have to ask yourself, well, does this belong in the Kafka Controller or outside? And then how are they going to synchronize with each other basically. That's largely what the operator product does.
Tell us about the priority on making things declarative in a system like this. How does that as an engineering principle, how did that come about, and what are the positive and negative consequences of it?
Yeah, I think that's a great topic. I think, if you think about simpler setups, if Tim was just running Lamp, Linux, Apache, my SQL PHP on your own home server, you don't need a ton of stuff in your control plane, right? You might just manage it by hand. And then if your app gets a little bigger, and you want it to be repeatable, you might add some scripts in the loop.
But the scripts are kind of a little fuzzy on what to do, they're good at, let's say you're starting with nothing and you use some tools, maybe a chef or puppet or whatever. And you use that to bring up an application into some running state, it's pretty good at going from nothing to having this running state. But once you're in the running state and something weird happens, like one of your servers goes away, now, how do you fix that problem, right?
Generally, it gets hard with those tools to self heal, right? Generally those kinds of tools, like some operator needs to get involved and run chef again or something, right? It's when you get to the scale that I think we're at, or you get to the complexity where things are changing all the time, and you have to deal with these failures. I think that's where you really want a declarative model.
And I think the declarative model more or less came out of Google. They wrote some papers on Borg, in Mega, and Kubernetes, and they define this as what they call, a Reconciliation Control Loop. And basically it consists of, you have a desired state, you kind of know like, "Hey, I'm supposed to have this many brokers say, in the Kafka world, right? I'm supposed to have six brokers right now, and they're supposed to be of a certain capacity."
That's your desired state. And then you have a current state, which the system observes by maybe asking it to be us like what machines I have, or Kubernetes you're asking what pods are alive, that sort of thing. And then you have this reconciliation loop where something that they call a controller is constantly trying to figure out, how can I make current state match actual state? That's actually, I find it, it's one of those things that when someone explains it that way, you're like, "Duh, of course it should work that way," but-
It's actually pretty hard to achieve. And most things don't work that way.
But kind of the Terraform really does work that way.
Exactly, exactly. So I think Terraform was the first one product that I know of that actually worked that way. Like I said, E-Trade had a chef style system, and it didn't work that way. I use chef for many years, didn't really work that way. Terraform is very clear about this model, right? I think maybe they were one of the first ones and then Google, when they made Kubernetes open source, also made this a very kind of a ... Hopefully by now pretty popular pattern for thinking about things.
Likewise, you might not also realize it, but it'll be as cloud formation is also a similar type of tool, and an OpenStack had one called heat. It was the same concept, right? The user would just say, "Here's what I want, I don't really care how you do it, just make it happen." It's got all the same benefits of SQL in the sense that, I think, it tends to be less verbose, because there's a lot of ugly details on making what you want happen, call it [inaudible 00:39:02] APIs.
But I think it also has that property that I mentioned earlier about being self healing. As long as the system understands what it's supposed to be doing, it can work toward healing itself. But if you just have a script, and you fail halfway through a stripped, or something fails after the script has already run, then it's super unclear. How do you get ... What is the desired state? I don't know, you just start the script over and hope that it's idempotent or something, right?
Exactly. And you learn to write them idempotently, but better to do this sort of constant, or the ability to say, "Hey, here's the desired state, here's what's going on in the world. And here's a way to bridge the gap."
Exactly. In Confluent Cloud, we try to use this pattern everywhere, because I tend to think it just leads to the best outcome. But we use a whole variety of tools because there's kind of different parts of the system have different needs. You mentioned Terraform already, we do use Terraform a lot. But it tends to be for things at the bottom, where what I mean by the bottom is say, building up basic networks in AWS, or Google, or Azure, right?
You're making a VPC, or you're setting up all the kind of foundational stuff that you need. And those things tend to be relatively long timescales. Like if the user's asking for a dedicated environment where we have to build a VPC, they're okay with it taking an hour or whatever, right? We can't make the cloud providers go any faster than that anyway. In that sense, Terraform works pretty nicely because Terraform kind of does everything in batches, right?
Users who maybe listeners, sorry, who aren't familiar with Terraform, it's typically run via CLI, and there are two main commands you run. You run Terraform plan, and that checks out the desired state, it checks out the current state of the world, it computes a plan, and it shows it to you. And then you look at it and say, "Yeah, that looks right." And then you run Terraform apply, and then it goes and executes the plan, and updates the state of the world.
But that flow is fundamentally kind of batch oriented, right? It's not really event based. You run it in some kind of loop, right? You just decide to run it every so often, or you run it on every PR merge, into GitHub or something like that, right? We do that for some static parts of our infrastructure, but we also have parts where we need to drive Terraform from an API. For those cases, we actually run Terraform in like a Kubernetes CronJob style, where it just periodically checks desired state, competes the plan, executes.
But again, like I was saying, that's sort of timescales where this batch mode is okay, right? Now, if you think about replacing a broker, when it fails, you want that to be as fast as possible. You really don't probably want this Terraform style for that, right? That's where this declarative control comes in with Kubernetes operator pattern. As mentioned before, Confluent has a operator product, and it's sort of built on what people call this operator pattern, where Kubernetes gives you all the machinery to basically build your own controllers.
If people aren't super familiar with Kubernetes, it comes out of the box with a bunch of primitives that are super useful, pods, and stateful sets, and services, and things like that. But they also give you this really cool way of writing your own, adding your own objects into the system, which is pretty cool. We can define say, a Kafka cluster as a top level object in Kubernetes. And then we can have an operator watch anytime Kafka cluster is added, or changed, or deleted, and then it can take the following actions.
It can do like rolling restarts, or other things that are specific to Kafka in this way. Essentially Kubernetes gives you the machinery so that your desired state is saved in a common way, your current state is saved in the common way. And then you have this event-based API so that when something changes, your controller can decide what to do. You're essentially on the flight, it's kind of a mini Terraform loop, right? You compute a little plan, and then you execute the little plan, and then you go back to waiting, right?
It's just that the executing the plan is done in terms of Kubernetes primitives, not a cloud provider primitives. So you can get performance contracts that are a little more little snappier.
Typically, you're right, you just mess with Kubernetes stuff. If you look internally to Kubernetes, it is messing with the cloud providers, right? When you define a load balancer service in Kubernetes, it has to go talk to AWS load balancer APIs, right? Or Google, or Azure. Under the covers Kubernetes is doing some of the same stuff that Terraform does, but it tends to be that you do this at a different speed, right? You want your stuff to happen pretty fast.
And so we mostly use Kubernetes for that layer of stuff, I guess, the satellite layer. And then if you look one layer up above that at the mothership layer, so what is it doing? Well, it has to basically orchestrate a bunch of sub-services below it, right? To satisfy the customer's request of like, "Hey, I need a whole new environment, and it needs to have private link, and it needs to have this certain cluster size, and stuff." That's where we have this top level thing, called the orchestrator, that is sort of essentially orchestrating all of the services underneath.
We have to say, build a new VPC, and set up all the networking correctly, provision Kubernetes clusters under the covers, provision zookeeper, provision Kafka, provision certificates, all the ... Make sure the ... Set the application up with all the right config. All the things that ... Oh, DNS is another one.
There's basically a ton of things they have to do, and those as I mentioned earlier, since the mothership itself is made up of services, we kind of have this top level layer that's putting them all together, that's the orchestrator. And the interesting thing about orchestrator is, it also follows this declarative controlled model, but it's built on Kafka streams, so you might have-
That's kind of cool.
That right there is a future podcast episode, how that works, that sounds like a fascinating streams application.
I think I agree, we should dig into that more. Just at a high level, it kind of makes sense. When you think about what we talked about before, in the sense of desired state and current state, how would you model that in streams? Well, you have state stores. Yeah, that's cool. And then you need events that trigger reconciliation loops. Well, how do you write event-based applications? Oh, yeah. Kafka streams. Basically as long as your desired state events, and your current state events are in Kafka already, it's a pretty natural fit to be able to write this controller pattern on Kafka streams.
Right. Now, that makes sense. Backing up to our earlier discussion about Distributed Systems Engineering. With all of this in mind, there are certainly people listening who think, "Wow, this sounds like the sort of work I wish I could do, this is where my career should go." Distill your advice to software engineer, who doesn't work on distributed systems, but wants to, what would you say to that person?
I think a great way to get started. At least what I did myself also is, try to contribute to open source. I think that's a really like way that ... There's essentially no barrier to entry. As I mentioned, I started playing around with Kafka in OpenStack, way back when, and then I discovered it didn't have a way of separating advertisers names. Then I ended up making a PR on Kafka and I did something similar for Hadoop back when it was all the rage.
I think even if your current job is not distributed systems, there are ways you can get involved and start learning stuff. And at least for me, I find that the biggest hurdle is the first step. You know what I mean? It's like the first step is scary, you're not sure what to do, but if you just pick something, just be like, okay, just make yourself a small little goal. Like I'm going to build a music playing app on Kafka streams. I just made that up because somebody did that in Confluent.
And then once you get started on that, it'll just keep feeding on itself, right? You'll start asking questions. Well, how do I make this happen? And then I find ... Then you get into this virtuous cycle of you keep learning more, you get involved in the community. And I also mentioned this at the beginning of the podcast. The reason I'm working at Confluent is that I started doing PR against Confluent open source, right? That's a way to kind of get yourself credibility also, and get yourself on the radar.
My guest today has been Roger Hoover. Roger, thanks for being a part of Streaming Audio.
Thank you, Tim. It was great.
Hey, you know what you get for listening to the end, some free Confluent Cloud. Use the promo code 60PDCAST. That's 6-0-P-D-C-A-S-T to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021 and use it within 90 days after activation. And any unused promo value on the expiration date will be forfeit and there are limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast, where ever fine podcasts are sold. And if you subscribed through Apple podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So, thanks for your support and we'll see you next time.
Roger Hoover, one of the first engineers to work on Confluent Cloud, joins Tim Berglund (Senior Director of Developer Experience, Confluent) to chat about the evolution of Confluent Cloud, all the stages that it’s been through, and the lessons he’s learned on the way. He talks through the days before Confluent Platform was created, and how he contributed to Apache Kafka® to run it on OpenStack (the feature used to separate advertised hostnames from the internal hostnames).
The Confluent Cloud control plane is now run in over 40 regions. Under the covers, Roger and his team are managing tens of thousands of resources at the cloud provider layer. This means creating VPCs, VMs, volumes, and DNS records, to manage software artifacts, like what version of Kafka is running and user management. Confluent Cloud is a complex application and distributed system spread across the entire world, but Roger reveals how it's done.
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.Email Us