Is chaos engineering all about breaking things randomly, or breaking things very carefully? Does it have anything to do with you if you're running Kafka, particularly if you're running it in Confluent Cloud, which as we all know you should be? These are all fine questions. And the good news is Tammy Butow and Pat Brennan, both of Gremlin, the company that makes chaos engineering tools, are on the show today to answer them. It's all on this episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Hello and welcome to another episode of Streaming Audio. I'm joined in the virtual studio today by Tammy Butow and Pat Brennan. Tammy is a principal SRE at a company called Gremlin. Pat is a principal architect at that same company. Tammy and Pat, welcome to the show.
Thanks so much. It's great to be here.
Thank you very much.
I want to ask you guys to introduce yourselves certainly, and also to introduce Gremlin. So, Tammy, tell us about you and what a principal SRE does there at Gremlin and maybe a little bit about what Gremlin is.
Sure thing. Yeah. So I'm a principal SRE here at Gremlin, where I primarily work on chaos engineering, which to me is the facilitation of controlled experiments to identify systemic weaknesses. So really this idea of using the scientific method to identify issues and then actually figure out how you can fix them. And Gremlin actually helps engineers do that, because we've built a platform that allows you to inject failure and identify weaknesses and then prioritize what you're going to fix. And prior to working at Gremlin, I've been here for quite a while now, three and a half years, before this I was at Dropbox as an SRE manager. So site reliability engineering, and I was leading databases, block storage, and I was also an incident manager on call for all of Dropbox.com and the desktop client. So that was a really good experience. And prior to that, I was at DigitalOcean as well.
[inaudible 00:02:15] big responsibility looking after 14 data centers. And then before that, the National Australia Bank, also doing a lot of chaos engineering in something that we would call, back in those days in 2009, disaster recovery testing. So if you think of chaos engineering, sometimes it helps folks to think about disaster recovery testing, business continuity planning, region failover; those are the types of words, or terms, I like to throw around to help people understand what it is that we do. And yeah, that's just a little bit about me. A fun fact is I also, in my spare time, run a global movement called Girl Geek Academy, where we're focused on teaching one million women technical skills, and girls too. We teach girls as young as four years old. So it's really fun.
Nice, nice. I didn't know that. There will be a link in the show notes to that, a link of your choosing. So please set us up. Pat, tell us about you.
So yeah, I'm a principal architect here at Gremlin. You know, I've worked in financial services for a number of years. I actually started out my career in financial services, supporting traders, asset management, corporate risk. Worked in the technology industry for such companies as Sun Microsystems, EMC, Red Hat, going in and showing what those technologies can do, how they can address particular business issues that customers have. I've been with Gremlin for about a year and a half. You know, as Tammy said, it's really about finding weaknesses in the environment, realizing not only that these are the weaknesses, but how you can go and remediate those issues, and doing that on a constant basis, because systems change, things evolve, cloud providers go and modify things.
So you can get ahead and really go and avoid downtime, which can result in revenue loss, can result in your customers not being able to get the goods and services that they need. So by going and remediating those issues we help customers accomplish better revenue, better customer satisfaction. You know, a fun-filled fact about me: I actually like to work in my garden a lot during the summer. So very, very excited that spring is here and I can go and start enjoying the fine weather, or just kind of enjoying the peace and quiet of working in the yard.
I'm also a gardener. I live in Denver. So if I were on top of things, I would be getting leafy greens in and peas in soon. I'm never on top of things. Early May as is I [crosstalk 00:05:06].
I used to come to Denver all the time. I actually needed to come back to Denver.
Yeah. It's a great place. And for freeze-intolerant stuff like tomatoes and things like that, it's kind of third week of May. If you try to get ahead of it and go second week of May, you'll get that little snowstorm on the 19th, you know, just to teach you to obey the limits. That's how it is here. But anyway, gardening, good thing. I'm a fan. I'll try to put a gardening link in the show notes. I feel like I should come up with that later. That'll be up to me. Anyway, back to chaos engineering. Let me just tell you kind of my personal history with the term. Maybe it was 10 years ago, it came on my radar as some discipline AWS had developed, where they had scripted ways of systematically breaking things. Well, scripted and systematic are redundant.
You know, so there are programs, these agents, and you get the idea of the cyberpunk thing of them autonomously traversing a network, whatever that even means, and breaking things. And so it's sort of orthogonal to the work of the SREs. By the way, Pat, on the impact of downtime: revenue loss, but also lower life satisfaction for SREs. And so I think that, with respect to the audience of this podcast, that's what we need to think about. Nobody wants that. Right?
No. It's a burnout issue. Right?
And it's loss of institutional knowledge that has a tremendous value. That's kind of hard to quantify when you lose those talented people, because [crosstalk 00:06:52]
Totally. And I have a background as a developer and not as, well, SRE is a new term, but not as an SRE or an Ops person or an admin or whatever term over the course of my career we would have used. And I always sort of feel as a developer like SREs are vaguely disappointed with me anyway, because I probably broke something. And that's a little bit of an old-fashioned way of thinking, but I just don't want them to be sad. I want to create an environment where they're as happy as possible. So chaos engineering, as I look at it, is this set of software that breaks things, and it breaks them in a way that's orthogonal to the work of the SRE. So you don't know, and check me on this.
This is just kind of how I think about it. But again, it's as if this autonomous, cyberpunky kind of thing, and that's not how real life works, is prowling around the network seeking what services it may devour, and then things break. And the trick is, other systems and the culture then have to adapt given that constant set of threat inputs, such that downtime doesn't happen when things break. So, Tammy, you gave a very elegant definition of chaos engineering just a minute ago, and everybody, you should rewind and listen to that again. Does my definition work?
It's what I think of.
Yeah. It's interesting. So I'd say different people do different types of chaos engineering, and the type that you're explaining is more what I see folks do when they're more advanced. So it's usually best, we say at Gremlin, to start with a small blast radius, super controlled experiments, and make sure that you over-communicate: this is my plan, this is the type of failure I'm going to inject, this is how I'm going to inject it, this is when I'm going to inject it. And then do something we call a game day, which also comes from Amazon. Both of our founders at Gremlin are from Amazon. They worked there as engineers, yeah, Kolton and Forni. And they built the chaos engineering platforms at Amazon, and then they decided to go out and create Gremlin as a company.
They also worked at Netflix and a few other companies. And so I like this idea though, of like initially when you start, doing a game day is really good. And it's that thing of, you know, the value of bringing people together. So say, five to 10 people in a room. I've seen massive game days with 70 to 80 engineers in a room. You white board out what it is that you're going to actually be focused on, so the piece of your system, your architecture diagram, and then you pick like, okay, we're going to inject latency from here to here, or we're going to do something quite small. We're going to take down this one service and see what happens to its dependencies, or maybe it injects packet loss, something like that. And that's a better way to usually start.
Then what we see folks do is they move to automation, which is really cool. Right? Because then instead of it happening randomly at some point in time, it's just happening every day, which is so cool, that you're injecting failure every day so you don't drift into failure. And it enables you to do a lot of important things, like validate that your monitoring and alerting always works if something does break. You know? So yeah, that's something that Pat and I think about a lot in terms of the different types of use cases of chaos engineering, and we get to work with a lot of different customers, a lot of companies all over the world, like a lot of huge banks and finance companies, as well as e-commerce.
And those are the types of things that they really care about: start small, but then scale out your practice, because you want it to work really well for your entire organization. And if you've got tens of thousands, hundreds of thousands of machines that you're responsible for with a pretty small team, that's a lot of responsibility in terms of reliability, and maybe downtime costs hundreds of thousands of dollars for five minutes. You know? So that's what you're trying to stop. So it is more frequent failure injection, I think. But yeah, Pat, what would you say there? You've seen lots of different approaches as well.
Yeah. I mean, you know, to that point, Tim, about starting with that small blast radius and progressively increasing it, defining that hypothesis, learning from it, and one of the other things we see is that people learn about dependencies that they don't know they have.
And when we look at that, that is your key to understanding what the dependencies are when we get into DR. So whether you're on-prem or you're dealing with region outages and saying, I want to be able to fail over to a different region in the cloud, I need to understand what my dependencies are to help go and identify that. And to Tammy's point about game days, not only is it a way to get people on the same page, but they bring different levels of expertise around an application, around operational issues, around development issues. And you're bringing them together and it forms collaboration. So now people are collaborating. They're working on a single goal of making the environment more resilient, but they are also learning from one another. So that's to Tammy's point about over-communicating, and that's really, really valuable and something that we really advocate for, which is bringing people together. So not only can they bring their expertise, but they can learn from one another.
So a couple of things, number one, apparently introducing chaos engineering to an organization by writing a script that randomly instigates like a cross-region fail over or something like that, you're saying, without telling anybody bad way to do it?
Yeah. Well, I always say that every time I give a talk. I'm like please don't go into work tomorrow and like take down a whole region just randomly.
They said it would make it anti-fragile. I thought that's how it worked. Right?
Baby steps. Yeah. Yeah. Yeah. And you know, like stressing bones makes them stronger too, but you don't just like crunch them all at once.
Yeah. It's like weightlifting. Right? You don't want to just like lift like the hugest weight. You got to work your way up to it.
Yeah. Well, number one, you won't be able to. Number two, there's an in-between stage there where you will, but you'll get injured, but.
I'm thinking as you guys are talking, and I'm sure you go through this all the time, of the obvious biological metaphors, like exercising a muscle damages the muscle tissue.
And when it heals, it heals stronger. And you know, bones: little bits of stress on bones created by walking around make the bone stronger. And you know, we have a problem with long-term space flight and bone density where the Gremlin isn't running anymore, basically. You don't have the chaos inputs and the system gets weaker. So you're trying to create an environment where people are building systems that respond to small insults by getting stronger.
Yeah. And it seems like, you know, you guys really talked about two systems just then. There's, we'll just call it the network: whatever the computers are and however they're connected and whatever those services are doing and interacting with the outside world. But then there's the people, and getting the people in the room to work on small broken things changes the organization. So reflect on that for a minute. Either of you, jump ball. What do you think about that? [crosstalk 00:14:37]
Yeah. So it's a really interesting thing, because part of it is cultural. Right? So people can go and work on this, or then go work on this, or work on this. And there's a little of that kind of silo effect. Right? But what we're really doing [inaudible 00:14:56], your biggest success, the best success, kind of involves a cultural change. Right? It's getting people to work together, to collaborate on things. That's a huge part of it. And we've been seeing that for a while now. Right? How do we work together? How do we go and collaborate?
Because it's been interesting that COVID has changed a lot of how we work. We work at home. We rely more on the network. Our business models have had to adjust and change. Right? Companies have had to say, okay, I need to focus more online. Right? I need to make sure that my systems are more reliable. Right? So I need to understand things. That requires breaking those silos down and getting people working together. When I have outages, you know, like the AWS region outage last year, I need to make sure that I can recover from them. So a lot of external events have caused people to have to think about, how do I accomplish this? And analyze that.
Yeah. It's super important. It's interesting to do. I love the idea of using game days as a way to share knowledge across an organization and also between different levels. So say if you have folks who have just graduated from college, or they're an intern, maybe they've come through a bootcamp, and they want to gain knowledge. That's such a good way to understand how things work, and also to see more senior engineers role-model good behaviors, like, I know that this section of the system works in this way, but I'm not sure about that section.
Maybe we need to read the code or maybe we need to like dive into that a bit more. Does anyone in the room know? Like just to be able to openly explain, like, I don't know, everything. We're here to figure this out together. That's like really empowering for junior engineers when they join. And another thing. Yeah. I love that so much. Like it's similar to if you have a post-mortem so like a blameless postmortem and you invite in engineers of different levels and they see that it's just a learning experience. Like that's what it's about. And you're just trying to make things better for people and for the systems. Yeah.
Yeah. Yeah. Blame of course impairs, well, individual and organizational learning. But super great point about the more senior, respected, even informal leaders in the engineering organization, when they're openly, sort of shamelessly, being ignorant. Well, I don't know what that code does. Let's look at that.
Because a junior engineer looks at a senior engineer and thinks, well, they know ever- and you know, it's crazy, even for adults still to think this way, but you do it. You know? They know everything. They know all the code. They always have the answers. But modeling ignorance in that kind of collaborative problem solving. Boy, what a great outcome of starting out by saying, we're going to break things.
Yeah. It's really cool. And it like shows you how folks learn something really fast. There's a book that I love called Make It Stick, which is all about how to learn and remember what you learned. [crosstalk 00:18:04].
There's a blurry copy of that [crosstalk 00:18:06].
Oh. You've got that. That's awesome. [crosstalk 00:18:08].
... back there. Yeah. So you can't see it, but it's back there.
And it's just like, you know, giving folks a chance. When you're in a game day with 15 people, you see how 15 different people learn. And some people want to get up and draw if it's in a room, or they want to do it on, say, Miro if it's an online whiteboard. Some folks want to debate. They kind of want to spar a little bit about things, and that's fine too. And it's good to see those different styles. And maybe they'll be like, oh, let me just pull up the code there. Let me look at the commit history and understand why we changed this, when we changed this. Or let me run some commands to look at the system, to understand the observability that we have in place. Yeah.
That's really interesting. How much time do you folks spend on, I mean, Gremlin's a product. It's a thing that you could buy. And I want to talk about how this becomes productized in a minute. Streaming Audio is never about commercials, but we talk about things that you could buy, and it's okay, because we buy things and that's good. But how much of your time do you spend on the organizational stuff? Because it sounds like, okay, we need some software to break things, and now we need to teach people how to learn together when things are broken. Is that a lot of it for you?
A lot of what I do is probably watching folks do that. So for me, it's really cool. I like the idea of working at Gremlin because, you know, I've always done SRE work, but it was like I was doing the same work over and over whenever I worked at a company. You know? Looking after a new team, doing the exact same thing to improve the reliability. And I was able to get like a 10x reduction in incidents using my framework of how to improve reliability. You know? And I can do it in like three months. Okay. Roll that out over and over. But being able to actually shadow folks and see the differences in different industries, reliability does change. How you can improve reliability changes per industry, changes depending on the mixture of senior to junior engineers, how empowered the team is, like how much their leadership believes in them and gives them the ability to actually fix things.
So a lot of what I do is just, I love celebrating people's wins and being there alongside them to see them achieve what they set out to achieve. Like just today, HEB, they're a customer of ours, and they were saying they just finished rolling out curbside pickup, and they used Gremlin to make sure it was reliable. And it's been a huge success. And that's really cool, you know, to help people get their groceries during this time, the pandemic, where things are hard. But I think for me that's why what we do is important for reliability: always relating it back to real people that are out in the world that need us to do what we do every day. So I think it's probably for Pat and I a motivational thing.
And Gremlin the platform, yeah, you know, you can log in. There's a free version as well. So you don't have to pay to get started if you go to Gremlin.com/Free. But there's 12 attacks that come built in. It's very easy to get started and just try it out to learn about failure. But probably one of the things that we want to help people understand is you need to do it in a meaningful way. Think about how you're going to measure success. Think about what the ROI is. I mean, I come from a punk background. Like I love punk rock and NOFX, one of my favorite bands, but you know, you can't be too punk when you're doing it. It's good because you are injecting failure and you're identifying weaknesses and you're doing something that is different from what you would usually do in your career, which makes it really cool and really interesting. But you have to bring people on the journey with you. You don't want to be like playing a punk show and have no one there. You know what I mean? It's like not as fun.
It's not a good look.
Yeah. I mean, you know, it's also kind of connecting that from a business perspective as well, looking and saying, okay, how do we include the business? And, you know, one of the really interesting things is seeing how people address different business situations, the things they discover, the things that surprise them, the things they're able to remediate. You know, it's interesting when you take microservices: I didn't know this level of latency would cause this problem. Or, you know, we've had customers who have discovered bugs in their software from daylight savings time. Customers have issues all the time with daylight savings time, and they can go and test it and remediate those issues. And that's fun, when customers are like, oh wow, I tried this and I was able to identify this issue. I'm not going to have this issue anymore.
That's really rewarding, because they're actually learning stuff that they didn't expect, things that surprised them. So, to your point about the exercising, Tim, and stressing things out, they can go and see what happens when we have those things that affect us versus those things that happen to us. Right? Both can cause outages and problems for us, but we can analyze and say, sometimes external events happen. How can we deal with that? How can we make sure that we're more resilient to that? And that's really very enjoyable, because you're helping, because different customers have different issues. Right? And they want to make sure that their applications, like Confluent, have their parameters and settings tuned given potential infrastructure outages, and can always make sure it's working, whether it's a trading application, credit card fraud detection, insurance, et cetera.
Absolutely. So two more questions. We're talking about concepts, but I'm kind of curious, what form does the software take? How does Gremlin work? And like I said, don't be afraid of talking about a thing that you can buy. It's awesome that there's a free tier. That's great. But how does it work? And tie it into, I think this is somewhat obvious, and this is a podcast about Confluent, Kafka, and the cloud, as I say. And so tie it into Kafka. Like, what if I'm using Confluent Cloud? How does Gremlin impact my life? How do I use it for my Confluent Cloud application that has other services elsewhere, on-prem and in the cloud? Just how does it work? And then how does it work with Kafka?
Yeah. So, you know, dependencies, you could look at dependencies like we touched on earlier. Right? What happens if this service can't talk to this service, an external service, for example? What does that do to the application? What does that do to the business requirements that they need to meet, the SLAs they need to meet, the services that they are obligated to provide for their customers?
So my microservices in a managed Kubernetes cluster somewhere, all integrated through Confluent Cloud, expecting to produce and consume from those clusters.
Right. Because, you know, I may have increased load. So how do I make sure that that increased load can be handled properly? Right? What if I have increased latency? What if I have packet loss? So when we think about those network attacks, those are the things that happen around us, whether you're talking about packet loss, latency, or a service just vanishing. You know, how do I go and deal with that? How do I make sure that I can handle those things? Right? You know, COVID-19 kind of created a spike in certain services, as Tammy was outlining. Right? You know, food delivery, HEB, and other ones kind of went through the roof. So, you know, while a lot of businesses hurt, we saw a customer saying, my business has gone up 300% month over month. I may need to make sure that I can scale and work on things so that people can get the services they need, because they can't go to restaurants. They can't go to their bars, et cetera. But they got deliveries.
Yes. And there's been a lot of that at my house.
Yeah. And then as it relates to, say, just Kafka on-premise, right, there are a whole number of pieces there. Right? I might go and say, okay, I have a broker that failed. Well, I want to make sure that my settings, my min.insync.replicas, are set properly to allow X number of brokers to fail but still be able to send the data from my producer to X number of consumers.
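A quick sketch of the broker-failure math Pat is describing. This is illustrative Python, not Gremlin or Kafka code: given a topic's replication.factor and min.insync.replicas, it computes how many brokers backing a partition can fail while a producer using acks=all can still write.

```python
def tolerable_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """How many replicas of a partition can be down while a producer
    using acks=all can still write successfully."""
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed replication.factor")
    return replication_factor - min_insync_replicas

# A common production setup, 3 replicas with min.insync.replicas=2,
# survives the loss of exactly one broker per partition.
print(tolerable_broker_failures(3, 2))  # 1
```

With acks=all, the producer only gets an acknowledgment once min.insync.replicas replicas have the write, which is what makes this arithmetic hold, and it's exactly the setting a broker-shutdown experiment validates.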
Or, you know, what if I increase CPU or disk I/O on my brokers? How does that affect end-to-end latency at the 99th or 95th percentile? Because I need to make sure that certain transactions get to certain places in a certain amount of time. So I want to look at that and see what I need to do in order to maintain that.
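Checking the effect at the 99th or 95th percentile comes down to computing percentiles over the end-to-end latencies observed during the attack. A minimal nearest-rank sketch in plain Python (no Gremlin or Kafka API assumed; the sample values are made up):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value that is >= pct% of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling of pct/100 * n gives the 1-based rank into the sorted list.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 14]  # one slow outlier
print(percentile(latencies_ms, 95))  # 250: the outlier owns the tail
print(percentile(latencies_ms, 50))  # 14: the median barely notices it
```

The nearest-rank method is coarse (a real analysis might interpolate), but the shape of the result is the point: a single slow outlier dominates the p95/p99 while leaving the median untouched.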
And in that self-managed case, which still applies to plenty of people, Gremlin allows me to simulate, to cause those kinds of conditions. Is it an agent? I mean, I'm just kind of wondering, physically, what is this thing?
Yeah. So there is an agent, so on the systems, right, on the systems, there's an agent that gets installed. On Kubernetes clusters there's a DaemonSet that gets installed.
Yep. And then we also have another product called [inaudible 00:27:32] which is application-level failure injection, where you can use our library to inject failure. So yeah, there are quite a number of different ways to inject failure with Gremlin. I think one of the common use cases, Pat's story made me think about it: during the pandemic, auto scaling was a really big focus for folks, because they had to scale like they've never scaled before. And it's really, really good to test that before you actually have to be in that situation, to make sure that you can go up and you can come back down and everything works as expected.
So you can do that with Gremlin as well. You can trigger auto scaling to occur by injecting a CPU attack. That's an example of an attack. So you can spike CPU, watch that the auto-scaling occurs as expected, and watch that it then goes back down when you turn the CPU attack back off. And then you'll see lots of issues come up because of that, and not only related to auto-scaling configuration, but dependencies, monitoring, observability, backups. Just all of these different types of issues will come up. I've never ever seen someone do chaos engineering and not identify issues, actually, which is surprising. I've been doing it for such a long time now. You know? It's like, wow, just every single time.
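At its core, a CPU attack is just saturating cores for a bounded time so the autoscaler's metrics cross their thresholds. A toy, single-core version (Gremlin's real attack ramps to a target utilization across cores and is far more controlled than this) might look like:

```python
import time

def cpu_spike(seconds: float) -> int:
    """Busy-loop for `seconds`, returning how many iterations ran.
    Saturates one core; a real attack would fan out per core and
    target a utilization percentage rather than pinning at 100%."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pure spin: keeps the core busy
    return iterations

start = time.monotonic()
n = cpu_spike(0.2)
print(f"{n} iterations in {time.monotonic() - start:.2f}s")
```

The point of running it as an attack rather than waiting for real load is exactly what Tammy says: you watch scale-out happen, then verify scale-in when the attack stops.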
If you haven't tested your auto scaling and you do this for the first time it's going to be-
Yeah, and auto scaling is a great example. So whether you're doing, you know, system auto scaling. Right? So we look at trading applications. Right? During the pandemic we saw 2,000-point swings in the market in one day, so we really want to adjust our auto scaling policies so that we scale out faster and scale in slower, because of the immense program trading that was going on that was causing these wild swings. So, you know, that's a really interesting thing. But also when we look at it more from a Kubernetes perspective. Right? Horizontal pod auto scaling. Right? Because I might want to do horizontal pod auto scaling within my Kubernetes environment, and I might also want to do that in combination with system auto scaling. So our platform can do both, so that you can look at it from a very holistic perspective.
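The "scale out faster, scale in slower" policy Pat mentions can be sketched as an asymmetric decision rule: an aggressive step-up when hot, and a single-step scale-in gated behind a long cooldown. This is an illustration only, not any provider's actual autoscaler logic; the thresholds and cooldown are made-up defaults.

```python
def scaling_decision(cpu_pct, replicas, last_scale_in_s,
                     out_threshold=70.0, in_threshold=30.0,
                     scale_in_cooldown_s=600):
    """Return the desired replica count.

    Scale OUT immediately when CPU is hot (add 50% capacity at once);
    scale IN by one replica, and only after a long cooldown, so wild
    swings don't whipsaw the fleet."""
    if cpu_pct > out_threshold:
        return replicas + max(1, replicas // 2)        # aggressive out
    if cpu_pct < in_threshold and last_scale_in_s >= scale_in_cooldown_s:
        return max(1, replicas - 1)                    # cautious in
    return replicas

print(scaling_decision(85.0, 4, last_scale_in_s=0))    # 6: hot, scale out now
print(scaling_decision(20.0, 4, last_scale_in_s=120))  # 4: cool, but in cooldown
print(scaling_decision(20.0, 4, last_scale_in_s=900))  # 3: cooldown passed
```

A CPU attack is precisely what exercises both branches of a rule like this: the spike should trigger the fast scale-out, and ending the attack should eventually walk replicas back down.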
Yes. Yes. I see that. And then it doesn't matter, like from the standpoint of Kafka, whether you're using the fully managed cloud. That becomes now a dependency, I think, in your language, and availability and latency, and just, can I talk to the thing? The internals aren't the problem, because it's a cloud service. You're not supposed to have a problem. But then if you are managing it yourself, you've got all the dials on all those individual things.
Yeah. And that's something really common that we have noticed over the last few years: folks are relying on a lot of really critical services, but sometimes they'll have issues on their side that stop them from being able to communicate with these really critical services, or they don't have a failover plan in case something goes wrong, or even an alert that there's an issue communicating. So that's what we do at Gremlin a lot as well: make sure that those critical paths are always clear. You know? You just want to make sure that you're looking at that. That's my biggest tip as an SRE. It's like, always be watching. And also, you don't have to hire SREs to think like this. Everyone can think like an SRE. It's talking about exactly what we talked about today in the podcast. Yeah.
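One common remediation for the failure mode Tammy describes, a critical dependency you can't reach and no fallback for it, is a retry-with-backoff wrapper that degrades to a fallback instead of failing outright. A hedged sketch; the function names here are illustrative, not a real Gremlin API:

```python
import time

def call_with_fallback(primary, fallback, attempts=3, base_delay=0.05):
    """Try `primary` a few times with exponential backoff; if it keeps
    failing, fall back rather than taking the whole request path down."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 50ms, 100ms, ...
    return fallback()

# Simulated flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("dependency unreachable")
    return "live response"

print(call_with_fallback(flaky, lambda: "cached response"))  # live response
```

Injecting a blackhole or latency attack on the dependency is how you verify that code like this actually takes the fallback path, and that an alert fires when it does.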
Everybody can easily learn how to think like one. It's expensive to learn how to do like one, because there are a lot of specific skills there, but that's the thing that this automates. And, yeah, I appreciate you pointing that out. Far be it from any of us to suggest that the availability problem would be in Confluent Cloud. Obviously it's just going to be on your side getting through to it. Right? I mean, that's crazy to think that that would happen [crosstalk 00:31:18] a cloud provider, could be a cloud provider.
I've definitely seen a lot of issues with cloud providers. I would say and that's the world that we live in. It's not easy to set up cross-region replication yet. There's just a lot of things that aren't easy.
Nope. And the funny thing is they're part of a story we've been telling for 15 years now.
Yes. I know. Right?
And it's still this delicate thing, a few experts can engineer if they try real hard.
But it's getting easier. It's getting to be more commonplace.
Yeah. It really requires looking at the holistic picture and figuring those things out, what those external dependencies are. So you take Confluent Cloud. Right? I may want to say, I have a producer on-prem that does something. Right? I want to see what adding latency at the producer end means for the data all the way at the other end, at the consumer end. It has to go to Confluent Cloud, right, from the on-prem producer. Right? So what is the impact of adding latency at the producer level? We see customers wanting to do that, add latency and then see what the effect is on those consumers, whether that's one or two, et cetera. So there are a number of places that you can inject latency or various other things, from point of creation to point of absorption.
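The producer-latency experiment Pat describes can be reasoned about with a very simple model: delay injected at the producer is paid on top of every hop between it and the consumer, so every downstream consumer sees it. An illustrative simulation (not Gremlin's implementation; the hop numbers are invented):

```python
def end_to_end_latency_ms(hop_latencies_ms, injected_producer_delay_ms=0.0):
    """Total producer-to-consumer latency for one message: the delay
    injected at the producer plus every hop on the path (producer ->
    cluster -> consumer). Latency added at the source is inherited by
    every consumer downstream."""
    return injected_producer_delay_ms + sum(hop_latencies_ms)

# on-prem producer -> cloud cluster (8ms) -> broker handling (5ms) -> consumer (7ms)
baseline = end_to_end_latency_ms([8.0, 5.0, 7.0])
attacked = end_to_end_latency_ms([8.0, 5.0, 7.0], injected_producer_delay_ms=100.0)
print(baseline, attacked)  # 20.0 120.0
```

The interesting question the experiment answers is not this arithmetic but what breaks because of it: consumer lag growth, timeouts, and whether your SLA at the consumer end still holds.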
This all sounds pretty cool, Tammy, you gave a URL that I think was the free tier, but if folks want to check it out, what should they do?
Yeah. Just go along to Gremlin.com/Free and you can check out Gremlin. Give it a go. You'll have full access to all of our different attack types, and, yeah, let us know how you go. I'm on Twitter, TammyXBryant, and you can actually find me there, and Pat's also on Twitter, if you want to reach out to us. And also, actually, if you want to chat to other folks that are doing chaos engineering and SRE work, we have almost 8,000 people in our Slack. So if you go to Gremlin.com/Slack, you can join our community and meet other like-minded folk who are interested in identifying weaknesses by injecting failure and improving reliability, which is lots of fun.
All those links will be in the show notes. My guests today have been Tammy Butow and Pat Brennan. Tammy and Pat, thanks for being a part of Streaming Audio.
Oh. Thank you very much.
And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST—that's 60PDCAST—to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes. If you'd like to sign up and while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So thanks for your support, and we'll see you next time.
The most secure clusters aren't built on the hope that they'll never break. They are the clusters that are broken on purpose, with a specific goal. When organizations want to avoid systemic weaknesses, chaos engineering with Apache Kafka® is the route to go.
Your system is only as reliable as its highest point of vulnerability. Patrick Brennan (Principal Architect) and Tammy Butow (Principal SRE) from Gremlin discuss how they do their own chaos engineering to manage and resolve high-severity incidents across the company. But why would an engineer break things when they would have to fix them? Brennan explains that finding weaknesses in the cloud environment helps Gremlin avoid downtime, protect revenue, and keep customers getting the goods and services they need.
Chaos engineering is all about experimenting with injecting failure directly into the clusters on the cloud. The key is to start with a small blast radius and then scale as needed. It is critical that SREs have a plan for failure and then practice an intense communication methodology with the development team. This plan has to be detailed and includes precise diagramming so that nothing in the chaos engineering process is an anomaly. Once the process is confirmed, SREs can automate it, and nothing about it is random.
When something breaks or you find a vulnerability, it only helps the overall network become stronger. This becomes a way to problem-solve collaboratively across engineering teams. Chaos engineering makes it easier for SRE and development teams to do their jobs, and it helps the organization promote security and reliability to their customers. With Kafka, companies don't have to wait for an issue to happen. They can create their own disorder within microservices in the cloud and fix vulnerabilities before anything catastrophic happens.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.