November 2, 2020 | Episode 127

Distributed Systems Engineering with Apache Kafka ft. Apurva Mehta

Transcript

Tim Berglund (00:00):

There are a lot of specialties within the very broad vocation of software engineering. And all of them are hard to do. Distributed Systems engineering is one corner of the discipline that poses a particular set of challenges. What's it like to build a distributed system? What special problems arise? How do you land a job doing it? And that's the conversation on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the Cloud.

Tim Berglund (00:33):

Hello and welcome to another episode of Streaming Audio. I am, as usual (as per the "uzhe," as I guess some of the kids say), your host, Tim Berglund, and I'm joined here in the virtual studio today by Apurva Mehta. Apurva, welcome to the show.

Apurva Mehta (00:51):

Hi, Tim. Thanks for having me.

Tim Berglund (00:52):

You bet. Now, Apurva works on Kafka Streams and ksqlDB, on that team. And as a part of an ongoing series, I wanted to invite him on the show to talk a little bit about that stuff, but more specifically to talk about being an engineer who works on distributed systems. So Apurva, what do you think?

Apurva Mehta (01:17):

Yeah, sounds like a great topic. Excited to-

Tim Berglund (01:19):

And have you had anything [crosstalk 00:01:20]. What's that?

Apurva Mehta (01:20):

Excited that I have an audience to talk about this with.

Tim Berglund (01:28):

It's funny you should say that. That's me all the time: excited that I have an audience. I think maybe that'll be on my tombstone: "Excited that he had an audience." Would you add anything to that introduction? You work on Kafka Streams and ksqlDB; anything else you'd say about your actual role currently, here, right now?

Apurva Mehta (01:49):

Yeah, that's a good summary. I just manage those teams nowadays, so I'm not quite an engineer anymore. But I was an engineer at Confluent for my first two years here; I've been here four years now. In my first two years I worked on Kafka Core, where I helped add exactly-once semantics. Some of the other luminaries on the show, I think Jason and Guozhang, and me were the primaries on that.

Apurva Mehta (02:14):

And yeah, I worked with them. It was lots and lots of fun, and I learned a lot about Kafka in those two years, a very compressed time. Then I moved on to KSQL, and now I kind of manage both the Kafka Streams and ksqlDB teams. And yeah, that's the only thing I'd add.

Tim Berglund (02:31):

Cool. What makes streams ... I guess, when you're working on it as an engineer, and by the way, I will definitely want to talk about the distinctions between your role as an engineer and your role as an engineering manager. I appreciate you saying, "I'm not an engineer anymore," since you are an engineering manager. Those are different things, but we'll get into that. But when you were an individual contributor on Kafka in your first two years, what attracted you to that work?

Apurva Mehta (03:04):

Right. I think of Kafka as a whole, if you count Kafka Streams, the actual clients, and the brokers all as part of Kafka. I worked on all those things. And the topic is distributed systems and the engineering challenges of distributed systems. I think Kafka kind of ... For Kafka and Kafka Streams both, fundamentally the design goals for those systems are commodity boxes, high amount kind of ... Especially now in the public cloud world, not very reliable networks, and not very predictable performance when it comes to disk or other IO systems.

Apurva Mehta (03:46):

But still promising, especially for something like Kafka, very high guarantees in terms of availability, throughput predictability, and latency predictability, because a lot of systems are actually built assuming little variability, right? If Kafka starts sending you high-latency responses, if your producers start backing up, your application now kind of gets blocked on producing.

Apurva Mehta (04:12):

Then, based on what you're doing, your whole middle tier may crash. The other things that depend on you may crash, and you can have a massively cascading failure up the stack in your organization. That does actually happen. I used to work at LinkedIn, and with some of these foundational systems there, whenever they departed from their very narrow bounds, the whole system on top kind of seized up.

Apurva Mehta (04:35):

I think, coming back to your question, it's building this reliability, building this robustness, on top of basically unreliable components. And then it's on commodity hardware, and that's where the distributed systems part comes from. These are not very big boxes. All of the work has to be spread, all the state has to be spread, across many such units or nodes of computing.

Apurva Mehta (05:05):

And solving for that, I think, is intellectually hard. Some of the best engineers in this space, I think, work on Kafka. But thinking in terms of distributed systems, when you don't know what you don't know, and designing with that constraint is intellectually extremely hard. And when you get it right, it's very satisfying.

Apurva Mehta (05:29):

And then engineering-wise it's also extremely hard. There's a whole other spectrum of very good systems engineers who can optimize the last bits of performance, who understand really, really well how the machines work and how various cloud systems work. And it's kind of bridging the intellectual, abstract, pure computer science problems with engineering problems. I think Kafka Streams, Kafka, it all kind of checks those boxes, right? That's what really makes it interesting to work on.

Tim Berglund (06:02):

That makes a lot of sense. I like how you talk about a system and I guess it's really ... It's not just that it's distributed, but it's infrastructure and most infrastructure is distributed these days. But Kafka is infrastructure, and so it has to provide, we'll say a very small standard deviation, in terms of some of the things that quantify its contract like latency.

Tim Berglund (06:31):

And when you don't meet that, I was thinking, well, of course, nobody would ever build a system that was so brittle that if Kafka slowed down, it crashed. But well, no, pretty much all of us would. And that's why you have to work really hard to make infrastructure not have that variability to be extremely predictable.

Tim Berglund (06:50):

And of course, that's never perfect either, but you'd like to impose catastrophic failures on people as little as possible. And you really have to approach this with the understanding that people are going to make that assumption of infrastructure, even though it's a "Bad architecture." It's going to happen.

Apurva Mehta (07:10):

Yeah, absolutely. And also, implicitly over time, even if you don't advertise a certain contract, if you are always within a certain bound, people will start assuming it and even build for it, and not for anything outside of it, and that's when the problems happen.

Tim Berglund (07:24):

Right, right. The fact that you deliver on low variability so frequently kind of rewards that behavior: people optimize for a low probability of variation, and they don't get a lot of variation. And so they never think about the probability of catastrophe and the consequences of catastrophically different latency, and then things just die at that point.

Tim Berglund (07:46):

Thank you for doing that. You mentioned some differences between a distributed systems engineer and say a full-stack engineer. And every time I do one of these episodes, I really want to be clear that, I'm asking this question, I'm really interested in your thoughts, I'd like to dig into it. This isn't any kind of like hierarchy where the distributed systems engineers are the cool ones, and the full-stack engineers are the lame ones.

Tim Berglund (08:11):

And if all you do is React and JavaScript, you're not a real developer. There's never that; I don't buy that kind of thing. It's not how I look at the calling of software development, but we still have different disciplines. And it's interesting to dig into how they're different. And so how is distributed systems engineering different? And warning: I'm going to ask you about engineering management next. Just talk about the old IC life, and we'll get to management next.

Apurva Mehta (08:41):

Sure. I think, no, I totally agree with you, first of all, obviously. It's not a popularity contest, it's not like ... There's no objectively better discipline or best discipline; it's all equally fun and challenging in its own way. So I totally get that. With that said, I think distributed systems, I think especially ... I think, I'll be honest, I have not actually worked on front ends or full-stack much.

Apurva Mehta (09:07):

I guess I don't have that point of view to be able to compare necessarily the challenges there. And as you said, it's a totally different discipline. It has its own set of challenges. I can't really get into the specific differences from the full-stack point of view, but I can at least comment on what is challenging and, I presume, kind of unique about distributed systems. And we can just go with that.

Apurva Mehta (09:36):

I think I kind of touched on it previously as well. If you think about distributed systems, it's like you're working across multiple commodity computing units connected by sometimes not very reliable networks. That kind of makes distributed ... And that's really the core of what distribution is like. Distributed computation around [inaudible 00:10:01].

Apurva Mehta (10:02):

That's kind of half of designing protocols and designing ... When you think about consensus algorithms, it's about arriving at a shared view of what the current state of the world is. For Kafka, who is the leader of a given partition, right? For Kafka Streams, who is the coordinator of the group? Who can get members in and out? All of that is a distributed consensus problem, which kind of has many different flavors.

Apurva Mehta (10:33):

And as a distributed systems engineer, at some point you have to really understand them. You may not actually build them anymore, since these things have kind of gotten more commoditized, but you have to understand the complexities of these things and what's possible. They also have failure modes, like split-brain problems, where two groups have different views of the world because they have a network partition between them.

Apurva Mehta (11:00):

And how do you handle that? [inaudible 00:11:01] that would yield consistently incorrect results, or detect and stop, right? I think those are the kinds of second-degree, third-degree, fourth-degree thinking. Like, what are the scenarios? You have to become very fluent in that to build the systems, or even to modify, extend, or develop the systems further, right?
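
As a rough illustration of the "detect and stop" option for split-brain, one common approach is a majority-quorum rule: a side of a network partition may only act if it can reach a strict majority of the cluster. The sketch below is a generic toy under that assumption, not Kafka's actual leader-election mechanism, and the function names are hypothetical.

```python
def can_elect_leader(reachable_nodes, cluster_size):
    """A side of a partition may proceed only if it sees a strict majority."""
    return reachable_nodes > cluster_size // 2


def partition_outcomes(cluster_size, side_a):
    """Split the cluster into two sides and report whether each side may proceed."""
    side_b = cluster_size - side_a
    return (
        can_elect_leader(side_a, cluster_size),
        can_elect_leader(side_b, cluster_size),
    )


if __name__ == "__main__":
    # 5 nodes split 3/2: only the 3-node side may elect a leader.
    print(partition_outcomes(5, 3))
    # 4 nodes split 2/2: neither side has a majority, so both detect and stop.
    print(partition_outcomes(4, 2))
```

Because two disjoint sides can never both hold a strict majority, at most one side keeps making decisions; the cost is that an even split leaves the whole system unavailable until the partition heals.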

Apurva Mehta (11:22):

Not everyone's going to write their own consensus algorithm, but everyone in this field, as an engineer, if you spend any amount of time, has to develop the mindset of thinking: if my network links are broken, if half of my cluster has a certain view and half doesn't, and I'm making this change, what's going to happen, right?

Apurva Mehta (11:41):

You should be able to extend your thought and have that imagination. I think that's kind of the keyword, right? It's kind of like multi-threaded programming: you cannot actually foresee all ends. You can't foresee all interleavings of your program, but you have to push yourself so that the abstractions and your algorithms are elegant enough to cover the things you don't foresee, right?

Apurva Mehta (12:04):

And that's kind of a discipline, a skill, that only comes from experience and only comes from practice. But I think that's kind of unique, or at least that's challenging, for distributed systems. And then, I think, that's part of it: the abstract part just takes years. And the best engineers kind of have a knack; not everyone has it, right?

Apurva Mehta (12:26):

It's also not just like this talent to be able to think multidimensionally like that. And then I think the other part is that systems like Kafka, systems like Kafka Streams and ksqlDB, are all stateful systems. They all maintain replicated state, and the main thing there is that state has weight, in the sense that it's hard to ... If you lose a 10-terabyte Kafka broker or whatever, right? To rebootstrap it, to recreate it, just takes time.

Apurva Mehta (12:59):

There's a hard baseline on how long it's going to take to get it back up. I think thinking about that, and designing around the fact that moving state and maintaining state is expensive, matters. Getting it corrupted is catastrophic, right? That's its own set of challenges; you have to be super conservative. You've got to assume the worst is going to happen at all times.
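
To put a rough number on that baseline: the 10-terabyte figure comes from the conversation, while the sustained copy throughput below is purely an assumed value for illustration. A minimal back-of-the-envelope sketch:

```python
TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)


def min_restore_hours(state_bytes, throughput_bytes_per_sec):
    """Lower bound on restore time: state size divided by sustained throughput."""
    return state_bytes / throughput_bytes_per_sec / 3600


if __name__ == "__main__":
    # Assumption for illustration: the replacement broker can ingest 100 MB/s.
    hours = min_restore_hours(10 * TB, 100 * MB)
    print(f"{hours:.1f} hours")  # 27.8 hours
```

No amount of clever software removes this floor; only more bandwidth or less state to move does, which is why the design has to tolerate the recovery window rather than wish it away.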

Apurva Mehta (13:28):

And design for that. I think, again, that goes back to the tolerances, and sometimes it's data integrity, or availability. If you have to rebootstrap it and you're unavailable for the whole time, that's a really bad system, right? That's what makes it challenging from that perspective. You just cannot work around this, right? If you lose a hard disk, or you lose two different hard disks on two different machines, it's going to take time to recover.

Apurva Mehta (13:54):

And how is your system built to tolerate that, right? I think those are very interesting questions, especially as you want to extend these systems and add features; you don't want to violate certain principles that the users depend on. And then the last thing, I think, again, is pure engineering. None of these systems are academic systems.

Apurva Mehta (14:13):

They're all deployed in the real world, and in the real world they satisfy certain business needs. Cost is always a factor, and there's a premium on efficiency. And that is its own discipline; that's not quite distributed systems, that's just systems engineering, I think. And again, if you're working on a system like Kafka or Kafka Streams, it means having that mindset of really, really understanding what your machine is doing, understanding where the bottlenecks are, and being able to design improvements to address the most inefficient parts of the system.

Apurva Mehta (14:48):

I think that's its own discipline, right? As an engineer on these teams, on these kinds of projects, you have exposure to all these kinds of hard problems that I feel are fairly unique. And as compared to full-stack, I think, definitely, you move slower. Every decision we take especially at an architectural level is expensive. If you get it wrong, the cost is going to be felt in a big way at some point in the future.

Apurva Mehta (15:19):

You tend to think long before you move, right? Versus if you are working on a stateless system and you make a mistake or you ... Generally, I think in full-stack you can iterate much faster, right? You can get something out to users in a day. You can probably never get a feature out to users that quickly in any kind of Kafka deployment, right?

Apurva Mehta (15:41):

I think that's one of the big differences, right? It involves a different type of aptitude, in some sense, or mindset. But I think those are the things which are unique, right? I think the premium on quality, though quality obviously exists at multiple levels, the premium on long-term thinking, the premium on very thorough, data-driven design is very high in distributed systems, whereas maybe in full-stack the cost of a mistake is low. And so you move differently, right? You're still delivering high quality, but over the long ... But you can still move faster while doing it.

Tim Berglund (16:19):

It's interesting that you say none of these systems is academic, that they're all in use, and that's probably ... That's definitely true of anything you're going to get a job doing that's not in a university lab, right? There are university labs that will pay people to build things, but I've always thought that distributed systems was a more academic software discipline.

Tim Berglund (16:41):

You are more likely to need to and want to read papers on things as a distributed systems engineer and keep up with papers to be successful. It's still absolutely granting your point that especially when it comes to Kafka, several people in the world use it for commercial purposes, and there's money on the table. But what do you think of that? What do you think of the academic angle from that perspective?

Apurva Mehta (17:11):

No, absolutely. I should have said they're not only academic. I think, especially if you get a job in industry working on these systems, you're not looking at that ... I mean, it's not academic at that point, but I do agree with what you're saying. I think that's definitely right. Even at Confluent, we have a papers-we-love channel, where people are posting a lot of different new research, whether in stream processing or in distributed systems: new consensus algorithms, new database technologies, all of that.

Apurva Mehta (17:43):

I think this is definitely a group of people who ... I think the thing with academics is that it's also about understanding the problems other people have faced and being able to incorporate that. The best engineers, that's what they think about, right? I remember, how do you prove a particular replication group protocol is correct? Right?

Apurva Mehta (18:04):

I think that is something which is academic research, right? I think there's a TLA+ offering book or something. I think one of my engineers kind of applied it to the Kafka protocols and found bugs. That's like a very concrete example of taking an academic construct, applying it to a real world thing, and finding real world impact.

Apurva Mehta (18:25):

And definitely, in some sense it's like a cross section of, or the intersection of, applied computer science and practice for those who work as distributed systems engineers. That's really what it is, right? Being on top of the latest research is definitely a value add. And it can even have an impact today. In some sense you're right: it's as close to applied computer science as you can probably get today. It's not purely academic, and it's not purely just taking ideas and implementing them, right? There's a lot of innovation happening in industry.

Tim Berglund (19:00):

And I wasn't really thinking about distributed systems when I was a young engineer, just getting started in my career out of school. But I thought about some things that, I think, might apply here a little bit. I worked on firmware; I was a firmware developer, and most of the systems I worked on were somewhere relatively close to telecommunications, just some sort of proximity to telecom.

Tim Berglund (19:28):

And I wrote a lot of just simple firmware for 8-bit and 16-bit microcontrollers that did fairly simple-seeming things, but you get some distributed-systems-like problems in there, like, well, concurrency. This was in C in the '90s, and actors, and interesting synchronization, and any synchronization primitive more complex than a semaphore wasn't really a thing in the little real-time kernels we used back then.

Tim Berglund (19:56):

How do you get good at it, right? You had to just be able to think about threads and concurrency, and what the processor was doing with shared pieces of state. And so there are some parallels there. And I remember thinking then that the processes that underlie the software I was writing, that microcontroller code, tended to be mathematical in nature. I wasn't writing any direct DSP code or anything, but usually some control code that was close to that part of the system.

Tim Berglund (20:33):

And so what you're automating is already some sort of fairly primitive mathematical abstraction. There could be a Z-transform, or a phase-locked loop, or some ... Those could actually go together; there could be a phase-locked loop around that discrete system. But there are these sorts of processes that you're dealing with, and so the code that you're writing is basically already an abstraction over some math in some simple way.

Tim Berglund (21:04):

And when you think about front end code, by the way, this is my observation on the difference between the front end and distributed engineering. Not just front end, but any sort of business software, which most of my time as a Java Developer was as a Full-stack Java Developer doing some sort of business software or other.

Tim Berglund (21:22):

And in that world, rather than writing code that creates some layer of abstraction on top of ... I'll just say math, it's some layer of abstraction on top of business. And what's interesting here is, there's the kind of complexity that you've just been talking about, and academic papers that it's a good idea to read and understand, and they are literally mathematical and very abstract.

Tim Berglund (21:52):

And there's some math that you probably didn't learn in high school that you might have to master to be able to follow some of them. And so they seem like they're this super difficult mathematical thing and they are. You work with a lot of smart people who are capable of doing that, but the interesting thing is, I'll just say business software, the kinds of applications that people build on top of the infrastructure that you make.

Tim Berglund (22:13):

The systems underneath those are actually quite a bit more complex. It's pretty hard to think about the math behind, I think you mentioned TLA+ before. And we'll have a link in the show notes, because I'm blanking on the name of the episode right now, but yes, there has been an episode on that on Streaming Audio. There'll be a link to that in the show notes.

Tim Berglund (22:37):

And that's a formal validation system where there's some interesting mathematics that might be difficult to crunch through. But crunching through the stuff underneath business software doesn't even have the courtesy of being math. It's just a bunch of crazy stuff that people decided to do together. You're automating a business process, and the kind of bummer about business processes is that they're just not mathematical in nature.

Tim Berglund (23:00):

And so when you try to write code to automate them, you end up with the weirdness of business software. When you try to write code to automate event streaming, it's still super difficult, but at least you have the comfort of there being math there.

Apurva Mehta (23:18):

I agree. I think that's true. I think I maybe have a slightly different view on what you're saying. Maybe it's true, what you're saying, that a business process cannot be formalized or cannot be reduced. Basically, that's what you're talking about when you say there's math underlying something: you've been able to abstract the theory into a formalism, which you can then leverage to design your algorithms or build your systems and reason about them in a more complete way, right?

Apurva Mehta (23:47):

That's what it gets you. And it's like testing can only reveal the presence of bugs, but never the absence. And then to show the absence, you kind of have to prove the correctness of your algorithms, right? And that's kind of where mathematical formalism has entered computer programming, right?

Apurva Mehta (24:05):

What I think is not so much that business processes cannot be formalized. I think they probably can. And I think what you'd find is that it's probably very simple and not worth formalizing, because your intuition is good enough to reason that it's correct. And then the cost of failure may not be that high. Versus, you mentioned concurrent programming, and I also mentioned it in the past, I called it multi-threaded programming, but essentially you know that all the interleavings of two concurrent threads are infinite, right?

Apurva Mehta (24:35):

And then intuition fails you, and it's the same thing with distributed systems. The failure modes are infinite and your intuition is not good enough, right? You just cannot actually reason about that space. That's when you actually need some sort of formalism, whether that formalism is explicit or implicit. I feel all the best engineers kind of have a method to think about these problems.
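
To make the interleaving point concrete: for two threads with a fixed number of atomic steps each, the number of schedules is finite but grows combinatorially, which is why intuition stops scaling so quickly. The sketch below counts the schedules and then exhaustively runs every interleaving of a toy unsynchronized increment (each thread reads a shared counter, then writes back the value plus one). The program and the function names are hypothetical illustrations, not anything from Kafka.

```python
from itertools import combinations
from math import comb


def count_interleavings(steps_a, steps_b):
    """Number of ways to interleave two threads' atomic steps, keeping each thread's order."""
    return comb(steps_a + steps_b, steps_a)


def racy_increment_outcomes():
    """Run every interleaving of two threads that each do: read shared, write read + 1.

    Returns a dict mapping the final counter value to the number of schedules producing it.
    """
    outcomes = {}
    total_steps = 4  # two steps (read, write) per thread
    for a_slots in combinations(range(total_steps), 2):
        schedule = ["A" if i in a_slots else "B" for i in range(total_steps)]
        shared = 0
        local = {"A": None, "B": None}
        for thread in schedule:
            if local[thread] is None:
                local[thread] = shared       # first step of the thread: read
            else:
                shared = local[thread] + 1   # second step of the thread: write
        outcomes[shared] = outcomes.get(shared, 0) + 1
    return outcomes


if __name__ == "__main__":
    print(count_interleavings(2, 2))    # 6
    print(count_interleavings(10, 10))  # 184756
    print(racy_increment_outcomes())    # {2: 2, 1: 4}
```

Only two of the six schedules give the intended result of 2; the other four lose an update. With just ten steps per thread there are already 184,756 schedules, so exhaustive checking by hand is out of reach long before a program becomes interesting.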

Apurva Mehta (24:58):

They have some implicit sense of abstraction, which is clean enough and elegant enough that it gives them a sufficient grip on the topic to solve problems with confidence, right? I think that's the big difference in my view: maybe in business processes, in the middle tier of business applications, the premium on formalism and on having that rigor is not that high, because you could probably do it without it.

Apurva Mehta (25:22):

And hence you don't go looking for it. Whereas in distributed systems, you probably cannot build a good system without actually developing that to start with, because your intuition is going to fail you very early on. And that's kind of how I see it.

Tim Berglund (25:39):

And even with the formalisms we have, like TLA+, running those models is a matter of how much I want to spend on compute and just time. How long am I willing to let this run before I think we're probably good? It's always probabilistic, right? It's expensive or, in principle, inaccessible to actually compute all the possible outcomes.

Apurva Mehta (26:04):

Right. I think that's probably true. I think it's not just that. I think it's also an individual processor in a distributed context, who's making a certain decision, based on certain information? What is that algorithm, right? I think if you can reduce that to something extremely simple and if all the processes, this is basically what consensus algorithms do, right?

Apurva Mehta (26:25):

If every processor follows the same script, the system will converge on a certain outcome, right? I think that is kind of extremely simple, but it's only simple after the fact; getting to that realization is an extremely rigorous, and even formal, process, right? But without that simplicity, you just have no idea what's going to happen, right?

Apurva Mehta (26:48):

And then things are going to fail in a bizarrely embarrassing way if you don't do it right. Anyway, I think that's kind of adding to what I said, not quite a response to you, but I feel that's where the premium comes in, right? That mathematical elegance and substance: even though it may not be formal, it is mathematically elegant. There's a huge premium on that in the work we do, which may not be true in other domains.

Tim Berglund (27:12):

Right, right. What's weird is, it probably makes the work harder to do in that sense, in that now it's this extra skill set. That's really what we're arriving at here: you need to have the skill set of being able to do that sort of discrete math. That's really what it is, a little chunk of discrete math.

Tim Berglund (27:37):

And you need to be comfortable there, to be a distributed systems engineer, at least able to handle that stuff a little bit, because that's a part of the job. That's not a part of the job with business software. Business software you do get to intuit things, the ... What's the word I'm looking for? That's not irony, but I'll just say, the twist on that is that, yeah that math is hard to do, and not everybody's comfortable doing it.

Tim Berglund (28:07):

But given that you've got that math, that introduces simplicity to the discipline, and access to formal validation, and actual literal axiomatic formalisms that don't exist in the business world. And here I'll just say this, and we can go on to the next question. For like 10 years, I've had this idea in my mind for a talk that I've never quite built, because I don't know who would listen to it.

Tim Berglund (28:33):

But I think that the fundamental idea here is that business software is an abstraction over mental processes, and things like telecom firmware, DSP code, and distributed event streaming are abstractions over mathematical formalisms. Are you building a machine to do a mathematical thing, or are you building a machine to do a mental thing? And with business software, you're building a machine to do a mental thing.

Tim Berglund (29:02):

And if minds aren't machines, then that's an impedance mismatch that will always make it really, really hard to write business software. It's just hard in a different way to write distributed systems software, because effectively you have to be extra good at that mechanistic thinking. You have to be a little bit of a mathematician to be able to play with the formalisms for that part of your workday, or work month, or quarter, or whatever it is that you do.

Apurva Mehta (29:29):

Absolutely, absolutely.

Tim Berglund (29:30):

Would that be a great talk somewhere? Tell me if it'll be a great talk, I can build it.

Apurva Mehta (29:33):

No, it's a great talk.

Tim Berglund (29:34):

[crosstalk 00:29:34], here for you to encourage me. Thanks, Apurva.

Apurva Mehta (29:40):

No, no, no. Actually, it's kind of close to my heart too. I started in computer science with some mathematics, and formal proofs of algorithms, and informal proofs of multi-threaded algorithms and stuff. So yeah, I think the role of mathematics, and mathematical abstraction, and mathematical elegance in our work is something I like talking about, but yeah, probably for another time.

Tim Berglund (30:07):

That'll be a different episode. Before you were working on Kafka Streams, you did other things. As you look back on your career, how do you see previous steps in your career as preparing you for the period of time in which you were a distributed systems engineer?

Apurva Mehta (30:27):

I think coming out of college, I had kind of just graduated undergrad. You don't really have a specialization. I don't even know if I took a distributed systems course. I don't know if it was offered back then as a discipline where I grew up, back in India. I know I did compilers, operating systems, databases; I don't know if there was ever distributed systems.

Apurva Mehta (30:46):

But in any case I think ... Anyway, I got a job at Yahoo, and there we had to build an internal S3 service. That was back in 2008. S3 was kind of becoming popular and everyone said, "Oh, we need to build an internal cloud." That was the rage. Anyway, at Yahoo, I actually built that ... I worked with another person who I have a lot of respect for, and they asked us to build the replication system for that S3 service.

Apurva Mehta (31:18):

I think that was my first exposure. I didn't even know that it was like ... It was not quite a discipline, or maybe it was and I had just never heard of it. But I read a bunch of good people: Jim Gray, I remember "The Dangers of Replication." I think that stuck with me. You have the Byzantine Generals, and other examples are there. But replication has been a fundamentally hard distributed systems problem in distributed stateful systems, and that was my first exposure.

Apurva Mehta (31:48):

I learned a hell of a lot there, and I think that positioned me well. And also the person I worked with was kind of outstanding in the abstract sense. They really got the design down, and that's why I really appreciated the elegance of the algorithm for that replication protocol. It was so simple, I still remember it, and I can still write down the algorithm on half a sheet of paper, maybe literally on the back of an envelope.

Apurva Mehta (32:15):

And I think, for me, over those years, just going from the abstract, understanding it, to implementing it, to testing it, to rolling it out was one of the biggest, steepest learning curves. But I think that's ultimately what positioned me, right? And then after that, I worked on LinkedIn search, which is its own different type of distributed systems problem.

Apurva Mehta (32:38):

I worked on LinkedIn's graph database, which is kind of a distributed key-value store for graph queries. Search is, again, an inverted-index distributed query system. But on both of those things, I only worked on the single-node aspects of it. A particular node in your distributed system has to work on ... Do something, and doing that something itself is a hard problem in many cases.

Apurva Mehta (33:02):

Like search, you have to go through the whole index, compute something like in my ... You have to do it fast. At LinkedIn, there's a premium on low latency for search, especially for the recruiter product and stuff like that. It was interesting, different, because the distributed systems part was like, who gets asked to do what? And what has industry [inaudible 00:33:23] happened and I didn't quite work on it, but I could understand it.

Apurva Mehta (33:26):

And then I came to Confluent and worked on exactly-once, which is kind of both at a broker level, changing the message format and adding all the metadata you need to do exactly-once processing correctly. And I did a lot of the client work for the producer, which is kind of its own set of problems: doing in-order delivery with multiple in-flight requests, and all that stuff. Not quite distributed, but distributed-ish, in between pure distributed systems design and the local work that you have to do within a distributed system.
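
To make the in-order delivery problem with multiple in-flight requests a little more concrete, here is a minimal toy sketch in Python. It is not Kafka's actual code: it assumes a simplified model in which every producer batch carries a sequence number and the partition log appends a batch only when it carries the next expected sequence. The names `ToyPartitionLog` and `append` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartitionLog:
    """Toy model of one partition accepting batches from a single producer."""

    records: list = field(default_factory=list)
    next_seq: int = 0  # the sequence number this log expects next

    def append(self, seq: int, payload: str) -> str:
        if seq < self.next_seq:
            # Already appended earlier: this is a retry, so do not append again.
            return "duplicate"
        if seq > self.next_seq:
            # A later in-flight batch overtook an earlier one: reject it so the
            # producer can resend in order.
            return "out_of_order"
        self.records.append(payload)
        self.next_seq += 1
        return "appended"


log = ToyPartitionLog()
# Several batches are in flight at once. Batch 1 arrives before batch 0,
# and batch 0 is retried because its acknowledgement was lost.
results = [
    log.append(1, "b"),  # arrives too early
    log.append(0, "a"),
    log.append(0, "a"),  # retry of an already appended batch
    log.append(1, "b"),  # resent in order
    log.append(2, "c"),
]
print(results)
print(log.records)
```

In this toy model the log ends up with each payload exactly once and in order, even though the batches arrived reordered and one was sent twice.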

Apurva Mehta (34:00):

So yeah, I think, I don't know if that answered your question, but that's kind of been my journey. I think that working on the replication protocol at the beginning, I feel, was the biggest ... As distributed systems go, it was the purest thing I've worked on that really taught me a lot.

Tim Berglund (34:17):

Yeah, no, for sure. That definitely answers the question. That's a great answer. An S3 clone, I mean, has certain parallels to Kafka as a distributed store of chunks of immutable data, right?

Apurva Mehta (34:38):

Right.

Tim Berglund (34:38):

That's a nice place to start. A graph database that is, graph search functionality over some kind of data that's maybe going from thinking about the distributed state to distributed computing. It sounds like what the graph problem would be, is that right?

Apurva Mehta (34:55):

Yeah, pretty much it's a combination. Definitely a lot more compute than just an S3, but not a whole lot. I think search was the most amount of computing in a distributed context.

Tim Berglund (35:06):

Sure.

Apurva Mehta (35:07):

The graph was just lookup: who is in my network? Who works at this company? Who is in my second-degree network? It's kind of more to just collect a lot of data, and aggregate it, and send it back, versus actually compute, right, versus rank results in search. Totally different kinds of things.
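
As a rough illustration of why a query like "who is in my second-degree network?" is mostly collecting and aggregating data, here is a minimal single-machine sketch in Python. It assumes the connections fit in an in-memory adjacency map, which is not how a distributed graph store works; the function name `second_degree` and the sample members are hypothetical.

```python
def second_degree(connections: dict[str, set[str]], member: str) -> set[str]:
    """Return members exactly two hops away from `member`."""
    first = connections.get(member, set())
    reachable = set()
    for friend in first:
        # Collect each direct connection's own connections.
        reachable |= connections.get(friend, set())
    # Drop the member and their direct connections.
    return reachable - first - {member}


# Hypothetical sample data, for illustration only.
connections = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}
print(sorted(second_degree(connections, "ana")))
```

The work here is lookups and set unions rather than scoring or ranking, which matches the contrast drawn above between the graph queries and search.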

Tim Berglund (35:23):

Yeah, it seems insanely compute-intensive in terms of that. [inaudible 00:35:28] especially at the scale that LinkedIn is operating. If you were going to advise somebody, if there's somebody listening who thinks that distributed systems engineering sounds like the work that they want to do. Like, "Wow, I would love it if that were my job, and it's not right now."

Tim Berglund (35:44):

What advice would you give them? What should they be doing now? And even think ... Now, put on your engineering manager hat, like you run these teams, you make hiring decisions about these people. How can a person get to the point of being a desirable candidate? What should they do if they're not there yet?

Apurva Mehta (36:04):

I think definitely contributions to open source, I think that's a big advantage people have today. If you even look at it right now, all of our exactly-once work was purely in Apache Kafka, and everything we're doing in Confluent right now, in terms of removing so [inaudible 00:36:19] and that massive shift in the controller, shutting the controller, and this massive next-gen Kafka architecture is 100% in the open, right?

Apurva Mehta (36:30):

I think, taking advantage ... I think, what I'm trying to get at is that you cannot substitute actually working on the systems. There's no substitute for it. If I'm looking at someone and then, going to the second half of what you asked, I think nothing beats "I've worked on something, I've delivered a result," right?

Apurva Mehta (36:49):

That's kind of what makes you qualified, if I was looking for a distributed system skillset, right? But I think, we also hired people who haven't worked on this stuff, right? It's not like you have to have worked on something significant to get a job at Confluent in these teams. But I think, in general engineering, if you've ... Working on any kind of system, if you work on any kind of stateful system, for example, it's a natural progression, right?

Apurva Mehta (37:18):

If you understand systems engineering, the problems of state in isolation, then you've shown the engineering capability, and the ability to write high-performance code, concurrent code. I think those are the ingredients you'd need to be a good distributed systems engineer.

Apurva Mehta (37:36):

That's like a stepping stone, I would say. But also just getting involved, right? I think there are tons, we have tons of Apache projects which build very fundamental distributed infrastructure. I think participating in those discussions, and even understanding the design, and the evolution of the design, and being able to talk about it in a good way, is for me a very good sign, right?

Apurva Mehta (38:02):

And also being able to show that you have actually done it. Those are the things I would say: get involved, work on some aspect of systems. All of these distributed stateful systems are distributed systems plus actual engineering. Even if you do the engineering in some other context, it transfers, right?

Apurva Mehta (38:17):

That's what I'd say. I don't know if that's a good answer to your question, but it's kind of also not a perfect way to ... It's not easy to answer such a question better, but I think getting involved, I think there's never been a better time, now that everything is in the open. Apache projects, being the way they are, invite participation, and there are tons of them to participate in, Kafka not least.

Tim Berglund (38:37):

Basically, get started doing the work is kind of what you're saying. If this is work that you want to do, you don't have to wait for somebody to offer you a job. If you've got excess capacity in your life, and this all presumes that you do, but if you've got that excess capacity, put that into studying and contributing to the code itself.

Apurva Mehta (39:00):

Yeah, I absolutely think so. And even just reading the Kafka mailing list, with some of the best engineers in the world, the best distributed systems engineers, they never say things lightly. Where are they coming from? Why did they say the things they're saying? Trying to piece it together. I think it's there, I think doing the work using all these resources, and then just ... that's about it.

Apurva Mehta (39:29):

I think the opposite is, I think, reading papers is important, but only reading is kind of not a differentiator. I think that's ... It's harder to actually understand things unless you actually ... I think that's my experience. I think you have to work on it in some capacity, in a very specific context, to actually learn something meaningful.

Tim Berglund (39:50):

We had a talk at Kafka Summit this year, the online Kafka Summit 2020, about becoming an Apache Kafka committer. And we'll link that in the show notes. So if that's the thing you're interested in doing, if you're not already an open-source committer, if you're not a distributed systems developer and Apurva says, "Well, just get in there on the mailing list, and read, and participate, and get in the code."

Tim Berglund (40:13):

You're probably thinking, yeah, right buddy, I'm not going to do that. But you actually can, I can say a couple of things. Number one, that talk has some tips on how to do it if you're not at Apache already. Number two, community-wise the Kafka community is nice. I mean, there are open source communities where if you're going to get involved, you need to be prepared to take some hits, and maybe people won't treat each other nicely.

Tim Berglund (40:45):

That is much less true of Kafka, both the community of people who use it and the community of people who build it, who actually work on the code. I just want to encourage you. I mean, that's the bottom line, what Apurva just said, is that, yes, it is accessible, it is open. You can become involved, you can start studying, you can get to the point of contributing.

Tim Berglund (41:06):

And that probably takes time over and above your day job, but obviously, you're trying to build a new skill. That presumes that you have margin in your life to be able to build that skill, and I guess what I'm saying is don't be afraid. There are communities where you ought to be afraid of getting involved in the mailing list and opening a pull request. And this is not one of them.

Apurva Mehta (41:30):

Absolutely, and read all these KIPs. KIP-500, for example, it's the latest one. And the older ones, they're expositions of distributed systems design and-

Tim Berglund (41:39):

Yes, particularly older ones. 500 and I don't have numbers, but anything associated with EOS, and then ... Well, I guess some of them are JIRAs before there were KIPs, but like JIRA 50 comes to mind right now and replication. These are solving fundamental distributed systems problems. So if you get in there I'll-

Apurva Mehta (42:00):

Absolutely.

Tim Berglund (42:01):

Include a link to JIRAs and KIPs.

Apurva Mehta (42:04):

One thing I learned, I don't know if you have time, but I think also there's a lot now, which wasn't true when I was studying. All of these, MIT OpenCourseWare, and all of these universities, the best courses actually, I think, are available for free online. If you want a theoretical foundation, that's also there, right?

Apurva Mehta (42:20):

And I do think it's important. I think I kind of muddled my way to it, reading papers while I was on the job, but I think those courses are outstanding. I know people coming through those systems now have such a good foundation to think about these problems that I think it's worth spending time doing them. Maybe Coursera has something, I don't know. There are probably tons of options. Pick any one and do a course on distributed systems, I think that would help.

Tim Berglund (42:43):

Excellent advice, excellent advice, if you've got a little bit of time. I've been watching some cooking videos while I cook, using that little bit of time to learn something. It's also possible to watch distributed systems videos while you cook, if you cook, if that's a thing. You can find these little kind of hacks, ways to use existing chunks of time in your life to learn things, if that works out. Depending on your household situation, there could be lots of little kids yelling, and it's hard to concentrate on distributed systems while you're cooking. Try and find some other time if you can.

Apurva Mehta (43:19):

I have a two-year-old, and a four-year-old, there's no way [crosstalk 00:43:21].

Tim Berglund (43:25):

Don't get you started, right? This is all easy talk coming from the empty nest here. Oh, just use that time when you're doing some cooking in a leisurely fashion with a glass of wine, and everything's quiet. Right, right. Not if there are little ones at home. Last question coming up against time here, obviously you do work with some pretty smart people.

Tim Berglund (43:49):

Are there any qualities that you see in the people around you that you find yourself wanting to imitate? Particularly now, and I know this could be painful now that you're a manager, you're not writing code anymore, and you see these people leading this charmed life, just heads down coding all day. You're not in meetings with VPs, and things like that. And you kind of remember how great that was. But what do you see, even among other leaders that you rub elbows with? What do you see in the people around you that you would like to have in yourself?

Apurva Mehta (44:21):

I think that's an interesting question. I think, in terms of sticking with the theme, maybe we can get more general if there's time. But I think with the theme being distributed systems, this podcast, and the topic of this particular episode, I think there are the best engineers I've seen. There's, I think, one of the traits which makes them stand out, which is that they will never, ever, ever stop thinking about something, or investigating something, or looking at a bug until they have been able to easily explain all of the causality that led to that event.

Apurva Mehta (45:00):

If there's an incident in production or they've written a new piece of software, and the test failed in some weird way, like they had this distributed systems test, and it's very hard to figure out, it's very easy to say, "Oh, well, maybe it's something flaky, maybe it's something else that we don't know. Let's close it for now, come back to it later."

Apurva Mehta (45:22):

I think that's something which I feel we have people at Confluent who are so tenacious about. I think that always is, I feel like, always a good ... I mean, I feel it's necessary because it's what you don't know that gets you. And then spending the energy to find out when you're surprised, and really, really, sometimes it's hard to get the answer. But being willing to put in the work, I think that's ...

Apurva Mehta (45:47):

I keep getting inspired by people who do that. And never say "I don't know"; it's never good enough, no matter what. I think if you really bottom-line it, you will build all your theory, you'll build all of your everything else if you have that attitude. And I think over here in this space, I think that's something that differentiates the very, very best from the rest.

Apurva Mehta (46:11):

But I feel like they'll never ever stop until they're satisfied with some very hard problems, right? I think it's easy to say skip it, but I think it's really ... And I think what I see is that from that follows intellectual honesty: being able to say I really don't know, being able to articulate what you don't know, being able to articulate what you do know, technical communication. Being able to think about the right concepts to explain to other people, that's another problem, right?

Apurva Mehta (46:44):

You can rarely solve problems in isolation. When you have a hard one, how do you get your team, other people, to brainstorm with? Being able to talk at the right level? All of it follows, right? I feel like if you really dig in, and try and solve the hard problem when the opportunity presents itself in a very specific context, and you dig in and do it again and again, you need all of the rest, what I just mentioned, to do it over time.

Apurva Mehta (47:09):

And I feel for me, that's the ... Like you asked me also earlier, what do I look for? I think that's the trait I look for in distributed systems engineers, that tenacity: I'm never going to give up until I can explain to a five-year-old what happened or what is happening. That kind of thing, I think, is something I feel inspired by in many of the people here at Confluent. And I think it also stands people in this discipline in good stead.

Tim Berglund (47:36):

My guest today has been Apurva Mehta. Apurva, thanks for being a part of Streaming Audio.

Apurva Mehta (47:40):

Thank you. And thank you for having me, it was great to be here.

Tim Berglund (47:43):

Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST. That's 6-0-P-D-C-A-S-T to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021 and use it within 90 days after activation. Any unused promo value on the expiration date will be forfeit, and there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribed through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So, thanks for your support and we'll see you next time.

What's it like being a distributed systems engineer? Apurva Mehta (Engineering Leader, Confluent) explains what attracted him to Apache Kafka®, the challenges and uniqueness of distributed systems, and how to excel in this industry.

He dives into the complex math behind the temporal logic of actions (TLA) and shares about his experiences working at Yahoo and LinkedIn, which have prepared him to be where he is today.

Apurva also shares what he looks for when hiring someone to join his team. When you're working on a system like Kafka and Kafka Streams, really understanding what your machine is doing, where the bottlenecks are, and how to design improvements to address inefficiencies is critical.
