January 25, 2021 | Episode 140

Distributed Systems Engineering with Apache Kafka ft. Guozhang Wang


Tim Berglund:

There are a lot of specialties within the very broad vocation of software engineering, and all of them are hard to do. Distributed systems engineering is one corner of the discipline that poses a particular set of challenges. What's it like to build a distributed system? What special problems arise? How do you land a job doing it? That's the conversation on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud. Hello, and welcome back to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined today in the virtual studio by my co-worker, Guozhang Wang. Guozhang, welcome to the show.

Guozhang Wang:

Thank you, Tim, for having me.

Tim Berglund:

So you are a tech lead in the, we'll call it the streaming department, in engineering here at Confluent, is that right?

Guozhang Wang:

That's right.

Tim Berglund:

Awesome. So I want to talk to you today. Honestly, I'd like to talk to you about some of that stuff, see if we can dig into some [inaudible 00:01:03] thing, but this is a part of our series on being a distributed systems engineer. And so before we get to that stuff, I want to ask you just a little bit about how you came to do what you do and your thoughts on distributed systems engineering as a separate discipline. The idea is that there are people who think it sounds like a really cool thing to do, and you want to talk them out of that. No, you don't want to talk them out of it.

Guozhang Wang:

Yeah.

Tim Berglund:

You want to encourage those people, because it's actually a really fun branch of engineering. We just want to give people an idea of what it's like. So what do you think... You work in streaming, and you can elaborate on that a little bit if you want, but what makes it interesting to you?

Guozhang Wang:

Well, I think the most interesting part of working as an Infra Engineer, especially an Infra Engineer for distributed environments, is that you basically have to deal with all kinds of failure scenarios, right? Or, to be more general, you have to deal with all kinds of hardware issues that in a single-node environment are much less likely to happen. I think that is actually one of the most challenging, but also, I would say, most professionally satisfying challenges that I'm enduring from day to day.

Tim Berglund:

That's interesting. That really has come up in a lot of these, dealing with failures and dealing with the state. Those are really the two themes that emerge as the primary challenges of distributed systems.

Guozhang Wang:

Yes. And actually, I can maybe talk a little bit more about dealing with failures. One infra topic in all of that is how we can actually achieve agreement among multiple nodes that are connected in a cluster, where the connection in practice is asynchronous, right? I remember that in my graduate studies at Cornell there was a topic called the replicated state machine. That is actually years of research, trying to understand what we can actually achieve, under some assumptions, in a practical environment. And because of the various failures that we have to deal with, how to achieve agreement, for example, in Kafka, how to keep the logs that are replicated across different replicas consistent and in agreement with each other, is actually a very challenging part. If we assume that there are no failures, i.e., it's failure-free, then it would be a much simpler problem to solve.
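
The replicated state machine idea mentioned here can be sketched in a few lines: if every replica applies the same log of commands in the same order, and applying a command is deterministic, the replicas end up in the same state no matter how far any one of them lagged along the way. This is only a toy illustration; the `Replica` class and the `set`/`delete` commands are assumptions made up for the sketch, and it deliberately skips the hard part, which is getting the replicas to agree on the log in the presence of failures.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic key-value state machine driven by a log of commands."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, log):
        # Apply only the entries not yet applied, strictly in log order.
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
        self.applied = len(log)


# Hypothetical command log that all replicas have agreed on.
log = [("set", "a", 1), ("set", "b", 2), ("delete", "a", None), ("set", "b", 3)]

fast, slow = Replica(), Replica()
fast.apply(log)       # sees the whole log at once
slow.apply(log[:2])   # lags behind, for example because of a slow network
slow.apply(log)       # catches up later by applying the remaining entries
```

Both replicas converge on the same state because they consumed the same prefix of the same log, which is why keeping the replicated logs consistent is the core problem.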

Tim Berglund:

Right. But that assumption is not going to hold in a practical environment.

Guozhang Wang:

Of course.

Tim Berglund:

Things are going to break. So yeah. Dealing with failure and you were just describing, coming to a consensus about the state is what you're preoccupied with doing. So tell us about your background a little bit. How did your career prepare you to do this work that you're doing now and your educational background and everything?

Guozhang Wang:

Of course. I actually got my PhD in computer science from Cornell University, and my specialty area was databases. That's where I started to study distributed systems. A bunch of the professors there, including my own advisor, showed me very interesting and typical problems that have been studied and researched over the past years, but most of that is research we do on paper. We basically build a prototype to demonstrate some of the ideas. What really intrigued me to look into the real-world problems was when I joined LinkedIn, first as an intern and then full-time. I joined LinkedIn and worked on Kafka, which was actually my first project out of graduate school, because at LinkedIn back in 2012, Kafka had already been widely adopted, with massive deployments.

Guozhang Wang:

And at that time, I have to say, the stability of Kafka was crucial for the whole LinkedIn website and for the whole company, because once Kafka is not stable, if it has unavailability issues or some of the logs diverge, everyone knows, because the CEO watches some dashboards. Jeff basically watches the ones that depend on Kafka delivering that data. So there are a lot of real-world issues, like what I mentioned about all kinds of hardware failures that you have to deal with, and you also have to try to heal from them so that the company and the site stay stable. So that's actually where I started to realize the challenging, but also very interesting, problems that you need to look into, especially in real-world life on a daily basis.

Tim Berglund:

Yes. I suppose it might be a good enough approach to failure in an application, depending upon the scale and the user community and your relationship with them. It might be good enough, when something goes wrong, to, say, log a stack trace and expect or hope that the user retries.

Guozhang Wang:

Yes.

Tim Berglund:

And then it'll be okay, but here in infrastructure you have to be that layer of this always works-

Guozhang Wang:

That's right.

Tim Berglund:

So that you're not a source of those exceptions for application developers to deal with.

Guozhang Wang:

Yeah.

Tim Berglund:

So Kafka was the first distributed system that you worked on and it's still directly or indirectly the system you work on.

Guozhang Wang:

Yeah.

Tim Berglund:

What do you think makes it interesting? You talked about how it gets used at LinkedIn, and we can all assume we know what Kafka does if we're listening to this podcast, but what is interesting about it to you?

Guozhang Wang:

Well, I think the part that is most interesting to me is to really consider how data persistence comes into play in a distributed environment. Just to give you some context: before Kafka was introduced, at LinkedIn, and I believe at a lot of other companies in the Bay Area at that time, we had tried a lot of messaging systems that are primarily based on in-memory data structures. They can actually give you pretty good efficiency, but when it comes to availability and durability, and especially in a messaging system where we have this critical question of how to deal with back pressure, meaning that the consumer cannot keep up with the traffic a producer is generating, in-memory data structures do have this limit. And Kafka at that time, now I'm talking about back in maybe 2010, 2011 or 2012, was the first idea that introduced persistence, meaning that we persist the data as log structures on the Kafka brokers before it is consumed by the consumers.

Guozhang Wang:

And at that time, that was a very interesting idea to me, because I had been working in databases, where I had been dealing with persistence for a long time. The rule of thumb is that hard disks are never going to compete with memory storage when it comes to efficiency, right? But actually, when you work with persistent data structures in the right way, and in Kafka we try our best to do batching and to do sequential reads and writes, persistence is not as expensive as we usually thought. I think that was the first impression and the first interesting idea that I encountered when I was first looking into the Kafka architecture.
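
A minimal sketch of why a persisted log helps with back pressure, assuming a toy in-process `PartitionLog` class (a made-up name, not a Kafka API): producers append whole batches to the end of the log, and each consumer tracks its own offset, so a slow consumer simply falls behind instead of forcing the producer or a faster consumer to wait. A Python list stands in for the on-disk log here.

```python
class PartitionLog:
    """Toy append-only log: producers append batches, consumers read by offset."""

    def __init__(self):
        self.records = []

    def append_batch(self, batch):
        # One sequential append for the whole batch; returns the base offset.
        base_offset = len(self.records)
        self.records.extend(batch)
        return base_offset

    def read(self, offset, max_records):
        # Reading never removes data, so consumers do not affect each other.
        return self.records[offset:offset + max_records]


log = PartitionLog()
for start in range(0, 10, 5):
    log.append_batch([f"event-{n}" for n in range(start, start + 5)])

fast_offset, slow_offset = 0, 0
fast_offset += len(log.read(fast_offset, 100))  # fast consumer drains everything
slow_offset += len(log.read(slow_offset, 3))    # slow consumer reads 3 and falls behind
lag = len(log.records) - slow_offset            # how far behind the slow consumer is
```

The slow consumer's lag is just the difference between the end of the log and its own offset, and it can catch up later by reading from where it left off.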

Tim Berglund:

And what kind of problems does persistence introduce? Once you start writing things to disk, you said that the performance isn't as bad as you think, and obviously you have to do things to optimize. You mentioned sequential writes and trying to be smart about what buffers you copy and doing things in a way that will use the page cache efficiently. And there's all the IO stuff, but what happens in terms of state management in the system? What gets hard when you start writing things to disk?

Guozhang Wang:

I think writing things to disk itself is not that hard, as long as you try to batch enough data and you do everything sequentially. What actually is hard is how to maintain the metadata for it. Because in order to do things sequentially, right, meaning that if a single producer is sending data which contains a bunch of records, you are potentially writing that bunch of records to the same broker so that you can achieve the batching effects and achieve sequential writes. But writing them all to the same broker means that, at least at a given time, you basically have a single writer, meaning that you have not scattered the writes across multiple brokers, right?

Guozhang Wang:

So when that data has to be rebalanced for various reasons, because you scaled out your cluster or scaled down your cluster, or some error happened and you have to fail over, how do you maintain the metadata consistently so that all the clients learn about it? For example, who is the new broker that I should talk to in order to produce data or consume from it, and what range of the produced data is available on that broker? I think that is actually a pretty challenging part that basically results from this [inaudible 00:11:07].
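
To make the metadata problem concrete, here is a small sketch of a client that caches which broker leads a partition and refreshes that cache when a failover makes it stale. Everything in it is hypothetical: the `Cluster` and `Client` classes, the `NotLeaderError` exception, and the `orders-0` partition are invented for illustration, and a single in-process object stands in for both the brokers and a consistent metadata store.

```python
class NotLeaderError(Exception):
    """Raised when a record is sent to a broker that no longer leads the partition."""


class Cluster:
    """Hypothetical stand-in for the brokers plus a consistent metadata store."""

    def __init__(self):
        self.leader = {"orders-0": "broker-1"}  # partition -> current leader
        self.epoch = {"orders-0": 0}            # bumped on every leader change
        self.logs = {"orders-0": []}

    def fail_over(self, partition, new_leader):
        self.leader[partition] = new_leader
        self.epoch[partition] += 1

    def metadata(self, partition):
        return self.leader[partition], self.epoch[partition]

    def produce(self, broker, partition, record):
        if broker != self.leader[partition]:
            raise NotLeaderError(partition)
        self.logs[partition].append(record)


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}     # partition -> (leader, epoch) as last seen
        self.refreshes = 0

    def _refresh(self, partition):
        self.cache[partition] = self.cluster.metadata(partition)
        self.refreshes += 1

    def send(self, partition, record):
        if partition not in self.cache:
            self._refresh(partition)
        try:
            self.cluster.produce(self.cache[partition][0], partition, record)
        except NotLeaderError:
            # Cached metadata was stale: learn the new leader and retry once.
            self._refresh(partition)
            self.cluster.produce(self.cache[partition][0], partition, record)


cluster = Cluster()
client = Client(cluster)
client.send("orders-0", "a")
cluster.fail_over("orders-0", "broker-2")  # leadership moves while the client is unaware
client.send("orders-0", "b")               # first attempt hits the old leader, then retries
```

The sketch only works because `Cluster.metadata` always returns one consistent answer; keeping that answer consistent across many brokers during failures is the challenging part described above.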

Tim Berglund:

What message should be readable, what offset should be readable based on how replication has worked. So, yeah, metadata, that's coming to consensus about metadata like that.
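
The question of which offsets should be readable can be illustrated with the high watermark: in Kafka, consumers only see records below the smallest log end offset among the in-sync replicas. The sketch below is a simplified model with made-up replica names and a plain function, not broker code.

```python
def high_watermark(log_end_offsets, in_sync_replicas):
    """Smallest log end offset among the in-sync replicas.

    Every offset below this value exists on all in-sync replicas.
    """
    return min(log_end_offsets[replica] for replica in in_sync_replicas)


def readable(records, hw):
    # Consumers are only shown records below the high watermark.
    return records[:hw]


leader_log = ["m0", "m1", "m2", "m3", "m4"]
log_end_offsets = {"leader": 5, "follower-a": 4, "follower-b": 2}

hw_all = high_watermark(log_end_offsets, ["leader", "follower-a", "follower-b"])
# If follower-b drops out of the in-sync set, it no longer holds readers back.
hw_shrunk = high_watermark(log_end_offsets, ["leader", "follower-a"])
```

With all three replicas in sync only the first two records are readable; once the lagging follower leaves the in-sync set, the first four are.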

Guozhang Wang:

Exactly.

Tim Berglund:

Yeah, fair point. We should know how to write things to disk and make that work, but you are still, and this is always a subtlety that I try to grapple with. Events are immutable, and so from an application architecture perspective, sharing topics between microservices is not a terrible thing to do, because we're basically sharing immutable data structures, and that's in general a safer thing to do than sharing mutable ones. Of course, topics themselves, or partitions themselves, are mutable because we append to them, we put things in them, and it's Kafka's job to come to consensus around that mutable state.

Guozhang Wang:

Yes.

Tim Berglund:

The state being what's in a topic.

Guozhang Wang:

Yeah. And I'd also like to bring up the newest update from the Kafka community, because for this mutable metadata, we used to store the most critical parts in ZooKeeper. Recently we have been working towards a ZooKeeper-free architecture, where we move the metadata, which is mutable and some of which can be updated quite frequently, from ZooKeeper to Kafka logs as well. So you can imagine the Kafka log structure will now be used for maintaining both the data and its metadata.

Tim Berglund:

Nice. Yes of course. That's building Kafka out of Kafka and using topics to exactly distribute metadata, right.

Guozhang Wang:

Yeah.

Tim Berglund:

Kafka always tries to build Kafka out of Kafka. So I'm glad that this is a part of the broader KIP-500 effort to use topics in that way because it just seems like the way we do things.

Guozhang Wang:

Yeah.

Tim Berglund:

How about if we just back up and think about distributed systems engineering? It's what you've done. You got a PhD in a related field, it's what you've done your whole career since your original internship, and you're doing it now as an engineering lead. This might be hard because it's the thing you've done. What I want to ask is how it is different from, say, being a full stack application engineer, of the sort that most of the people who build things on Kafka are. And that was me 10 years ago, the last time I had just a pure engineering role. I was a full stack developer, and obviously there was not a lot of Kafka then; there were still a lot of basically monolithic application architectures. But I was your basic full stack Java web developer. So what's the difference, do you think? What are the properties you're looking for in a distributed systems engineer?

Guozhang Wang:

Yeah. I think maybe I can talk about that from how I personally talk to candidates for full stack engineer versus Infra Engineer during our hiring process at Confluent. For full stack engineers, right, when we talk with them during technical design or coding exercises, what we are looking at is, how to put it, how well you understand what is available off the shelf for you, like different tools and different systems, and how you can stitch them together and leverage what is available to you to solve specific questions, right? In particular, if you want to build something from scratch, like if we say, Oh, build a link shortener from scratch, you can say, Oh, I want to use a NoSQL database, right?

Guozhang Wang:

I want to use Cassandra, or I want to use other systems, and you understand what Cassandra or those other systems can provide to you in terms of persistence, in terms of durability, in terms of data replication properties, things like that. You basically use that, and you may stitch multiple of those off-the-shelf systems together to achieve your goal. Whereas for Infra Engineers, you are actually the developer of Cassandra, you are the developer of Kafka, so you cannot say, Oh, I can just assume that there is some other underlying system which provides those properties off the shelf for me. You basically have to build those properties yourself. I think that is basically, especially from my personal hiring experience, what we look for differently in candidates for full stack engineer versus Infra Engineer.

Tim Berglund:

Got you. And that really goes to being an infrastructure engineer, a distributed infrastructure engineer in particular: you are building the infrastructure, so you don't get to integrate these nice pieces. The pieces that you get to integrate are like, okay, Raft is a thing.

Guozhang Wang:

Yeah.

Tim Berglund:

We're going to build our own Raft implementation. We need a distributed database. Oh, here's Cassandra. Thanks. That is, if you will, the privilege of the full stack engineer: to take advantage of the universe of infrastructure options, of which, of course, actually there's one

Guozhang Wang:

Yeah, leveraging off-the-shelf systems and stitching them together to actually achieve your specific goals is also very challenging, right?

Tim Berglund:

Yes, I was just going to say that.

Guozhang Wang:

Yeah.

Tim Berglund:

That's very much a specialty, and it requires broad knowledge of the things that are available, which is a constantly moving target, right? There's this real ongoing continuing education thing that you have to do to know what pieces are out there that other people are using and being successful with, and what the current state of the art is. That's kind of where you want to live, or somewhere near there. And so that's a challenge. I think that skill set is being good at knowing what those pieces are, understanding enough about them, and then integrating them. And then on the infrastructure side, well, you're building all that.

Guozhang Wang:

Yes. In fact, if you look into the Kafka community, right, many of the contributors to Kafka are actually not 100% full-time developers of Kafka, but users of Kafka.

Tim Berglund:

Exactly.

Guozhang Wang:

They basically leverage Kafka, but they also know a lot about Kafka internals, and they know them so well that sometimes when they find some issues they can fix them themselves and contribute the fix back to the community. So I think this is also a pretty valuable characteristic for an engineer, either in infra or full stack.

Tim Berglund:

Yes. That's such a good point. I occasionally have this conversation. You're an Apache Kafka PMC member, yes?

Guozhang Wang:

Yes.

Tim Berglund:

Yes. So not just a committer, but a PMC member. So sometimes when I'm talking to you and we say community, it's easy to think, well, the community of the people who are building Kafka with me: some of your co-workers at Confluent and colleagues who work for other companies who are committers and PMC members for Kafka. That's a community of the people who build it. There's a much bigger community of the people who use it.

Guozhang Wang:

Yes.

Tim Berglund:

And well, that's what this podcast is all about: helping take care of those people and helping them understand the ecosystem better and know where things are going. And that's a big set of concerns. And most of those people are never going to open up a PR. That's just not how life works. It's hard enough to get to know the infrastructure in terms of its API and operational surface area and do your dang job, building things with it. So both are important definitions of community. I appreciate that you highlighted that.

Guozhang Wang:

Yeah. And sometimes not submitting a PR is totally fine. Sometimes just creating a ticket reporting, "Oh, I encountered some problems, and I suspect maybe they're due to some bugs in Kafka or some of the designs in Kafka," is equally helpful. And that is equally highly valued in the community as well, right? You don't need to provide a solution. As long as you can report some issue that you observed, that is also very helpful.

Tim Berglund:

Everybody, I want you to hear that. Guozhang is a PMC member and he wants you to open Jiras. Now look, everybody, there are good Jiras and there are bad Jiras. It's nice to do your homework. It's nice to be specific and have some evidence of what you think the problem is as much as you can. But the core folks are very eager to hear your feedback if something seems to be not right.

Tim Berglund:

Yes. Don't hesitate. Since I have you here, can we dig into some of the ksqlDB stuff you do?

Guozhang Wang:

Oh, of course. Yeah. That's the thing that I have been working on since I joined Confluent.

Tim Berglund:

Yeah. Why that? First of all, why did you get into that?

Guozhang Wang:

Yeah, actually it's a pretty interesting question for me to ask myself as well. So when I was about to join Confluent, I had already been working on Kafka at LinkedIn for almost three years. I basically worked on Kafka core, including the replication layer, some of the security features, throttling mechanisms, and things like that. And to me at that time, my thinking was that Kafka is a storage engine; you can consider it the HDFS of the real-time world, right? So HDFS is for batch processing, Kafka is for real-time streaming, but they are mainly for storage. And at that time, I believe Jay and Jun already had the idea that we should really grow up the software stack from storage to computation as well, right? If I take the same analogy, it's like going from HDFS, the storage layer, to a computational layer like Hadoop, right?

Guozhang Wang:

And I also had the feeling that I had been working on storage for three years, and it was really a good time for me personally to move up the software stack to work on the computational layer. And also, I think stream processing is actually a very challenging topic. It has been in the academic research literature for like 20 years, but only recently in industry has it become a tremendous trend, with a lot of people moving from the large-volume side of data to paying attention to the velocity of the data: how fast can I get value from the data that I have collected so far, right? Back in 2015, I thought it was a really good time, and the co-founders also thought it was a good time. And so that's why I started building Kafka Streams and now ksqlDB.

Tim Berglund:

All right, cool. Yeah, I guess fundamentally there are two things there's storage and there's computation and well, you did storage time to move on to computation.

Guozhang Wang:

Yes.

Tim Berglund:

And that makes sense. It's an extremely important thing that Kafka as a platform has grown, an important set of capabilities that, well, responded to the fact that circa 2015 the broader community of users was maturing, and with the kinds of things people on the leading edge were doing with Kafka, they were getting to the point where they were saying, "Hey, you know what, we really do need our own compute here, or our own solution to compute."

Guozhang Wang:

Right. Yeah. Because back in the time, people were all doing stream processing on top of Kafka, either DIY, which I think was the majority of people, or using some other system like Storm, which I think was a popular one back at that time. But I think that providing a fully integrated ecosystem with Kafka Streams in the picture is actually a must-have that we need to do for the Kafka community.

Tim Berglund:

Exactly. And for those folks who were doing it DIY back then, I think it's time to cue the sad piano music, because it really was the thing that they had to do, the only choice they could make at the time. But it's rough, as you know, being a core developer on this stuff. It's a lot of work to build this, and that's not something an application developer should have to take on in addition to his or her primary duties. We should probably get rid of that sad music; I'm going to cry. So talk to us about some of the key ideas behind, well, Streams and ksqlDB. If you're a regular listener to the podcast, you already know what the difference is. Guozhang, in case anybody is picking up with just this episode and they haven't heard, maybe quickly differentiate between Kafka Streams and ksqlDB, and just tell us: what are some fundamental architectural principles behind them that you have access to as a person who's driving the design?

Guozhang Wang:

Sure. I think the number one principle of Kafka Streams and ksqlDB, because they are fully integrated with Kafka, is that they leverage the persistence characteristics of Kafka to the full extent in their architectural design. I can give you some concrete examples. For example, Kafka Streams, aka the runtime of ksqlDB, uses Kafka topics in two ways: to maintain the updates of the local streaming state stores, and as the channels for data shuffling, right?
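
The first of those two uses, backing a local state store with a topic, can be sketched as follows. This is a toy model under stated assumptions: `CountStore` is a made-up class, and a plain Python list stands in for the Kafka changelog topic. It is not the Kafka Streams API.

```python
class CountStore:
    """Toy local state store that logs every update to a changelog.

    The changelog list stands in for a Kafka changelog topic (an assumption
    made for this sketch).
    """

    def __init__(self, changelog):
        self.changelog = changelog
        self.counts = {}

    def increment(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        # Log the new value rather than the delta, so replay is a plain overwrite.
        self.changelog.append((key, self.counts[key]))

    @classmethod
    def restore(cls, changelog):
        store = cls(changelog)
        for key, value in changelog:
            store.counts[key] = value  # last write per key wins
        return store


changelog = []
store = CountStore(changelog)
for word in ["kafka", "streams", "kafka"]:
    store.increment(word)

# Simulate losing the instance and rebuilding its state somewhere else.
recovered = CountStore.restore(changelog)
```

Because the changelog is persisted outside the instance, local state becomes disposable: any replacement instance can rebuild it by replaying the log.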

Guozhang Wang:

Basically, if you compare that with some of the other streaming engines, which use in-network shuffling that depends on RPCs, we persist the shuffling data, as well as the changelogs, as Kafka topics. And the main idea behind that, again, is that persisting the data, as long as you are doing it in the right way, is not as expensive as you think. And actually, it also gives you a lot of benefits. One of them is what we achieved a couple of years ago: exactly-once semantics. Exactly-once is basically a processing guarantee, making sure that your input data is processed, and its results in both the processing state and the [inaudible 00:26:15] output are reflected, exactly once, even if there are failures. And we rely highly on the persistent data inside Kafka to achieve the exactly-once semantics. I don't know how much I can tell, because if I talk about the details, it may take another hour to do that.
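
One way to picture the guarantee described here is as an all-or-nothing commit of three things together: the output records, the updated state, and the position in the input. The sketch below is a deliberately simplified toy, not Kafka's transaction protocol; `ToyTransactionalStore` and `process` are hypothetical names, and a single in-memory object plays the role of the atomic commit.

```python
class ToyTransactionalStore:
    """Holds output, state, and the input offset; commits them all or not at all."""

    def __init__(self):
        self.output = []
        self.state = {}
        self.input_offset = 0

    def commit(self, new_output, new_state, new_offset):
        # In this sketch, these assignments stand in for one atomic transaction.
        self.output = self.output + new_output
        self.state = new_state
        self.input_offset = new_offset


def process(store, input_log, crash_before_commit=False):
    # Work on copies so that nothing is visible until the commit.
    state = dict(store.state)
    pending = []
    offset = store.input_offset
    for key in input_log[offset:]:
        state[key] = state.get(key, 0) + 1
        pending.append((key, state[key]))
        offset += 1
    if crash_before_commit:
        return  # uncommitted work is simply discarded
    store.commit(pending, state, offset)


input_log = ["a", "b", "a"]
store = ToyTransactionalStore()
process(store, input_log, crash_before_commit=True)  # first attempt fails
process(store, input_log)                             # retry from the committed offset
process(store, input_log)                             # nothing new left to process
```

Because the input offset moves only together with the state and the output, the failed attempt leaves no trace and the retry does not double count anything.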

Tim Berglund:

Yes.

Guozhang Wang:

But we have a series of blog posts and some articles explaining why leveraging the persistence of data storage in Kafka can reduce the implementation and design of these semantics to something simpler, so that we can actually achieve this guarantee.

Tim Berglund:

Okay. We will include whichever of those blog posts are live by the time this episode goes live. We will include those in the show notes, because that sounds like excellent extra reading. And I know we have maybe 15 more minutes to talk here, so unfortunately we can't dive into EOS. I'd love you to give us a good account of that, but we'll also link to previous shows we've done on the topic. So if you are interested, you can chase that down, but it's a good thing to understand how Kafka does it, because you start to dig into it and you start thinking, to do that you'd need two-phase commit, and isn't that slow? And then you look at the actual solutions Kafka uses and you're like, Oh no, okay, they pull this off in a performant way. And Hacker News told me that was impossible, but it turns out it isn't. So you've got to be careful what Hacker News tells you.

Tim Berglund:

And so you were originally working on Kafka Streams before ksqlDB was even a thing. Talk to us about that transition, again, from a career perspective: why was that interesting to you? And then give us an idea of where the interesting technical challenges are for you as a Tech Lead inside ksqlDB.

Guozhang Wang:

Okay, for sure. I think technically, I would say that ksqlDB, considering its functionality, and of course the challenges it encounters, is like a superset of Kafka Streams. ksqlDB provides three different types of queries that you can submit to it. One of them, which we call persistent queries, is basically compiled into Kafka Streams. The main use case of that can be categorized as streaming ETL, where you have some input streams and some database changelogs, and you want to build a streaming ETL pipeline where you can clean up your data, modify your data, join that data in real time, and then dump the resulting streams or tables into different data warehousing engines for query serving, right?

Guozhang Wang:

That is basically the primary usage for the first type of ksqlDB query, the part that is Kafka Streams. But then with the second and third types of queries, namely the push and pull queries, we also allow ksqlDB to go beyond the capability of Kafka Streams and do query serving. You can imagine that if you have streaming state that is being aggregated from the input, we also allow that state to be queried in real time. And not only the current state: potentially users can also query past states, like historical snapshots. This can enlarge the use cases from, for example, streaming ETL to other streaming use cases like monitoring, real-time dashboarding, anomaly detection, things like that. So if you ask me about functionality, I would say that ksqlDB definitely enlarges the capability of Kafka Streams in how people can interact with it and build their applications on top of it.
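
The difference between the two serving-style query types can be sketched with a toy aggregate table: a pull query asks for the current value and returns, while a push query subscribes and receives every subsequent change. `MaterializedTable` and its methods are invented for this sketch and are not ksqlDB's API.

```python
class MaterializedTable:
    """Toy aggregate table fed by a stream of events (not ksqlDB's actual API)."""

    def __init__(self):
        self.totals = {}
        self.subscribers = []

    def on_event(self, key, amount):
        self.totals[key] = self.totals.get(key, 0) + amount
        for callback in self.subscribers:
            callback(key, self.totals[key])  # push: every change is emitted

    def pull(self, key):
        return self.totals.get(key)          # pull: current value, then done

    def push(self, callback):
        self.subscribers.append(callback)


table = MaterializedTable()
seen = []
table.push(lambda key, total: seen.append((key, total)))

table.on_event("alice", 10)
table.on_event("bob", 5)
table.on_event("alice", 7)

current_total = table.pull("alice")  # point-in-time lookup of the aggregated state
```

The pull side suits request/response applications, while the push side suits dashboards and monitoring that want to react to each update.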

Tim Berglund:

All right. Yeah. That query capability, the pull query capability in particular, was a year and a few months ago as of the time of this recording. By the way, everyone, it's early December at the time Guozhang and I are speaking. So September of '19, I want to say, we announced pull queries, which was a way to, from outside the ksqlDB server, interactively query the results of a table, which was a signal of, "Hey, you know, this is a thing for applications, not just for streaming ETL." Streaming ETL is good, but it is also venturing into streaming application type features as well.

Guozhang Wang:

Yeah [inaudible 00:31:01].

Tim Berglund:

How about the KIP-500 stuff? This isn't so much streaming per [inaudible 00:31:08], this is more of a core Kafka thing, but the Raft implementation that is merged as of right now, right?

Guozhang Wang:

That's right.

Tim Berglund:

Okay. Merged as of a few months ago, if I recall correctly.

Guozhang Wang:

That's right.

Tim Berglund:

How is that like and different? We'll put a link in the notes to some good articles on Raft. Guozhang, I'm going to rely on you to tell me what you think the best articles are, since you're the guest today. I think those should be the articles you like plus the Wikipedia one. But how is Kafka's Raft implementation similar to and different from what one would know if one were familiar with the literature?

Guozhang Wang:

Yeah, so I think there's a lot in common with the literature from academia, so maybe here I will just emphasize some of the differences in our design. I think one of the main differences in our design compared with the literature is that, since we rely on Kafka logs as the backbone for the Raft implementation, and the Kafka replication mechanism is pull-based, meaning that the follower replica proactively pulls data from the leader, instead of the push model where the leader sends records to the follower, in this Raft implementation we also have to follow the pull-based replication mechanism. Whereas in the literature, if you look into most of the papers about Raft implementations, they are based on pull-based, sorry, push-based designs. And that difference can result in a lot of, I would say, [inaudible 00:32:53] in the details of how we still make sure that the whole design is correct in all kinds of edge cases.

Guozhang Wang:

So if you ask me one main difference, I will say that this push versus pull-based replication mechanism is the key difference that we have in our implementation.
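
A minimal sketch of the pull-based model described here, assuming toy `Leader` and `Follower` classes that are not Kafka code: the follower drives replication by asking for records starting at its own log end, and the leader learns how far each follower has replicated from the offset in that request rather than from acknowledgements to pushed records.

```python
class Leader:
    def __init__(self):
        self.log = []
        self.follower_offsets = {}  # follower id -> next offset it asked for

    def append(self, record):
        self.log.append(record)

    def handle_fetch(self, follower_id, fetch_offset, max_records=2):
        # The fetch offset tells the leader how far this follower has replicated.
        self.follower_offsets[follower_id] = fetch_offset
        return self.log[fetch_offset:fetch_offset + max_records]


class Follower:
    def __init__(self, follower_id):
        self.follower_id = follower_id
        self.log = []

    def fetch_once(self, leader):
        # Pull: the follower asks for whatever comes after its own log end.
        records = leader.handle_fetch(self.follower_id, len(self.log))
        self.log.extend(records)
        return len(records)


leader = Leader()
for record in ["r0", "r1", "r2"]:
    leader.append(record)

follower = Follower("f1")
while follower.fetch_once(leader) > 0:
    pass  # keep pulling until a fetch returns nothing new
```

In a push-based design the leader would instead send records to each follower and track their responses, which is the shape most Raft descriptions in the literature assume.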

Tim Berglund:

Okay. And it is push-based. Kafka is push-based.

Guozhang Wang:

Sorry. Kafka is pull-based. Literature is push-based. Yes.

Tim Berglund:

Okay. All right. I was going to say, that doesn't sound like how topic replication works. It's pull-based, so that makes sense. All right. I was going to drill into that, but now I'm okay. I don't have to. Yeah. And that is by itself a six or seven year old design decision that seems to be working out fairly well.

Guozhang Wang:

But I also want to take this opportunity to give our appreciation to a lot of the researchers in academia from the US and UK who have talked to us about Raft implementations and the state-of-the-art research, which actually nurtured a lot of our own design. So yeah, thank you. I can't give the actual names because it's a long list, but I just want to say we are really grateful for all kinds of help that we got from academia and also from industry as well.

Tim Berglund:

You know who you are and we're standing on your shoulders and we're explicitly acknowledging it.

Guozhang Wang:

Yeah.

Tim Berglund:

Back to you as a distributed systems engineer. Imagine you're advising someone who wants to get into this line of work, right? They think, this sounds like fun, I aspire to do this, it seems like the cool stuff. I don't like to create dichotomies in engineering. I think building things is difficult pretty much no matter what you're building. We're talking about the difference between full stack engineers and distributed systems engineers and the problems and difficulties endemic to both, and they're all there. So it's not like this is the cool kind, but some people think it is, right? And they're like, "Hey, I want to get into that." What would you tell somebody? What steps should they take? Should they get a PhD? How do they start? What's the-

Guozhang Wang:

I don't think that getting a PhD is necessary.

Tim Berglund:

It's just like nobody with a computer science PhD says, "Oh, definitely get a computer science PhD."

Guozhang Wang:

Yeah.

Tim Berglund:

Sorry. Dr. Wang, I interrupted you. Go on.

Guozhang Wang:

I think, yeah. I have been thinking about that also for a while. Maybe my answer is a little bit atypical. I think that if you really want to work in this area, what project you work on is definitely very important, but who you work with is equally important, if not more important. That actually is what I'm thinking now from my experience. When I started my career after grad school, getting to work with people like Jun, Neha, and Jay was actually a very lucky thing for me personally: I was able to really quickly get familiar with the current state of the art in the industry, and get to know how you should think in terms of making design trade-offs, how you should actually communicate and hear other opinions in this area, and how you actually find resources about other approaches in the literature.

Guozhang Wang:

So I think definitely picking the area or the project that you really feel excited about and that you find professionally satisfying is very important, but finding people to work with on a daily basis who can actually be your mentors and basically help you grow is equally important, at least when you start your career.

Tim Berglund:

So who you know above what you do. I like it. Of course, that poses its own challenge. That's been a thing that you had at LinkedIn: the ability to network with some folks who were true leaders in the space. If you didn't have access to them, say I don't have any super cool mentors, what would be a way I could get my toe in the water?

Guozhang Wang:

Maybe this sounds like an advertisement, but I would say people at Confluent today, especially all the engineers that we work together with, not only on Apache Kafka but also on various open source or non-open-source projects at Confluent, I find myself very grateful and very lucky to be working with them on a daily basis as well. So yeah, this may sound like an advertisement, but I will say if you join Confluent at an early stage in your career, I think you actually may already have a pretty good edge.

Tim Berglund:

This message brought to you by Confluent recruiting. Please see our open listings at Confluent.io/careers, link in the show notes. And you don't need to apologize for sounding like an advertisement, because there might be people who want to do that. And I even want to dig into that a little bit. So you're an engineering lead. You interview engineers. What do you look for?

Guozhang Wang:

I think, besides all the technical skills, right, I'm looking for people who are good at incorporating feedback and incorporating suggestions from others. I think that is actually the number one non-technical soft skill that I'm looking for in the sessions when I talk to candidates, because at Confluent we have a lot of smart people, but what makes me feel more grateful to work with them is that people are really good at incorporating others' feedback. Because when you come to design distributed systems, right, it's very unlikely that you will come to agreement on your first try. There will be different opinions on different trade-offs floating around. So how to actually address others' concerns, how to incorporate their feedback, and basically come to the final solution, I think that is actually very important. And like I said, all the people at Confluent here, especially the ones that I work with on a daily basis, are all very smart.

Guozhang Wang:

I think of them as also very good at taking others' opinions and incorporating others' feedback into the work they are driving. So I think those are, I would say, the qualities that I'm looking for in a candidate as well.

Tim Berglund:

I love it. So that is a soft skill: the ability to give and take feedback and argue about things in a productive way. As I try to explain whenever I'm talking to, say, a young person who's asking me, "Hey, should I get into software development?" I'll say, well, it's mostly not programming. It's mostly arguing back and forth about what to name things forever until you retire or die. So it's key that you be good at that, and that you be an empathetic person, a humble person, somebody who wants to serve the people that they work with and the team that they're on, and not someone who thinks they're always right, because we have all worked with folks who are very in love with the fact that they're right about everything.

Guozhang Wang:

Yeah. I think especially in Infra engineering, when you are building systems and you are building distributed systems, it's very unlikely to be a one-man show, right? Building such systems always requires a large group of people working together. So that's why I think this quality is very important for you to be successful in this area.

Tim Berglund:

Yeah. There seems to be some metaphor there: we're building systems that are composed of lots of pieces working together, and we ourselves in fact have to be lots of pieces working together, and doing it in productive ways, ideally without having to tolerate too many Byzantine faults. It's what you're saying: you'd like any failing nodes to be nodes that fail accidentally and not nodes that are intentionally trying to harm you. What are you looking forward to in the future? In the near term, what can you talk about in ksqlDB, or what are some interesting challenges in front of you right now?

Guozhang Wang:

Yeah, I think in terms of ksqlDB, the immediate challenge or the near-term challenge that we are trying to look into is basically improving the operability of the system. I think this is equally important to, for example, providing new features or new functionalities in the system. When we are planning for the near term, like the next quarter or the next half year, it always comes to my attention that we may actually be adding a lot of good features. Like we are adding pull query capabilities. We're adding like IOS into Kafka Streams and ksqlDB. Those are very good, but at the same time, keeping the system easy to operate and easy to maintain, including easy to deploy and to provision, and easier to troubleshoot and debug when there are issues, is also very important in addition to providing new features and new functionalities. So I think coming into 2021, improving the operability and the usability of the ksqlDB system is going to be one of the primary goals for the whole team and also for myself as well.

Tim Berglund:

My guest today has been Guozhang Wang. Guozhang, thanks for being a part of Streaming Audio.

Guozhang Wang:

Thank you so much for having me.

Tim Berglund:

Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST. That's 6-0-P-D-C-A-S-T to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. And any unused promo value on the expiration date will be forfeit. And there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out in our community Slack. There's a Slack sign-up link in the show notes if you'd like to join, and while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support. And we'll see you next time.

Tim Berglund picks the brain of a distributed systems engineer, Guozhang Wang, tech lead in the Streaming department of Confluent. Guozhang explains what compelled him to join the Stream Processing team at Confluent coming from the Apache Kafka®  core infrastructure. He reveals what makes the best distributed systems infrastructure engineers tick and how to prepare to take on this kind of role—solving failure scenarios, a satisfying challenge. 

One challenge in distributed systems is achieving agreement among multiple nodes that are connected in a Kafka cluster when the connection between them is, in practice, asynchronous.

Guozhang also shares the newest updates in the Kafka community, including the coming ZooKeeper-free architecture where metadata will be maintained by Kafka logs.

Prior to joining Confluent, Guozhang worked for LinkedIn, where he used Kafka for a few years before he started asking himself, “How fast can I get value from the data that I’ve collected?” This question eventually led him to begin building Kafka Streams and ksqlDB. Ever since, he’s been working to advance stream processing, and in this episode, provides an exciting preview of what’s to come. 
