Guozhang Wang is a software engineer and a computer science Ph.D. who I work with here at Confluent. He's recently written a paper on consistency and completeness in stream processing in Apache Kafka and submitted it to an academic journal. Now interestingly, this isn't just a thing he did in his spare time. It's a thing that is actually a part of his work here. And you might wonder why would a product company, a cloud company do a thing like that? Well, I ask him that question and I get him to talk us through his paper on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the Cloud.
Hello and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined in the studio today by Guozhang Wang. Guozhang is a software engineer at Confluent and a returning guest. You've been on the show before, right, Guozhang?
Yeah. I've been on the show a couple of times. This is the first time I've done that on video though. I have been on a couple of audios so.
That's right. By the way, audio-only listeners, you are absolutely the majority of the listening audience, but you can check us out on YouTube. You can see what Guozhang looks like. You can see what I look like. There's a teddy bear in the background of my ... What do you call this? My frame. So all these things are visible to you if you watch this on YouTube, but really no pressure. For most of my podcasts, I listen as audio.
I see, Tim, your frame is much better than mine.
Well, it's a professional obligation in my case. Anyway, you have recently published an academic paper, a paper entitled, reading now, Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka. And consistency and completeness are some rather big words, I mean important words. And I want to talk about that paper, but first I want to start with a bigger question or a higher-level question. Why does an engineering team at a product company, at a vendor, why invest in academia? I mean, this is computer science, and so people with PhDs don't go by doctor. You are Dr. Wang. But why does the company do this? Why do you get to spend your time doing this?
Mm-hmm (affirmative). Well, I think, as we are building software in the stream processing fields, I think both the company, the department, engineering, and myself feels that it is actually good to share our lessons and share our thoughts with the whole community because we also view ourselves as like the thought leader in this area. So, as a thought leader, I think it's also an obligation to really tell the folks, "Hey, this is what we think is important to address. And this is how we think is at least one way to address those challenging problems in the literature." So actually I think publishing to academic conferences is definitely one good way to do so.
When you put it that way, it just sounds like an extension of my team's work in developer relations. We're trying to explain things, on the one hand. Like, "Hey, here's how to use ksqlDB. Here's how to use Streams. Here's what Kafka is." And also, we're trying to get ideas out there. Like, "Here's how to build systems." And you're doing the same thing, just a slightly different audience.
Yeah. I think there's maybe one I will say slight difference, in that the communication and the message exchange is more neutral because a lot of papers and work has been published in this area. We actually have learned a lot from all other peers' work, and we're basically just pushing the target like one small step ahead within this whole community. That's basically how you see all the references on the paper and say, "Hey, we are actually stepping on some gen's shoulders to really making one small step on top of that." And the other people reading our paper will actually have this mutual idea exchange and also pushing forward the development in this area.
Yeah. And to some degree, everything is a remix, but that ... What's the word? That dynamic, that culture of citations and building on work and mutual conversation and support, it sounds like just the academic way of doing that. That's how academia functions. So you function as an academic in this space and say, "Here are some ideas. Let's talk about them and try and move the state of the art forward there," which I think is key because academic computer science produces things that can sometimes have a huge impact on what we call industry. So yeah, it's good. I'm glad you're doing it.
Yeah. Especially in data management areas. I think the connection is even more strong.
Yeah. So tell us about the paper. Folks, that's what this episode is going to be. We're going to talk about this paper and we're going to learn things. So give us the abstract.
Sure. Well, we published a paper on ACM SIGMOD 2021, this year's conference. That is [crosstalk 00:05:40].
Link in the show notes.
Sure. It is a virtual conference this year because of the COVID and pandemic, but I will say it's the number one academic conference in the area of data management. So we are really glad that our paper gets accepted by the committee, and that we can share really our thoughts and our lessons learned while we actually are going through the past three, four years of developing in this area.
And actually, the main focus is basically trying to discuss some key design principles or challenges around delivering correct results for stream processing. Because in the past, stream processing is really been considered as an auxiliary system to normal batch-oriented systems where you trade off accuracy for low latency, which means that your results can be sometimes lossy or approximate, but it can actually give you real-time data results. And in modern times, actually many of the-
Absolutely. 10 years ago ...
Go ahead. Sure.
I was going to say, 10 years ago when it was current to talk about the lambda architecture, the stream processing arm of the lambda architecture was described that way. Like, "This is approximate, lossy, inaccurate." You need the batch system for the real thing, but the stream thing is like ... it's got an asterisk on it.
Exactly. Yeah. But I think things have really changed a lot in the past decade, actually. Modern systems, including Kafka Streams and ksqlDB from Confluent, but also many other peer systems, actually have been thrived to really try to say, "We should really consider stream processing as a source of truth in your data-driven applications, in your real-time pipelines." And to do that, it should really provide you this strong guarantees, delivering correct results, even in the face of failures. That's actually the key point of this paper.
And many methods actually have been proposed to try to tackle those problems. In our paper, we are basically trying to say, "Okay, this is another alternative way that we think is also worthwhile to discuss about to basically tackle the correctness guarantees in stream processing."
So walk us through it. What's the problem and what's the solution? Yeah. I mean, the problem is correctness in the face of failures, but more specifically.
I can elaborate on that a little bit more, of course. So in our paper, we discussed two major properties within this umbrella of correctness guarantees. One is called consistency. Another is completeness. For consistency, we actually are referring to exactly when there's a failure. Stream processing frameworks basically be able to not only recover automatically from the failure but also mask this failure as if there are no failures that happened at all. The masking means that we should really give you consistent results, such that there's no data is duplicated and there's no data that is lost during the failure.
So many systems that are proposed maybe five or 10 years ago, they basically gave you this at-least-once semantics, so to speak, where it means that, if there is indeed a failure happening, upon recovery, sometimes manually, sometimes automatically, you may actually get duplicated results because you do not have resume the whole application from the consistent snapshot of verification.
Whereas, right now, we are actually striving to basically provide this Exactly-Once Semantics for consistency, which means that even if you have any failures, all the records in your data streams are processed once and exactly once. That is one aspect of this correctness guarantee.
Another big aspect of the correct guarantees is basically completeness. And for companies, what basically we are referring to is that, in data streams, data is continuous and it basically can income forever. There are no real boundaries on your so-called input dataset. And because of that, there's a key concept of [inaudible 00:10:08] ordering or time skewness in your data streams.
In the ideal world, we believe that all the records that are basically sent into your source data streams are ordered. This means that the record that is generated later is also inserted into your data streams later. But in reality, it is not always the case, because you can have delays on your sources where you collect those data streams. You can have artificial re-orderings in your operators as well. So when those [inaudible 00:10:41] auto data occurs where some data which actually it generates first gets arrived later. How you can [inaudible 00:10:48] about the completeness of your input and generate the correct results is a key factor in stream processing. And that is another aspect, basically, we are discussing in our paper about how to really tackle this challenge.
So I want to ask some questions about just how academia functions in this way because I'm not an academic and I never have been. So, you published this paper in ... You said it was ACM?
In ACM SIGMOD, yeah.
Yeah. Okay. So you published the paper with the ACM and it's printed. And are you expecting dialogue? Are you expecting merely references? Like, "Okay, now it's here and people are going to build on it." I mean, you're trying to advance the state of the art and share knowledge into the academic community that we've developed as a company, but what happens at this point?
Well, at this point, basically we definitely want to see more adoptions of our ideas, like with discussing the paper. If you look into the literature, there are many big systems and great systems that are actually oriented from academic papers. Academic papers provide you this proof of concept design, which gets groomed into a big system.
Whereas nowadays, it's more mutual, means that the industry, like in Confluent, companies, actually already have this pretty mature system called ksqlDB and Kafka Streams, which is basically the key role that we discussed in our paper, is the runtime of ksqlDB. And it actually has been out there and running in production for many years, and we basically want to really summarize what we have achieved and what was our lesson learned through all this journey back to academia, so that academia can actually also have mutual idea exchanges within the stream.
So I think there are maybe two patterns. One pattern is that you groom some proof of concept idea from academia and getting it more matured into the industry. Whereas the other pattern that we are observing and we are doing now is that you have a system that you have been developing and pushing in production for some years and you actually are basically feeding back to academia, sharing your thoughts and your ideas.
But it's not like somebody responds and says, "No, you're wrong." I mean, it's not like if it was-
Well, it could be.
... the liberal arts. It could be? Okay.
It could be. I think it's actually a good thing, because like I said, there's really no golden standard way to say, "Oh, there's only one way to tackle certain problems and all the other methods are just wrong." I think in the real world, it's just like there are many ways to tackle the same problems, and for different scenarios, for different use cases, some approaches may actually be preferred over others. And it's really up to people to really be aware of all of those different ideas and different approaches. And when you are designing your system or when you are picking which [inaudible 00:14:16] systems you want to use, you can basically have the full knowledge on all of those. So I think even some debate on, "Okay, which system is superior on certain use cases, certain scenarios," I think that is totally expected and it's normal.
Got it. I interviewed Matthias Sax a few months ago, we should put a link to this in the show notes, on watermarks and why we don't use those in Kafka Streams. That watermark or no watermark debate would be that kind of informed decision, and you can draw on the academic literature, or 12, 15 years ago, vector clocks or last write wins for managing consistency in a distributed database, of course.
[crosstalk 00:15:01] golden standard ways to basically track progress, actually. As I said, it's not saying that "Oh, we think that vector clock is not the right way." To me personally, I still feel that in many cases, using a vector clock is actually still a very good way to tracking progress. I just feel that there are many other ways we should consider.
Yeah. I shouldn't have brought that up, because that's an interesting one, because from a formal perspective, such as what ... And you're not a full-time academic. You're a Ph.D. who works in the industry. You write code for a living. But if you are a full-time scholar of computer science, that's the kind of situation where you could easily convince yourself, "This is formally a stronger solution." And then you go to developers trying to build applications based on systems that use vector clocks and managing their own consistency. There are conflicts between rights and you end up with code that's buggy.
So the industrial experience and the formal evaluation can also be different, which I suppose that's a vacuous thing to say. "Sometimes industry and academia don't agree." "Yes, Tim, you're right." "Thank you for that insight. My guest today has been Guozhang Wang." We all know that, but ... So it's not so much debate. I mean, yes, there's debate, but it's more like, "Here's a contribution. Here's a thing to think about. Now you can build on it."
So listening to you describing the problem, it sounds like the Exactly-Once Semantics of four years ago are just the backbone of the solution. Can you walk us through that a little bit? So you laid out what the problem is. What is the solution that you propose?
Yeah. Totally. I think, because Kafka Streams, aka ksqlDB, is built around Kafka, and Kafka is considered as a [inaudible 00:17:13] log replicated across different machines. So our approach for tackling Exactly-Once is also heavily leveraging the fact that we have an immutable order and persistent log that we can leverage.
So the key idea here is that whenever we are actually trying to do any of the operations that is atomic, meaning that let's say, when you are assessing a record, you want to maybe do operations such as you want to commit its position on the source, you want to generate some outputs into your downstream streams, or maybe you want to update some of those states. And Exactly-Once requires that all of those operations need to be done atomically, meaning that, even if there are failures happening, you basically can guarantee that those operations either succeed altogether or none of them succeed. You don't want to have partial effects. Partial [inaudible 00:18:10], for example. If you have already sent your output downstream, but you fail to commit to your input stream, then upon recovery, you will basically process your input record again, which will generate duplicate results.
So by actually leveraging the log, we basically can actually achieve top atomicity by basically reducing all of those operations when you are processing one or more records into log appends. And by leveraging a two-phase transaction protocol, we can make sure that all of those logs append, which is basically representing as committing your input, sending and acting on your output, and also updating on your states can be done in a transactional way, and they are all atomic. That is basically the key idea that we are proposing here.
If you compare that with many different literature approaches, which is leveraging check-pointing, or for example, watermarks that are specifically delivering through as global consistent snapshots, I think one, I will say, benefits from using this log-based approach, in that you can actually consider having shorter end-to-end latencies because you do not really need to couple the problems of handling consistency from, for example, handling completeness. You need to actually [inaudible 00:19:36] waiting a bit longer to basically achieve those properties.
Okay. I've lost the question I was going ... Okay, no, the two-phase commit thing you've described. Again, that sounded like EOS that you were describing, which is Kafka's answer, and this is how we do it and it's, inside the Kafka world, a four-year-old thing, but so many people still don't understand it. Just a lot of people don't use it and it's mysterious. And I just want to separate out, at a high level, since we don't have a huge amount of time and we're not sharing a whiteboard ... And the paper is linked in the show notes and you can read it. You don't even have to pay, so just go read it. But using the log is intuitively satisfactory. We've got this great distributed log and we know that logs of immutable events are a good way to agree on progress and converge on an account of the state in a system. And all that is okay.
Two-phase commit sounds like a bad word. And again, this is not a new thing in Kafka, but it strikes me that that might be some of the criticism that comes our way with this approach. So, in the 2000s, when people were doing that kind of thing in enterprise systems and you had the Java transaction controllers and things like that, it was death. Nobody liked that. Why is this different?
I think actually the main difference is that, although I call it a two-phase commit, maybe I need to consider using a different term, but the two-phase commit is not really on the stored data on the states themselves. The two-phase commit is on the logs, indeed. And there's a reason that we use logs for two-phase commit, is that, as I said, logs are ordered. That's why, when you basically use a two-phase commit, we do not really say, "Okay, we have to wait on, for example, making sure that all of our operations before are finished." Because once you do appends, all the appends are naturally ordered by the time they get into the log.
And more details are in the paper, for sure, but by basically using those so-called transactional markers that we actually use as log appends, basically semantics is basically meaning that all the log-append that is before me on the log can be considered either as committed or ordered. And the leveraging on this natural ordering between different records in the log, the two-phase commit, we leverage on Kafka only requires to write once on the log. Whereas if you look into the standard literature for two-phase commit protocols, you basically do the write twice. One on the states itself, and another time on the log. I think that is a big difference that we are actually advocating in our paper.
Got it. And I don't think it's the wrong term ... Well, okay. It's a term with a bad name and a bad history. So if there's something else we could come up with, that would avoid the raised eyebrows that we get. But it seems faithful to me, because you still have a transaction controller that is ... you've got these parts that are doing things and they succeed, but the thing they succeed at is appending to a log. And then there's this other, "Well, I'm the leader of this thing and I'm going to make sure those appends happened, and they have." So it's still there, but I guess each individual contributor to the transaction still just logs appending. And [crosstalk 00:23:19].
Just like I said, I won't say that our proposed ideas are totally new. We are basically standing on the shoulders of many gens. We are just pushing one step further, we think, that can actually ... In the new streaming world, by leveraging on the log even heavier, we can actually make some more nudges around different [inaudible 00:23:39].
Absolutely. And that's always the case in industrial settings or academic settings. You are not doing this all by yourself. Somebody taught you things. Somebody helped you. Somebody gave you something to start with. I didn't write my own text editor. I didn't write Z shell. IntelliJ is something I bought from somebody. All these things are true, and we build on each other's efforts.
What's next? You got anything in mind that you're able to talk about that might be your next academic contribution?
Well, we actually do have a couple of interesting projects that I'm working on. Maybe in the future, we want to actually feedback into academia as well. One project we are working on is a so-called KRaft protocol. KRaft stands for Kafka Raft.
I think we have this pattern of naming everything with a K at the moment. But basically what we are doing about it is, as part of this ZooKeeper removal project, where we want to actually let Kafka itself manage the metadata, again, like a log, instead of using ZooKeeper. Because when you look at what really we are using ZooKeeper for, we are using ZooKeeper to manage the whole metadata of Kafka. Things like, which brokers act as a leader for those topic partitions? What are the ISRs for those partitions? What are the credentials, or maybe ACL configs for those topics and [inaudible 00:25:14]? Those are all metadata we store in ZooKeeper.
And we actually try to basically build this metadata as special logs. I call it the log for all logs because all the Kafka data is also in logs. But these special logs using for metadata are going to be used to maintain all the meta-information within Kafka, which we used to basically store them in ZooKeeper. And that'll actually help us to basically provide this quorum controller instead of a single controller today that is single, I will say, scalability bottleneck within Kafka.
Of course, the Raft algorithm is not even in [inaudible 00:25:52], but we actually are making it to be a little bit different from the original paper of Raft to make it really work smoothly within Kafka. I think that is something we want to also maybe feedback into the academia. Just like I said at the beginning, I think the main purpose for us, for Confluent, to write papers in an academic conference is to say that we have actually observed and learned so much from the literature from the community. I think it's really our obligation to really share our thoughts to feedback into the community as well.
So true. And yeah, KRaft is the emerging name for that, capital K, capital R, little A-F-T. And I know it as KIP-500, which is imprecise, but the KIP-500 thing or the quorum controller or these names that it goes by. KRaft or you see it in print, Kraft, always makes me think of a garishly orange macaroni and cheese. And I don't know why. That association is there somehow. And again, that's merged into the main of Kafka and it's released for non-production use as of Kafka 2.8. This is a thing that's out in the world and you want to get it into the minds of academia so they can think about what's going on here.
Well, my guest today has been Guozhang Wang. Guozhang, thanks so much for being a part of Streaming Audio.
Yeah. Thank you so much for having me.
And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud use the promo code 60PDCAST, that's 6, 0, P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit. And there are a limited number of codes available, so don't miss out.
Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter, @tlberglund, that's T-L-B-E-R-G-L-U-N-D, or you can leave a comment on a YouTube video or reach out on community Slack or on the community forum. There are sign up links for those things in the show notes if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover it, especially if it's a five-star review, and we think that's a good thing. So, thanks for your support, and we'll see you next time.
Stream processing has become an important part of the big data landscape as a new programming paradigm to implement real-time data-driven applications. One of the biggest challenges for streaming systems is to provide correctness guarantees for data processing in a distributed environment. Guozhang Wang (Distributed Systems Engineer, Confluent) contributed to a leadership paper, along with other leaders in the Apache Kafka® community, on consistency and completeness in streaming processing in Apache Kafka in order to shed light on what a reimagined, modern infrastructure looks like.
In his white paper titled Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka, Guozhang covers the following topics in his paper:
For context, accurate, real-time data stream processing is more friendly to modern organizations that are composed of vertically separated engineering teams. Unlike in the past, stream processing was considered as an auxiliary system to normal batch processing oriented systems, often bearing issues around consistency and completeness. While modern streaming engines, such as ksqlDB and Kafka Streams are designed to be authoritative, as the source of truth, and are no longer treated as an approximation, by providing strong correctness guarantees. There are two major umbrellas of the correctness of guarantees:
Guozhang also answers the question of why he wrote this academic paper, as he believes in the importance of knowledge sharing across the community and bringing industry experience back to academia (the paper is also published in SIGMOD 2021, one of the most important conference proceedings in the data management research area). This will help foster the next generation of industry innovation and push one step forward in the data streaming and data management industry. In Guozhang's own words, "Academic papers provide you this proof of concept design, which gets groomed into a big system."
If there's something you want to know about Apache Kafka, Confluent or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.Email Us