Levani Kokhreidze is a principal engineer at Wise. Wise is a company that facilitates international currency transfers for ordinary people, like for travel use cases and things like that. They're big Kafka users and big stream processing users using a lot of Kafka Streams and some Flink. And Levani's going to tell us about how they work through the process of deciding whether to use one system or the other.
Before we get to that, let me remind you that Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io. It's a website with everything you need to get started learning Kafka and Confluent Cloud. We've got interactive courses, video courses, executable tutorials, a library of stream processing patterns, and an index of episodes of this podcast. Everything you need. When you do the interactive stuff, you get your hands on the keyboard, as you need to do. You'll want to sign up for Confluent Cloud using the PODCAST100 code to get an extra free $100 of usage credit. It's pretty cool. Check it out. But for now, let's get to today's conversation.
Hello, and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and you may notice I have a slightly different background. I'm in the middle of a month-long trip across Europe as a part of a series of events that we call our Data in Motion Tour. By the time this airs, that will be in the past, but I'm coming to you from a hotel room in Frankfurt, not from my office as usual, but I'm delighted to speak today to Levani Kokhreidze. Levani, welcome to the show.
Thanks Tim. Thanks for having me.
You bet. And since I'm here, we get to speak at a nice early time. Normally, I'd be asleep now, but we have more of your day available to you. So we're going to talk a little bit about Kafka streams and Flink today. That's really our subject, but I would love to start by hearing more about you and Wise, the company you work for. Tell us about Wise, tell us about your role, how you came to be in that role. [crosstalk 00:02:13].
Yeah, sure. Yeah. My name is Levani. I've been at Wise for about four years now, and I joined the company in the beginning as a fraud... Into the fraud team. The fraud prevention team, not fraud.
Fraud prevention? There's always a distinction to make there. There's the fraud prevention team, and then there's the fraud team.
[crosstalk 00:02:33] an old Dilbert joke about that. You must be thinking of the flu prevention darts. Anyway, yeah. Fraud prevention?
Yeah, exactly. But then I moved to the platform team, and with my team, we were building the stream processing platform internally at Wise to help scale our usage of Kafka Streams and our real-time aggregation queries. So Wise itself is a multicurrency account. It's one of the biggest startups in Europe. Just last quarter, we moved around 18 billion pounds, and we have around 11 million customers. So this gives a bit of perspective on the scale that we have and how important it is for us to be competitive in this business. We have to be as fast as possible at transferring customers' money from one part of the world to another part of the world, and in this-
[crosstalk 00:03:33] the business is money transfer [crosstalk 00:03:36] particularly between currencies?
Yeah, and there are also plastic cards, so you have a Wise debit card that you can use in different currencies. For example, you can have your money in USD, and when you go to the US, you can spend the money in the local currency. When you go to Europe, you can have your euro account and spend money in euros. So we also have a debit card.
I feel like it would've been good for me if we would've had this conversation a month ago before I started my trip across half a dozen countries.
Probably. And yeah, Kafka plays a very, very critical part in this whole money transfer, and nothing happens at Wise without a Kafka event.
Oh, very nice. I like the sound of that. How did you come into this role? So what sort of background do you have?
Yeah, when I joined the fraud prevention team back then, there were scaling issues in terms of real-time queries and aggregations. As you can imagine, since we're a money transfer company, we also have to do all the compliance checks, make sure that bad people don't use our platform for bad things. And fraud prevention is also one of the crucial parts of it. And back then, there were challenges of... Wise was already using Kafka as an event bus, but we were still a bit more into the monolithic architecture. We already had microservices four years ago, but still, a big chunk of our business logic was living in a monolithic application. But since then we have gradually moved into microservices, and now we have around 400 microservices working together.
Are those all events... Would you describe them as reactive or event-driven or whatever sort of a word you'd want to use?
Yeah, I'd say most of them are event-driven. So pretty much most of them emit some sort of events. We do have RPC calls through Envoy and through our service mesh. So that's also a very big part of this, but pretty much any background thing that doesn't have to be... That users don't have to see on the webpage, for example, or when they use the app, happens through events.
Nice. So stream processing. You're Flink users, and you're also Kafka Streams users. So kind of walk us through the evolution of that. How did you come into those technologies? And obviously, what we want to talk about eventually is, how do you decide when to use one or the other? Because that's the thing that people think about.
So we are mostly Kafka Streams users. We've been Kafka Streams users for around three years now, a bit more than three years. So first we were ordinary Kafka client users, and we were using Kafka as an event bus. But then Kafka Streams came, and then there was a need for a stream processing platform. And we decided to go with Kafka Streams just because it was way easier to adopt than Flink, because Flink is pretty much a whole clustering framework, while Kafka Streams is just a library and you can build the necessary tooling around it. So it was very appealing for us. And yeah, we are mostly Kafka Streams users right now. We have around 300 stream processing jobs running based on Kafka Streams. And we process a lot of data with this, and we store a lot of state as well.
So in total, we have around maybe 50 terabytes worth of state based on Kafka Streams. So I think we are one of the biggest users of Kafka Streams out there. But yeah, there are more use cases where stream processing jobs have to store a lot of state. And in those cases, we do see a benefit of Flink, mainly because one of the great advantages of Kafka Streams is that it stores the changelogs into Kafka as a backup, right? But it's a bit of a weakness in the sense that we have to scale the Kafka brokers proportionally to how many stream processing jobs we have, because of the number of changelog topics and repartition topics we have. So in the cases where there are some applications, some stream processing applications, that have to store terabytes worth of state, just to ease the pressure on our Kafka cluster, we want to go with Flink.
And also Flink has a few nice features that at the moment are missing from Kafka Streams. Savepoints, I guess that's one of the biggest ones. So in Kafka Streams right now, there is no way to roll back your state in a consistent manner to some point back in time, like picking a time, essentially. So if you introduce a bug, you release a new version of your stream processing job, and there is a bug, and the state gets corrupted, you have to pretty much wipe out your state and start from the beginning. In Flink, there is the notion of savepoints, and with savepoints, you can, for example, roll back your state to yesterday in a consistent manner. That's big. That's, I think, a very big advantage in the cases where we have to store a lot of state and just replaying everything takes a lot of time. But we do prefer Kafka Streams overall, just because it's way easier to maintain and it's way easier to integrate in the overall stack that we have at Wise.
That's always encouraging to hear because that's the thing everybody says, right? When you're thinking, which one of these should I use? Well, Kafka Streams is a library and you guys were already writing Java microservices. I assume Java, right?
Or certainly a JVM language, but probably Java. And well, gee whiz, here's a library that gives you stream processing capabilities. So by default, that's the easier choice. You mentioned changelog topics and repartition topics. Now, I'm assuming, everybody in the listening audience, that you basically know what Kafka Streams is. I will try to link in the show notes to an introductory podcast or two. Even recently, Bill Bejeck has been on the show, giving an overview of the Kafka Streams course that he wrote for Confluent Developer. So if you need a refresher, we've got a course on Confluent Developer and we've got a recent podcast episode that will take you through the basics. But what's a repartition topic? Assuming you know, audience, what Kafka Streams is.
So the repartition topic is essentially for when you want to change the key. So for example, let's say we have a stream of transfers coming into our stream processing job, and in one of the fields in the payload of the stream of transfers, we have a profile, for example, and we have a number of transfers this profile has made. Now, if we want to aggregate, let's say if we want to sum all the invoice values that the profile has made in a given time, then we have to repartition the stream based on the profile, so that the data will be co-located in a specific thread, essentially. This thread may be on another process or another JVM instance or whatever. But the point is...
We call it another JVM.
Yeah. The point is to co-locate the data. So if you want to aggregate based on a certain key, you have to co-locate the data, and repartitioning is the operation that does that, based on the key you select for what you want to do, essentially stateful aggregations.
Basically, the way that has to happen is you have to create a new topic that has a new key.
That reflects the grouping that you're going to want to do for the aggregation that you're computing next. And that's, I guess generally, you co-locate because you're going to want to group and because you're going to want to run an aggregation reducing function across the group. Is that generally the case?
Yes. Yeah. Essentially that's the case. Yes. And Kafka Streams also does it, in some cases, very smartly. So it has optimizations in terms of topologies. For example, let's say you want to do different aggregations, different types of aggregations, based on the same key. In earlier versions of Kafka Streams, it would create as many repartition topics as you have aggregations, but in the newer versions of Kafka Streams, we now have topology optimization. So instead of separate repartition topics for the different aggregations, it will create a single repartition topic, and that repartition topic will be used for the different aggregations if they're on the same key. So that's a very nice optimization.
Oh, yeah. No, that is because... It's not... Repartitioning is not the worst thing in the world. I mean, it's a thing you do in Kafka. Either you let Kafka Streams do it for you or if you're writing your own consumers, I mean, it's a thing. So it's not like some kind of failure in life, but it does have a cost. You have to move that data around and you have to store it again and those things actually happen. So it's nice to do it as little as possible, I suppose.
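To make the repartitioning mechanics discussed above concrete, here is a plain-Java sketch. This is purely illustrative and not the Kafka Streams API: the record shape, method names, and the use of `hashCode` are all hypothetical (Kafka's real default partitioner uses murmur2 hashing). The point it shows is that re-keying by profile makes every record for the same profile land in the same partition, so each partition can be aggregated independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: records arrive keyed by transferId, but we want
// per-profile sums, so we "repartition" by profileId first, the way a
// repartition topic would. hashCode is a stand-in for Kafka's murmur2.
public class RepartitionSketch {
    record Transfer(String transferId, String profileId, long amount) {}

    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    // Re-key by profileId and bucket into partitions, co-locating each
    // profile's data in a single partition.
    static Map<Integer, List<Transfer>> repartitionByProfile(List<Transfer> in, int numPartitions) {
        Map<Integer, List<Transfer>> partitions = new HashMap<>();
        for (Transfer t : in) {
            partitions.computeIfAbsent(partitionFor(t.profileId(), numPartitions),
                                       p -> new ArrayList<>()).add(t);
        }
        return partitions;
    }

    // Each partition can now be aggregated on its own thread, with no
    // cross-partition coordination needed.
    static Map<String, Long> aggregate(List<Transfer> partition) {
        Map<String, Long> sums = new HashMap<>();
        for (Transfer t : partition) {
            sums.merge(t.profileId(), t.amount(), Long::sum);
        }
        return sums;
    }
}
```

In real Kafka Streams this whole dance happens behind a `groupBy` followed by an `aggregate`, with the repartition topic created for you.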
Okay. And that repartition topic is a thing that gets created upon the re-keying of an input stream, and changelog topics are the topics that back tables, right? So if you have a KTable, you have to remember what's in the KTable, and that's a changelog topic. And these things are stored. Anytime you've got state or you have to repartition, it's like extra Kafka that you're using. Now, this is... And by the way, please feel free to disagree with me on this. I'm kind of expressing my opinion of the architectural decisions that Kafka Streams has made. In every way possible, Kafka is built out of Kafka, right? Anytime you need to do a new thing, if there are existing pieces lying around, you take them and try to use them. Well, in Connect, when you have to remember input source offsets or something like that, well, you use a compacted topic. It uses an internal topic for that.
So if you need storage... If you need distributed, fault-tolerant, replicated storage, you've got some lying around. So you use Kafka. You make a topic. Again in Kafka Connect, when you have to rebalance tasks between Connect workers, it's something a lot like the consumer group rebalance protocol, basically, that's repurposed. So you've got Kafka pieces lying around. You make more Kafka out of that. And this is that. So now Kafka Streams needs various places to store kinds of internal state, and well, there's Kafka, so make more topics. And so, in my mind, the key architectural opinion of Kafka Streams is that it's a library. So it enhances your code with stream processing functionality, but stuff has to come along for the ride. And that stuff is extra compute, because you're doing computation now in your application that you weren't, and extra storage. And with the combination of those two, you have extra IO between your application and the cluster.
And I guess what you're saying is, if I heard you correctly in the first real big question I asked, most of the time that works out in favor of Kafka Streams, it's like usually, that's the thing that you want, and writing a job and deploying it to some other cluster where your compute happens isn't usually what you want. [crosstalk 00:16:34] but then sometimes it works. So kind of tell us the story of how you got there. What did you bump into that made you think, wow, I really need a separate cluster for this?
There was a certain use case where one of our product teams... So just to explain a bit more about our platform, for context: we don't write the streaming jobs for others. We are enablers, so product teams can write the stream processing jobs for themselves. So we don't take on business logic, but rather we try to bring the libraries, the technology, the tooling, and the self-services to the product teams at Wise, so the product engineers can do their jobs themselves. So we are very autonomous, in a way that we don't dictate what you have to use, but rather we try to build it in a way that, okay, this is already built, and I don't want to do anything else, so it just fits our philosophy. So autonomous teams are very much at the core of Wise engineering, and that's what enables us to scale.
And based on that, since we have that many clients, we don't really have the capacity, and we don't really want to know exactly what each stream processing job does. Rather, we want to trust product engineers that they're doing their jobs, based on whatever governance and tooling we provide. And one of the use cases that we had a while back was a stream processing job that had to aggregate a lot of state. And by a lot, I mean a few terabytes. So it was pretty much consuming the whole history of TransferWise transactions. Sorry, when I say TransferWise, it's Wise, because we just rebranded, so I may sometimes mix things up. I'm sorry about that.
I understand. I happen to be wearing an on-brand shirt today. Sometimes I might have something on with the old logo. It just slips through. I totally get it. Anyway, go on. Wise.
I'm assuming the whole transaction history.
Yeah. And because we have to fill a financial ledger based on those things, and all the complex things, which I don't know how it works, but I know that it's a complex thing. But anyway, they-
There's a product team that knows that?
Yeah, there's a product team that knows that, but they have to consume the whole history, and they have to join it with other input streams. So overall this whole join, and then the later aggregation, translates into a very, very big state. Usually we handle this with Kafka Streams, and it's not a problem, really, to store that much state. But in this particular case, the thing was that they wanted to fix their state from time to time. What this means is that in finance, it's very common to add a correction into your calculations later.
So in stream processing, usually when you have a window, this window has some kind of time up until that window closes. When this window closes, you can't really do much about it. So if you send a new event and it doesn't fall into this window, it discards the event. But in finance, what they wanted to do is make a correction, let's say, a few months back, sometimes even two or three quarters back in time. That's obviously a problem, because then you would have to have a window that large, so it can accept the correction events. But having a window that large translates into a very, very big state.
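The window-closing behavior described here can be sketched in a few lines of plain Java. This is only an illustration of the behavior, not the Kafka Streams windowing API (which, for the record, also supports a configurable grace period for late events); the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a tumbling-window aggregator that discards
// late events: once a window has closed, corrections for it are lost.
public class TumblingWindowSketch {
    private final long windowSizeMs;
    private final Map<Long, Long> sumsByWindowStart = new HashMap<>();
    private long watermarkMs = Long.MIN_VALUE; // highest event time seen so far

    public TumblingWindowSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Returns true if the event was applied, false if dropped as late.
    public boolean add(long eventTimeMs, long amount) {
        long windowStart = Math.floorDiv(eventTimeMs, windowSizeMs) * windowSizeMs;
        long windowEnd = windowStart + windowSizeMs;
        if (windowEnd <= watermarkMs) {
            return false; // window already closed; the correction is discarded
        }
        watermarkMs = Math.max(watermarkMs, eventTimeMs);
        sumsByWindowStart.merge(windowStart, amount, Long::sum);
        return true;
    }

    public long sumFor(long windowStart) {
        return sumsByWindowStart.getOrDefault(windowStart, 0L);
    }
}
```

Accepting a correction two quarters late would mean keeping every window open for two quarters, which is exactly the "very, very big state" problem Levani describes.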
But alternatively, what one can do with Flink, for example, is that when you do want to make the correction, you roll back your state to last quarter. You don't have to replay everything in that case; you can replay only the partial stream of events for that last quarter. And then you will get your correction events and your aggregation state back to the real-time stream. So this is one of the biggest use cases where we saw that Flink had an advantage over Kafka Streams.
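The savepoint-style workflow described here might be sketched as follows. This is a toy model of the idea only, not Flink's actual savepoint API: snapshot the aggregation state at some point, and when a correction arrives for an already-processed period, restore the snapshot and replay just the events after it, with the correction substituted in. All names and the correction mechanism are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of savepoint-style rollback: restore a snapshot,
// then replay a partial stream instead of the whole history.
public class SavepointSketch {
    record Event(long timeMs, String profileId, long amount) {}

    private Map<String, Long> state = new HashMap<>();
    private final List<Event> log = new ArrayList<>();    // retained source events
    private Map<String, Long> snapshot = new HashMap<>(); // last "savepoint"
    private long snapshotTimeMs = Long.MIN_VALUE;

    public void process(Event e) {
        log.add(e);
        state.merge(e.profileId(), e.amount(), Long::sum);
    }

    public void takeSavepoint(long nowMs) {
        snapshot = new HashMap<>(state);
        snapshotTimeMs = nowMs;
    }

    // Restore the savepoint, then replay only the events after it,
    // applying the correction in place of the bad event.
    public void rollBackAndCorrect(Event correction, long replaceAtTimeMs) {
        state = new HashMap<>(snapshot);
        for (Event e : log) {
            if (e.timeMs() <= snapshotTimeMs) continue; // already in snapshot
            Event applied = (e.timeMs() == replaceAtTimeMs) ? correction : e;
            state.merge(applied.profileId(), applied.amount(), Long::sum);
        }
    }

    public long total(String profileId) {
        return state.getOrDefault(profileId, 0L);
    }
}
```

The win is that the replay is bounded by the snapshot interval, not by the full history of the stream.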
Gotcha. So in short, you might say a very large job where suddenly the cost of deploying a cluster and the relative friction of deploying a stream processing job to that cluster, and not just a part of your regular application and everything, that's now worth it because of the large amount of state.
That's associated with the job, but an ordinary microservice wouldn't feel that, right?
No. No, definitely not. Because with an ordinary microservice, there are a lot of benefits that we can get from Kafka Streams, like Interactive Queries. Flink has a similar notion of queryable state, but it's way more complex to reason about than it would be with Kafka Streams, because it needs the whole cluster framework. With Kafka Streams, it's way easier for us to integrate it within our service mesh, essentially, and let the teams also use this way of querying the state.
How is... Just a question for you about deployment automation. What's your normal way of deploying a new microservice? And what's it like to change a stream processing job in Flink? Compare those two.
Yeah, so we run pretty much everything on Kubernetes. We also run our Kafka clusters on Kubernetes. We recently published a blog post on how we migrated from EC2 to Kubernetes. So mostly, we are running our microservices and also stream processing jobs on Kubernetes. And the provisioning of a new Kubernetes deployment is very easy. You just open a GitHub pull request, it gets merged into the main branch, and then it creates a whole deployment for you. It's the same for stream processing. But since a stream processing job is a stateful application, we have some other automation around this. For example, as I said before, we are kind of library and tooling providers, so everything that product engineers are doing, they are doing through our libraries. And one part of it is that we also provide tooling to build Docker images specifically for stream processing, and building those Docker images is completely transparent for product engineers.
But when they release those Docker images into production, they also have a different set of capabilities they get from this. We have a release UI, where they can go and release or stop things in production. And there they have an additional function, like reset. And reset is a very nice feature, because they can reset their state and re-consume the events from the original topics. So they can reset their state and start re-consuming their stream if they want to deploy a new version of their application, for example, and-
That's general Kafka Streams applications and Flink, or just one?
Right now it's mostly Kafka Streams. We are developing self-service for Flink as well, to do the same thing. And the one-
[crosstalk 00:24:35] any thoughts about... Oh, go ahead.
Yeah. One important note on why this is possible is that we store all the original event streams in compacted topics. So we have pretty much the whole history of Wise, of all the core domains, stored in compacted topics, and those stream processing jobs use those topics to rebuild their state when they need to. So that means that whenever somebody deploys a new version of a stream processing job, they can just say, consume this topic from this point in time, like, I want one year's worth of data. In the reset self-service, they will just say, yeah, this is the beginning, I want to set my topic offsets there, and then the stream processing job will start consuming from that point in time.
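The point-in-time reset described here boils down to finding the first offset at or after a chosen timestamp and consuming from there, which is the same idea the real Kafka consumer exposes as `offsetsForTimes()`. A plain-Java sketch of the concept, with hypothetical types and the simplifying assumption that offset order matches timestamp order within a partition:

```java
import java.util.List;

// Hypothetical sketch of resetting a consumer to a point in time:
// scan a partition's records and find the first offset whose timestamp
// is at or after the target, then consume from that offset onward.
public class OffsetResetSketch {
    record Record(long offset, long timestampMs, String value) {}

    // Returns the offset to start consuming from, or -1 if every record
    // is older than the target (nothing to replay).
    static long offsetForTime(List<Record> partition, long targetMs) {
        for (Record r : partition) {
            if (r.timestampMs() >= targetMs) {
                return r.offset();
            }
        }
        return -1;
    }

    static List<Record> consumeFrom(List<Record> partition, long startOffset) {
        return partition.stream()
                        .filter(r -> r.offset() >= startOffset)
                        .toList();
    }
}
```

A real implementation would ask the broker for the offset per partition rather than scanning records, but the "one year's worth of data" reset Levani describes is exactly this lookup followed by a seek.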
Nice. Now I know you're in the money transfer business across countries and currencies, and that's probably slightly regulated. One or two regulations, right? And that can make the cloud complex. Any thoughts about managed services in the future? Is that a thing that you can think about or not?
Right now, at the moment, I think not, because of the way we implemented security around our Kafka infrastructure. We have quite a unique security setup, and we are kind of proud of it as well, because what we do is that we have integrated Kafka authentication with the service mesh, meaning that when a service authenticates with Kafka through mutual TLS, it will get the same identity as it would have authenticating with other services through an RPC call, for example. As a result, we have dynamic certificate rotations and all this stuff. And it's all very tightly integrated within our service mesh. So that comes with the cost of having a bit of a unique setup. It has its benefits, but on the other hand, it also comes with the cost that it's harder to migrate to something else if we ever decide to. But so far it's been working very well.
If you had one word of advice that you could give to somebody who was looking at the same set of problems, a few hundred microservices, maybe some large stateful jobs, what would that advice be if you had it to do over again, and you wanted that to be smoother? What would you say?
I think this, I guess, applies more to companies that work with autonomous teams. The mistake that we made in the beginning was not delegating ownership of the production of the source events to the product teams, because we tried to be the team that sources all the events and then replicates them into our streaming Kafka cluster. But obviously, that comes with big problems, because we then have to track all the changes that are happening in the source domain, which is of course not feasible. We changed this approach... Not a few years ago, sorry, a year and a half ago, something like that, where we now delegate the ownership. So when some microservice comes up in a core domain, for example, we are launching a new investment product now.
And one of the things that they have to do as well is to produce their stream of events into the streaming Kafka cluster, with all the history, so that others can build the necessary stateful processing on top of it. So now we've offloaded it to the product teams, so they can move at their own pace. So I think figuring out that part, of how to deal with the source data, and building the proper foundation for this, is very crucial. It enabled us to build so many stream processing jobs just by having all the events in the Kafka cluster, and rebuilding all the stream processing jobs based on that was a huge win. But of course, having proper automation for actually sourcing those events into the cluster, with a public representation of the domain, is very important. And this is almost foundational work.
It sounds data meshy.
We were doing data mesh before we knew it was data mesh.
My guest today's been Levani Kokhreidze. Levani, thanks for being a part of Streaming Audio.
Thank you. Goodbye.
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and the core Kafka APIs. There's even an index of episodes of this podcast. So if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at TL Berglund on Twitter. That's T-L B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community Slack or forum. Both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel, and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support, and we'll see you next time.
What’s it like building a stream processing platform with around 300 stateful stream processing applications based on Kafka Streams? Levani Kokhreidze (Principal Engineer, Wise) shares his experience building such a platform that the business depends on for multi-currency movements across the globe. He explains how his team uses Kafka Streams for real-time money transfers at Wise, a fintech organization that facilitates international currency transfers for 11 million customers.
Getting to this point and expanding the stream processing platform is not, however, without its challenges. One of the major challenges at Wise is to aggregate, join, and process real-time event streams to transfer currency instantly. To accomplish this, Wise relies on Apache Kafka® as an event broker, as well as Kafka Streams, the accompanying Java stream processing library. Kafka Streams lets you build event-driven microservices for processing streams, which can then be deployed alongside the Kafka cluster of your choice. Wise also uses the Interactive Queries feature in Kafka Streams to query internal application state at runtime.
The Wise stream processing platform has gradually moved them away from a monolithic architecture to an event-driven microservices model with around 400 total microservices working together. This has given Wise the ability to independently shape and scale each service to better serve evolving business needs. Their stream processing platform includes a domain-specific language (DSL) that provides libraries and tooling, such as Docker images for building your own stream processing applications with governance. With this approach, Wise is able to store 50 TB of stateful data based on Kafka Streams running in Kubernetes.
Levani shares his own experiences in this journey with you and provides you with guidance that may help you follow in Wise’s footsteps. He covers how to properly delegate ownership and responsibilities for sourcing events from existing data stores, and outlines some of the pitfalls they encountered along the way. To cap it all off, Levani also shares some important lessons in organization and technology, with some best practices to keep in mind.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.