What's the scariest word in the whole of computing? It might be caching. It might be NullPointerException. Is that a word? That's one word. I'm going to say that's one word. NullPointerException is a very scary word. But I think possibly one of the scariest words is migration, that process of moving the whole company from one system to another without falling off the tightrope and dying.
In this episode of the Streaming Audio Podcast, I'm joined by Dima Kalashnikov, who has been very carefully and very successfully migrating the online grocery store Picnic from AWS Kinesis to Confluent Cloud. We talk about the hows and whys of that transformation and what's next for the future of their business.
But before we start, let me tell you that the Streaming Audio podcast is brought to you by Confluent Developer, which is our site that teaches you everything we know about Apache Kafka, from how to start it running and write your first app all the way up to the realities of architecting production systems at scale. Check it out at developer.confluent.io. And if you want to take one of our hands-on courses you'll find there, you can easily get Kafka running using Confluent Cloud. Sign up with the code PODCAST100 and we'll give you $100 of extra free credit to get you started. And with that, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
My guest today is Dima Kalashnikov, who's the technical lead at Picnic, a company over in Amsterdam that I'm hoping to visit soon actually. Dima, welcome to the show.
Yes. Hello, everyone. Very pleased to be here and talk today with you about our migration from AWS Kinesis to Confluent Cloud.
That must have been a long journey. Before we get started, tell everyone what Picnic does. What's the actual real-world problem you're trying to solve out there?
Yes. Sure. So Picnic is an online grocery supermarket, and we are striving for best-in-class delivery to all families in the Netherlands, France, and Germany. We are very keen on sustainability: all our vehicles are electric, and we strive for the greenest process possible. At the same time, we ensure that our customers have the lowest price guarantee and can get groceries delivered to their homes at a time that really suits them. This is what we're doing.
Right. And you're entirely online. There are no physical shops. It's all delivery.
No. That's completely online. That's it.
Right. So very tech-heavy company-
Yes. Yeah.
... I imagine. So how long have you been working there, by the way?
I've been working here for around one year and eight months or something like this.
Okay. Okay. That's fairly long in internet years.
Yes indeed.
So you came in presumably... Actually, before we get into that, tell me where event streaming fits into your particular business.
Yeah. So at Picnic we are trying to be a very, very data-driven company and make all our decisions based on what is going on in the real world, how our customers are using our applications, and how our processes are running in the data warehouses, during distribution computation, and during picking in the warehouse itself. So we are collecting events on everything that is happening within the company.
For instance, when new stock arrives at the warehouse, we automatically receive events on it, and then our analysts can analyze it. We also analyze what people see in the applications, and how they click and interact with them, in order to improve the usability of the application and to see if something went wrong so we can improve it. We do similar things with the applications for our delivery people, who are driving electric vehicles across the country, not only to help them, but also, for instance, to encourage safer driving practices, as it's called. So...
Oh, okay.
Yeah. We are very much for safety. We want to avoid any incidents on the road or anything bad happening, so we promote safe techniques and track it.
So it's... I really find it interesting when a system crosses all those different domains, from logistics to UX to driving safety. That's a new one on me. So you came into this company and they were already an event streaming, data-first company, I assume? And you were using Kinesis, right? AWS Kinesis.
Yes, indeed. In the first place we were using AWS Kinesis, and this system was set up long before I arrived at the company. Maybe almost from the beginning of it.
Okay. And so how... So I know it changed. So what was your experience of that in the early days and what made you think something has to change here?
Yeah. Sure. So when I first arrived, things seemed to be fine in most places. But Picnic is growing rapidly, and when we started to scale up there was more and more demand on what we should track, how we should analyze it, and how we wanted the analytics platform within the company to look. Unfortunately, we started to face lots of issues and limitations with Kinesis.
So one of the major ones is that Kinesis only keeps data for one week within the stream. That seems fine in most cases, but not for our situation long term. First of all, we want to do in-stream data analysis, and sometimes you need lots of data from the past to draw a conclusion about what is going on now or in the future. That's the first limitation. It also doesn't help with a safety net: we have seen a few notorious incidents in our pipeline, the kind that sometimes happen, and if you can keep data for more than one week, you avoid lots and lots of manual labor to recover, because we strive for zero data loss in all situations. Not a single event can be lost, so safety nets are really crucial for us.
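For context, on the Kafka side retention is just a per-topic setting that can be raised far beyond a week. Here is a minimal sketch using Java's AdminClient, assuming a hypothetical topic name and a local broker; the 30-day figure is illustrative, not Picnic's actual configuration:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic; raise its retention to 30 days, well past
            // the one-week ceiling Dima describes hitting with Kinesis.
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "customer-events");
            AlterConfigOp raiseRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(raiseRetention)))
                 .all().get();
        }
    }
}
```

The same setting can also be applied at topic creation time, or through a managed service's UI or APIs.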
Yeah. That seems fair to me.
One more thing that was problematic is that there is unfortunately not much of an ecosystem around Kinesis. When we want to do something, we almost always have to develop it ourselves, and there are not many open source technologies around Kinesis beyond what AWS itself provides-
Oh, really? Okay.
In my opinion, it's really not as thriving an ecosystem as, say, Kafka's, so...
What kind of tools are we talking about? What kind of things did you need to build yourself?
Yeah. First of all, we really wanted something like Apache Kafka Connect, where things are already in place to grab data from one source, move it into the stream, and then onward to our side, because we don't want to develop many connectors ourselves. Our team is really small at the moment, and we cannot afford the huge long-term development burden of maintaining lots of components ourselves. That's one thing. Another thing we would really love to see on top of Kinesis is, for example, something like ksqlDB, because there is definitely more demand for in-stream data analysis in the foreseeable future at Picnic, and we really don't want to develop something like ksqlDB on our own. It would require too much effort, I guess, even from our company.
As much fun as it would be to write a SQL engine from scratch as a hobby project, you probably don't want to do it at work and take it into production. Out of interest, which language or languages are you mostly writing this stuff in at Picnic?
At Picnic, Java is the primary language, and nowadays Python is also a first-class citizen. So Java is the primary language for development on the backend.
Okay. So you would definitely have access to Kafka Streams, but there's still this... I find it quite nice that KSQL is like Kafka Streams you can deploy in seconds and play with really quickly and easily.
Yeah. And the great thing about KSQL is that since it's just plain SQL, you can give access to analysts, and all our analysts know how to write SQL. This is what we are striving for: business analysts should know how to do it, and in that case they can also play with the data streams themselves. That's why it's really important for us to have SQL-like tooling.
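To give a flavor of that, this is roughly the kind of push query an analyst could type against a stream; the stream and column names here are invented for illustration:

```sql
-- Watch failed orders in real time (hypothetical stream and columns).
SELECT order_id, customer_id, failure_reason
FROM order_events
WHERE status = 'FAILED'
EMIT CHANGES;
```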
Yeah. Yeah. In my career I've not always met business analysts that know SQL, but when they do, it's just like, "Oh, this is going to be great."
Yes. That's right.
So, okay. So add all that up, but it's still a big job to say, "We need to migrate to a different event streaming platform." Take me through that point when you said, "We got to bite the bullet."
Yeah. Sure. So that's, of course, not the only thing that drove us to migrate. One very important thing we wanted to have is better monitoring of SLAs within our platform. As I've already said, we strive for zero data loss, but we also strive for all events being delivered to the data warehouse within 15 minutes, which is not a very complicated SLA, but still we need to track it.
And we found that the monitoring API of AWS Kinesis is not that suitable if you don't use CloudWatch, but CloudWatch hasn't truly been adopted within Picnic because we're using Prometheus for our monitoring, for example, which makes things a bit complicated. And it's really hard to see where your consumers are in the streams, because for those positions you need to go to DynamoDB and somehow figure out how to decode all the consumer positions stored there, which is not very convenient. With Kafka, it's a snap: you get your consumer positions instantly and can monitor them. It's very easy to do and very suitable.
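To make the contrast concrete, here is a minimal Java sketch of reading a consumer group's committed positions with Kafka's AdminClient; the broker address and group id are placeholders:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerPositions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // One call returns every committed position for the group --
            // no side trip to DynamoDB. "analytics-loader" is a made-up group id.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                admin.listConsumerGroupOffsets("analytics-loader")
                     .partitionsToOffsetAndMetadata()
                     .get();
            offsets.forEach((tp, om) ->
                System.out.printf("%s -> offset %d%n", tp, om.offset()));
        }
    }
}
```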
And besides, Confluent Cloud provides a great monitoring API, which nowadays automatically provides you an endpoint to export metrics to Prometheus. We couldn't be happier with that. So this is definitely one thing. On top of it, I think that with Kinesis in particular, we found it a bit hard to scale horizontally.
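Before moving on, here's roughly what scraping that metrics export endpoint looks like from Prometheus. The cluster id and credentials are placeholders, and the exact parameters may differ from the current Confluent Cloud documentation:

```yaml
scrape_configs:
  - job_name: confluent-cloud
    scrape_interval: 60s
    scheme: https
    metrics_path: /v2/metrics/cloud/export
    static_configs:
      - targets: ["api.telemetry.confluent.cloud"]
    params:
      "resource.kafka.id": ["lkc-XXXXX"]   # placeholder cluster id
    basic_auth:
      username: CLOUD_API_KEY              # placeholder API key
      password: CLOUD_API_SECRET           # placeholder API secret
```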
First of all, the rules on shard scaling are a bit obscure, let me be honest. In practice, they sometimes don't match what AWS support told us, which is quite uncomfortable. And we really wanted seamless horizontal scaling, where we can just add consumers and producers, scale horizontally, let them distribute themselves across all the shards or partitions or whatever you have in your hands, and not have to bother with manual assignment or manual work. With Kinesis, we would need to develop something ourselves, like: okay, a new consumer is up, now we need to redistribute shards across the different consumers. And that's not so easy to do.
Oh. Because, I don't know a huge amount about AWS Kinesis, it doesn't have that kind of automatic rebalancing?
At least when we were working with it, we hadn't managed to achieve that, especially together with the Java Spring Framework. Since we're using Java and Spring, we would really love for Spring to be able to do that for us, for instance, but we didn't manage to achieve it, so that also became a sort of problem. And vertical scaling has its own limits.
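For what it's worth, this is the behavior Kafka consumer groups give you for free, and Spring for Apache Kafka exposes it with a single annotation. A minimal sketch, with hypothetical topic and group names:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class WarehouseEventListener {

    // Each listener container joins the consumer group "analytics-platform";
    // the broker assigns partitions and rebalances them automatically whenever
    // an instance starts or stops. No manual shard redistribution needed.
    @KafkaListener(topics = "warehouse-events", groupId = "analytics-platform", concurrency = "3")
    public void onEvent(String event) {
        System.out.println("received: " + event);
    }
}
```

Start a second instance of the application and the broker rebalances partitions between them; stop one and the partitions flow back, with no redistribution code to write.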
Okay. Yeah. Okay. So... It's funny, isn't it? How often it's not the pure technology itself that makes the difference. It's the integration with other things and the day-to-day maintenance of it that's more important than the core.
Yeah. That's totally right. That's why I also mentioned the ecosystem already. The technologies around the core should also give you lots of value, provide easy integration, always be up to date, and do what you want them to do. And this wasn't always the case with AWS Kinesis.
Okay. So putting on your project management hat: you've decided to move to a different... You've decided to move to Confluent Cloud, right? How do you actually tackle that? Because that's a project that, even with the best of intentions, can still go wrong. How did you make a success of it?
Oh, well, that's a good story to tell. So we started with: where should we go from AWS Kinesis? You had one technology, but how do you choose where to go? It seems like there are lots of platforms that do the same thing. At Picnic we have what we call a tech radar: we register all the technologies we want to use, or that are on the market and seem feasible for Picnic, on this tech radar where we track them. Then we start the process: first we assess the technology, just reading the documentation and looking at the community and what is going on; then we do a trial; and only then do we adopt the technology.
And this is a collective decision by people within the company who can participate in discussions about the technology itself, how suitable it is, and everything around it. So we originally picked up Pulsar as well, for instance, and we looked into both Pulsar and Kafka. After some deliberation over the technologies, Kafka was definitely the winner for us because of its thriving, flourishing community and, in our opinion, much more mature ecosystem. So, okay, the selection process was done. But then, at the next stage: at Picnic we really don't want to run many things ourselves if there is a cloud option.
Yeah.
Because-
Because you've got a small team, right?
Yeah. Well, nowadays, including the product owner, me and the developers, it's four people. But back then it was-
Four people?
Yeah. Four. But back when we started the migration, it was just the three of us. So it's not really much.
Gee. I'm surprised. I mean, if you're covering... You said you're covering the Netherlands, France and Germany-
Yeah.
... with a heavily data-driven business and there are four of you.
Yeah. But it's-
Yeah. Okay. I can see why you're not managing too much in-house.
So, yeah. To avoid managing things in-house, we needed a cloud provider, and that's the next stage: who should we go with? Having decided to go with Kafka, we again evaluated different providers, like MSK, Confluent Cloud and a few others, and we decided to go with Confluent Cloud. First of all, because Confluent Cloud is really feature-rich. We have managed connectors, so we don't need to keep them in mind. We have zero maintenance burden: it just runs. I don't need to upgrade anything; I can just deploy my things and not worry about them. There is an API with which I can automate things and, of course, the monitoring API, which really suits us. So it was an almost completely easy decision for us to go with Confluent Cloud in this situation.
Okay.
So, then we decided to go with Confluent Cloud. Okay. We then needed to run a few PoCs with the technologies we had in hand, to see that things would run smoothly, because you don't want to find something bad exploding in production and giving you sleepless nights. So...
Yeah. Start with the pilot, right?
Yes. We decided to do quite an extensive pilot, running production-like things in it, so we could assess costs and assess how our technologies integrate, because we also use Snowplow, for instance, which is at the same time a framework and a technology, for tracking on the customer application side and in our delivery application. And it took us a while to play with Kafka Connect, connect Snowplow to everything, and run a production-like workload to see how performance was, whether anything was wrong with costs, or whether something wasn't going right. And of course, once we were done with the PoC, that wasn't the end of the story. Then we needed to move on to the next level with Confluent Cloud.
Yeah. So before we get off that, I just... So the proof of concept, was that an existing use case that you sort of ran in parallel or was that a new project you took on and tried?
We ran an existing use case in parallel, to compare everything. And this is really important for the production environment too, because we ran the two systems in parallel for quite a while: we had Kinesis running alongside Confluent Cloud while we were doing the migration. I think this is a good approach for such large-scale migrations, because first of all, you don't need to rush the migration. Yes, you may be spending more, but most importantly, and this is what I always think about our analytics platform: we are only as good as we are stable, and only as good as our zero data loss, because that is the most crucial thing. If the data is bad, who will use it? It's no good to anyone.
That's why we decided to run the two systems in parallel and set strict acceptance criteria: there should be no data loss when we compare the two systems, and there should be no performance degradation. And of course, when you're migrating, you're still quite new to the technology, so you might misconfigure something or do something not so good, like setting up too few Kafka partitions, so that you cannot scale to the right level.
And that's why we started to do health checks and sanity checks at the database level, to compare the two streams and see: okay, we're not missing anything; the Confluent Cloud setup is now working better than Kinesis in terms of performance; we've avoided data loss; so we're good, we can move. And we did that separately for the customer app side, because that's quite a separate stream from what's happening in internal systems like warehousing, delivery, picking, logistics, whatever.
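A database-level sanity check of that sort can be as simple as the query below, comparing the daily event counts landed by each pipeline; the warehouse table names are invented for illustration:

```sql
-- Flag any day where the old and new pipelines landed different
-- event counts (hypothetical warehouse tables, one per pipeline).
SELECT k.event_date,
       k.event_count AS kinesis_count,
       c.event_count AS kafka_count
FROM   daily_counts_kinesis k
JOIN   daily_counts_kafka   c ON c.event_date = k.event_date
WHERE  k.event_count <> c.event_count
ORDER  BY k.event_date;
```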
Right. Yeah.
And we started to move things over one by one. And even when it's finally done and you feel like, okay, we're good, nothing is failing, you still watch it for a while longer to make sure you really haven't missed anything. If it was running stably yesterday and today and tomorrow, then, yes, for sure, you're good to migrate. Because sometimes I've seen things happen in our pipeline where everything looks good and then, oop, it snaps, and two days later something has gone badly and you need to restart it.
So you really took your time de-risking this?
Yes. We really took a lot of time de-risking it, and from the pilot step to the actual final cut of the migration, it took us half a year to make things stable, to ensure everything ran smoothly and everybody was happy. And this didn't involve only our side, the analytics platform. It also took great effort from the data warehouse team, because they actually need to use this data, and the format changed a bit: we needed to set up a new set of tables and propagate these metrics. So there was huge involvement from their side. We also needed a lot of support from our infrastructure team, which helped us set up some common things in the Picnic infrastructure and decommission the old Kinesis setup. So, yeah, there were lots of moving parts and lots of collaboration with other teams to make it happen. It's not a single-team process.
Right? Yeah. How did they feel about it by the time you'd switched over?
It felt so great, because we finally got it done. You don't need to worry about two systems, and we don't need to pay twice for things, which is, of course, great.
Yeah. Yeah. It's worth paying for a while to take the risk out of it, but you don't want to do that forever.
Yes indeed.
Okay. So when was that? You say it took about six months to do. How long ago did you reach that point?
Well...
How long have you been just... Confluent Cloud?
That's a good point. So for our warehousing systems and all internal systems, everything that is not customer-facing, we reached this point in January. So it was really recently that we did the final cut.
Okay. So that's about a month and a half ago, time of recording.
Yeah, I guess so. Something more like just a month. For the customer-facing part, we have been running all data through Kafka as of last week. That was the final cut, I guess.
Oh, crikey.
Well, the stream from the customer side is an order of magnitude bigger than the one from the internal systems. It's 45 million events per day, and for internal systems it's four or five million events per day, which is-
Is that because you're capturing a lot more fine-grained user interaction stuff or-
Yes. And of course, Picnic is growing. There are more customers, so they produce more events-
Good.
... long term. Yeah. This is great. But it also means that we need to take much more care with the setup, make it stable, and make sure that analysts are happy and everything runs smoothly. So, yeah. At the moment we've only got a few pieces of the old setup left to demolish, but that's the last part to do.
So what else? Apart from demolishing old things, has this opened up new ideas, new plans for going forward, or...?
Yeah. It opened up... Confluent Cloud opened up many new paths for us to take this year. At the moment we are running a new pilot, on Confluent Cloud only, for our automated fulfillment center. In Picnic we have fulfillment centers, which are sort of big distribution centers, and smaller local hubs across the country from which we actually do the deliveries. A normal fulfillment center means lots of deliveries and quite a lot of manual work from our amazing Picnic team, and we wanted to automate it to a greater extent. So we built from scratch a fully automated warehousing system, where there are lots of robots moving totes with products to make up the deliveries. And this is amazing.
I can say it's the most brilliant project I've seen in my life, how it works and how it was built. But we also need to deliver data from it to our systems, and you can imagine that robotic systems constantly produce lots of events: conveyors are moving, someone is picking, something is going on, temperature readings in different zones, things like that.
Something's going wrong there and... Yeah. Yeah.
Yeah. And that's why we are building, for instance, a system to capture events from the transport system in this automated fulfillment center, and we're doing it with Debezium and Confluent Cloud only. At the moment we're on the verge of the final cutover for the trial process, but we're okay: everything looks good, and we can move forward and start to think about the production setup. So that's one of the opportunities we got with Confluent Cloud. Another, I think, is that we really want to try ksqlDB long term, as I already mentioned. There are a few use cases in the company already. I can imagine, for instance, that we could use it to detect malicious behavior in events, for security reasons, because if someone is-
Actually what kind of malicious behavior would you expect?
Oh, for instance, if hackers take control of someone's phone or someone's application, you need to detect this to protect your customers. And of course, it means that something will show up in the data, like someone changing their payment details or delivery address too frequently, things like that. So you can spot it and notify the respective team: okay, something's going on, let's take care of it. This is at an early ideation phase, of course, but previously, without Confluent Cloud, it wouldn't even have been possible for us to ideate on.
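To make that concrete, a rule like "too many payment-detail changes in a short window" is only a few lines of ksqlDB. A hypothetical sketch, with invented stream name, columns, and threshold:

```sql
-- Hypothetical: flag customers who change their payment details
-- more than three times within an hour.
CREATE TABLE suspicious_payment_changes AS
  SELECT customer_id, COUNT(*) AS changes
  FROM account_change_events
  WINDOW TUMBLING (SIZE 1 HOUR)
  WHERE field_changed = 'payment_details'
  GROUP BY customer_id
  HAVING COUNT(*) > 3;
```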
It's one of those use cases where actually figuring out what the logic is isn't that hard, but implementing it can be easy or a real pain depending on the technology. Right?
Yeah. That's totally right. Because ideation is always possible to do, but if you cannot implement that thing, then there is no point in ideation.
That's the trick. Yeah. I must admit, I thought when you first said malicious events, you were talking about the robots going rogue.
Well, hopefully this will never be the case, but...
Yeah. I'm reading too much sci-fi, clearly.
We can still detect something going wrong, for instance in the automated fulfillment center, with similar analysis. So this idea and approach can be implemented in many places.
Mm-hmm (affirmative). So does this mean that you're... You said your customers generate more events than your internal systems, but presumably the robots now are going to catch up and you'll have two towers of events to work with.
That's indeed right, because the events from the automated fulfillment center will be almost as big as the events from customers. This is why we needed a great cloud solution, because as you can see, as more projects come in, the data isn't growing by 10% per month, or even 50%; it has doubled or tripled in some scenarios. So indeed, you definitely need a good cloud solution to tackle this problem.
Yeah. Are you finding that it's actually happening, that other departments are dipping in with KSQL and running their own private analyses, or is that still something you're teaching them about?
It's not really happening yet, to be honest, but we are planning to do more talks on it and introduce the technology to our machine learning team and our data analyst team. So these things are just starting to happen in the company. Not yet, but I can definitely foresee it in the future.
Yeah. Yeah. It's like... Again it's not just the core technology, it's the integration, it's the dissemination of knowledge, understanding what's possible, right?
Yes. That's true-
True. And of course, support-
I think I sometimes lose sight of that in the technical world. And support, yeah. It's just this whole big package that we'd love to believe is just programming, but it's a big ball of life.
Yes.
Or our handling of it.
That's totally right. Because if you don't have great support for your business, for instance for your streams and platforms, it's really hard for developers, analysts and everyone else to solve problems, and that means you will always be falling behind. And with Confluent, of course, you get support. People are always there; you can ask them all your questions, not only about Confluent Cloud, but about Kafka, about things close to it, about similar technologies that use Kafka, and they always reply and try to solve the problem. We've seen some amazing improvements thanks to the support we've had. So, yeah. Don't forget about support. It's really crucial.
Yeah. So I believe we're going to meet up in person in Amsterdam next month at the time of recording. Is there any chance I can get a tour of the robot factory?
I would be very pleased to give you such a tour, and we will definitely ask around to see if we can arrange it. But I'm very much looking forward to our meetup, which we've planned for March, so we can share with more people what is going on, and why they should also think about Confluent Cloud and technologies like Kafka.
Mm-hmm (affirmative). Yeah. I'm looking forward to it. I will see you then, but for now, thank you very much for joining us, Dima. It's been a pleasure.
Thank you so much too. The pleasure is totally mine.
And with that, we leave Dima to manage his fleet of robots. That sounds like fun. Well, they are hiring. So if you fancy being the fifth person on that team, check the show notes. We'll put in a link to their jobs page. Now if you're piloting, trialing or looking to put Kafka into production, let me tell you that you can easily get started with Confluent Cloud by going to confluent.cloud. And if you register with the code PODCAST100 you'll get $100 of extra free credit to explore with.
Meanwhile, if you want to learn more about Confluent Cloud, Kafka or event streaming in general, head on over to developer.confluent.io, where we'll teach you everything we know about making a success of event-driven architectures.
If you have any questions or comments about today's show, we'd love for you to get in touch. If you're watching this on YouTube, there is a comment box and thumbs to click. Spotify has stars and it's the fifth one on the right that you're looking for. And my Twitter handle is always in the show notes. We'd love to hear from you. And with that, it just remains for me to thank Dima Kalashnikov for joining us and you for listening. I've been your host, Kris Jenkins, and we'll catch you next time.
What are useful practices for migrating a system to Apache Kafka® and Confluent Cloud, and why use Confluent to modernize your architecture?
Dima Kalashnikov (Technical Lead, Picnic Technologies) is part of a small analytics platform team at Picnic, an online-only, European grocery store that processes around 45 million customer events and five million internal events daily. An underlying goal at Picnic is to try and make decisions as data-driven as possible, so Dima's team collects events on all aspects of the company—from new stock arriving at the warehouse, to customer behavior on their websites, to statistics related to delivery trucks. Data is sent to internal systems and to a data warehouse.
Picnic recently migrated from their existing solution, AWS Kinesis, to Confluent Cloud for several reasons:

- Kinesis only retained stream data for one week, which limited both in-stream analysis and the safety net needed for Picnic's zero-data-loss goal
- The ecosystem around Kinesis is small, with no equivalents of Kafka Connect or ksqlDB, so tooling had to be built in-house
- Monitoring SLAs was awkward: consumer positions live in DynamoDB, and CloudWatch doesn't fit Picnic's Prometheus-based monitoring, whereas Confluent Cloud's metrics API exports straight to Prometheus
- Scaling Kinesis horizontally required manual shard redistribution, while Kafka consumer groups rebalance automatically
Dima's team was extremely careful and took their time with the migration. They ran a pilot system simultaneously with the old system in order to make sure it could achieve their fundamental performance goals: complete stability, zero data loss, and no performance degradation. They also wanted to assess costs.
The pilot was successful, and they actually have a second, IoT pilot in the works that uses Confluent Cloud and Debezium to track the robotics data emanating from their automated fulfillment center. And it's a lot of data: Dima mentions that the robots in the center generate data sets almost as large as their customer event streams.