If you're a programmer, you probably use Git for source control these days. I think Git won that argument fairly convincingly a few years back, because what Git does at its heart, it does really well. The command line for it can be a bit weird at times, I'll admit that. But at the core, it has this really well-thought-out data model based on Merkle trees that just naturally pops out all these useful features, like cheap branching and atomic commits, and rolling back, and merging other changes in, and point-in-time recovery, and reliable tagging. All this great stuff just pops out as a consequence of how it's designed internally.
And I've long been a fan of the design of Git. I encourage you to look it up. We could almost do a whole podcast on Git's internals. Diagrams might be a bit hard, but we could almost do it. But today we're asking a slightly different question. If Git's approach to data is so useful, why aren't we using it in other places? We're using it for source control, sure, but are there other applications?
Well, joining me today with an answer to that question is Adi Polak, who's the head of developer relations at lakeFS, and I'm going to let her tell you how they've been taking a Git-like approach to dealing with massive datasets for the sake of development and testing, quality assurance, a bit of chaos engineering, and more. This is a really fun one to record, because we did it in person for a change. Adi and I were both at Current in Texas last October. And I think I'm going to tell you more about my time at Current at the end, when I will also tell you more about the fact that Streaming Audio is brought to you by Confluent Developer, our education site for Kafka. But for now, travel with me backstage to a small room in the Austin Convention Center to learn from Adi. I'm your host Kris Jenkins. This is Streaming Audio. Let's get into it.
Joining me on today's Streaming Audio is Adi Polak. Welcome.
Thank you so much for having me today, Kris.
Oh, it's a pleasure to have you here. I've been excited to have you on for a while now, because we're doing some stuff here at Current. We're doing a panel and things like that. You are, let's get this right, you are a developer advocate at lakeFS, right? Which is where we get this super cute axolotl guy, little plushie mascot. What do you do at lakeFS?
Well, I am responsible for the devex team, so we're doing advocacy, [inaudible 00:02:41], and most importantly building a technical community around lakeFS, which is the open source project we're advocating for.
I know what it's like being kept busy, community stuff and trying to build and explain and all that kind of thing. Right. But what do lakeFS do? I have this impression it's something to do with data lakes, right? But it's a lake file system, that's where it gets the name?
Yes. So lakeFS, FS stands for file system, like you mentioned. And under the hood, what it really does is give you data version control, data versioning capabilities, for all your data regardless of the format of the files. So you can benefit from an experience very similar to Git, which has version control, you can benefit from that exact experience and those features, but for data.
So if I have an object store data lake [inaudible 00:03:39], Azure Blob, GCP, I can create a version of my data without copying all my data, so it's basically a shallow copy, which means I'm creating pointers and metadata for my files.
So [inaudible 00:03:59] of Git, like I can take this point-in-time snapshot of my data and then say: you need to go and take a copy of this, here's some ID, make sure you get exactly the same data on your system?
Yes.
Like that?
Yeah, exactly. So it comes from the database world, because you mentioned snapshots. Snapshots were very strong in the database world when we were managing, for example, an Oracle database. So it's a similar experience. Instead of calling it a snapshot, we're calling it a branch, and there's a good reason for it. What we realized is that developers are much more keen on the Git terminology, so instead of saying snapshots or copies, we're branching out of our main branch of data. And we also added capabilities that are very much Git capabilities. So after we branch out of the data, we have commits. We can do a commit for every write that we do to the data, and we can do merges, if you want to merge back.
So you can branch in new pieces of data, you can presumably share clones, and go forwards and roll back, and all those kinds of metaphors.
Exactly.
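To make that "shallow copy" idea concrete, here is a minimal, purely illustrative Python sketch of branching as a copy of pointers rather than of data. This is not lakeFS's actual implementation; the repository layout, paths, and object IDs are invented for the example.

```python
import copy

# A toy model of metadata-only branching: each branch maps logical paths
# to object IDs in the underlying store. The objects themselves are never
# copied; only this small pointer table is.
branches = {
    "main": {
        "events/2022-10-01.parquet": "obj-8f3a",
        "events/2022-10-02.parquet": "obj-91bc",
    }
}

def create_branch(source: str, name: str) -> None:
    """Branching copies the pointer table, not the petabytes behind it."""
    branches[name] = copy.deepcopy(branches[source])

create_branch("main", "experiment")

# Writes on the branch only touch the branch's pointers...
branches["experiment"]["events/2022-10-02.parquet"] = "obj-aa07"

# ...so main is untouched, and the "copy" cost is proportional to metadata.
assert branches["main"]["events/2022-10-02.parquet"] == "obj-91bc"
print(branches["experiment"])
```

The point is that creating the branch costs only as much as the metadata, no matter how large the objects behind the pointers are.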
Okay. So the question is going to come up, can I use Git for that? Where is it not Git-like?
That's a really good question. So the core engine and everything behind it is not Git. And the reason for that is that Git is not able to store large amounts of data.
Define large. Are we talking-
Petabyte.
Petabytes.
A petabyte of data basically. A data lake. When we're storing source code, we're talking about megabytes. It's relatively small. Sometimes binaries can be a little bit larger, because of the whole process that goes on to produce the binaries.
Mostly it's nicely compressible text.
Mostly, yeah. And so this is great for Git. However, when we want to work with data, then we need better systems, we need better tools. And so while it's a similar experience and similar terminology, behind the scenes the model is actually very different.
Okay, tell me a bit about that. I love Git's data model, I could nerd out about that, but I'll refrain. But tell me a bit about your data model, [inaudible 00:06:20].
Yes. It's a really cool, innovative data model. Basically, behind the scenes there's a cryptographic tree. Do you know B+ trees?
Yeah, I know B+ trees and I know Merkle trees.
Yeah, so it's a Merkle tree. A Merkle tree is a cryptographic tree. So basically what it knows about the data is whether something got changed, because it stores hashes. We hash the file, and then we can know if something in the file got changed. And we don't care about what got changed, because we're looking at a binary at the end of the day. So if anything changed with the data, it's because we did some copy-on-write operation, because something got changed with the data on storage. So taking a step back, when we think about compute and streaming, we have data that is on storage, and then we have data in RAM, right? In the machine memory itself, for the compute. So after writing the data back to our storage, this is when copy-on-write basically takes place. So this is when your Merkle tree is being updated. Something happened with the storage, with the way it's represented on disk, and so this is when we know this file got changed, something happened. It's either uncommitted, and then we can commit it, or one of the different operations that happen along the way.
Okay, and so you're building up this ever larger tree of blocks that cryptographically link to other blocks.
Pointers, not blocks. Yeah. We don't maintain the files themselves, so we don't look at the content of the files. Only pointers, and we're saying, something here got changed.
Okay, so all the content is still backed by S3, the way you were always backing it. And what you're doing is building a clever index into all that data.
Exactly.
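As a rough illustration of the change-detection idea Adi describes (not lakeFS's real on-disk format), a Merkle tree stores a hash per file and then hashes those hashes upward, so any change to any file changes the root:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Pairwise-hash a level at a time until one root hash remains."""
    level = leaf_hashes
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd levels
            level = level + [level[-1]]
        level = [h((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]

files_v1 = [b"record-1", b"record-2", b"record-3", b"record-4"]
files_v2 = [b"record-1", b"record-2", b"record-3-changed", b"record-4"]

root_v1 = merkle_root([h(f) for f in files_v1])
root_v2 = merkle_root([h(f) for f in files_v2])

# The roots differ, so we know *something* changed without comparing the
# file contents themselves -- only hashes are stored and compared.
print(root_v1 != root_v2)  # True
```

Comparing roots is enough to know that something changed; walking down the tree narrows it to the affected subtree without ever reading the file contents.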
Okay, okay. I can see that. So that brings up... we've gone down into the rabbit hole a bit with the technology. Let's come back up. What are the kinds of data challenges people run into around this stuff that necessitate different ways of thinking about data warehousing and versioning? What problems are you actually solving for people?
Yeah, that's a really good one. So there's many challenges today in the data space, and one of the big ones is actually reproducibility. So we produce some results of the data, and then we have backfilling, or we need to reproduce the exact scenario because we need to troubleshoot something that happened [inaudible 00:08:48] data. So imagine I'm a system engineer, data engineer. I have production with tons of data that I'm processing. I know there was something going on with my end result, with the data product that I'm exposing to my customers. I know something bad happened. It's really hard for me to troubleshoot when and where in the system something got corrupted, a data record got corrupted, something along the system, because it's always changing. Our data is constantly in this lifecycle, and it evolves, right? I'm taking a record, I'm updating the record. When did I update, what exactly happened?
And so once I have version control, when I have versioning and I can roll back to the exact state, I'm able to reproduce the data the way that it came into the system, and the way it was processed, so I can actually find the problem and solve it. And that makes my system much stronger, and as an engineer it makes me much happier, because I'm able to find the problem, solve it, and hopefully won't have it ever again.
Yeah, because the best way to diagnose a problem is to step into your time machine and go back to when the problem happened, right? The second best way is to be able to reproduce it exactly as it was, right? Okay. So is it largely, in your mind, is it largely about the reproducibility of bugs? Or does it come into other areas?
So there are different use cases that customers use lakeFS for. So one of them is reproducibility, so again, troubleshooting. We also have customers that use it for FDA regulations, because-
FDA? Hang on, that's the Food and Drug Administration.
Yes.
Yeah, okay. Where does that come in?
So they're doing cancer research and cancer diagnostics, and as part of their job, with a lot of the processing they do on top of that data, they need to be able to reproduce the exact states of that data at different stages along the line. So instead of saving it manually or creating different scripts to save that data so they will be able to show the FDA, here, we're compliant, right? We have everything that we need. They're leveraging lakeFS, because it gives them a much smarter system to manage everything that they need in one place, so they won't have to build everything from scratch.
Yeah, so it brings in this idea of almost auditing, being able to audit the lifecycle of your data.
Yes, exactly.
Okay, that's interesting. Any more? [inaudible 00:11:21]?
Yes.
I bet you've got a whole host of them I'm going to go through.
Yeah. Another interesting use case that a lot of customers are using lakeFS for is actually a test environment. So we know that when we put together a data platform, there's a lot of moving parts in our system, right? We have streaming, we have event driven, we have some microservices on the side, we have someone that's doing batching for some different operations. We're connecting a lot of different platforms together, because we love open source and we love the ability to connect different things together and make them all work.
And our biggest challenge is actually developing confidence in the code. So let's say I'm working on legacy code, or I'm working on a completely new data pipeline that I'm building. I want to be able to test it in a way where it's clear that I have... I don't need 100%. Give me 70% confidence that this thing is going to work on production, on my data. And so what do we do today?
Usually we end up doing really dumb test datasets that are tiny and not really representative. That feels like the normal way, like whatever the developer that was writing the test happened to be bothered to type in at the time.
Yeah, and that's the good case. The more unfortunate scenario is people copy production data to their local machines. They sample it, and then they copy it to their local machines, which is actually a security breach many times.
Yeah, PII type stuff, right? Yeah.
Yeah. So no one manages it, there's no governance around it, I'm able to SSH into my machine or I have access to my S3 bucket. And so I'll just download a couple of files or something, figure it out, and hopefully that will give me some testing. Still not representative because it's a small set, but I can do that.
Yeah, and then you get to the point where every now and then someone emails a test email to all the customers, right? I've seen that happen. I've received that test email from developers that really shouldn't have gone out. Yeah.
Yeah.
Okay, so are you saying lakeFS deals with some of that, copying it and transforming it enough to anonymize it then?
So because it has a zero-copy mechanism, what a lot of our customers are doing is actually creating a branch, so they're doing a kind of zero-copy operation on top of the data. And they're able to test all their code there, so they actually create a staging environment for themselves that leverages production data, yet in an isolated place, so they won't harm production data. Nothing is being exposed to production. So that gives them the confidence, they're able to really test their code at the scale, volume, variety, velocity, throughput if they need to.
So you're saying what you're actually doing is, you're not transforming the data at all. You're just taking your own zero copy snapshot in an isolated environment that wouldn't be able to send the email out, and it can run, and it can even change that data, because it's just going to be thrown away. You've got your own branch.
Yeah.
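Here is a hedged sketch of that branch-as-test-environment workflow in Python. It assumes the lakeFS REST API's branch endpoints roughly as documented, and the server address, credentials, repository name, branch name, and run_pipeline helper are all placeholders; check the lakeFS API reference before relying on the exact paths.

```python
import requests

LAKEFS = "http://localhost:8000/api/v1"        # placeholder lakeFS server
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")  # placeholder credentials
REPO = "analytics"                             # placeholder repository

def run_pipeline(ref: str) -> bool:
    """Hypothetical stand-in for your existing pipeline, pointed at
    lakefs://analytics/<ref> instead of the production branch."""
    return True  # placeholder

# 1. Branch production data -- a metadata-only operation, so it's instant.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches",
    json={"name": "test-pipeline-refactor", "source": "main"},
    auth=AUTH,
).raise_for_status()

# 2. Point the pipeline at the branch: full production scale and variety,
#    but isolated -- nothing it writes can touch main.
ok = run_pipeline("test-pipeline-refactor")

# 3. Throw the environment away (or merge, if the run should be promoted).
requests.delete(
    f"{LAKEFS}/repositories/{REPO}/branches/test-pipeline-refactor",
    auth=AUTH,
).raise_for_status()
```

Because step 1 only copies metadata, the "staging environment" appears instantly, and deleting the branch afterwards discards any writes the test run made without touching main.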
Okay, yeah I can see that working. So let me see if I've got this list right. We've got regulation auditing, compliance type stuff. We've got, what was the first one you said?
Reproducibility, yeah.
Reproducibility for diagnostics testing. You got any more for me? I mean-
We talk about troubleshooting, right? [inaudible 00:14:48].
Yeah, troubleshooting, those are pretty big ones.
Yeah.
Okay, yeah. I can see how this works. I'm going to ask you a wildcard question quickly. Why an axolotl? How does that come into lakeFS as your mascot?
That's a fantastic question. So axolotl is a Mexican salamander.
Mexican salamander, okay.
They live in the Mexican lakes, and their sole job is to clean and organize the lake. Does that make sense?
Yeah, okay. Okay, that's actually quite clever, as well as your plushies being adorable. I'm sorry, if you're listening to this podcast, you can't see quite how adorable the lakeFS plushies are.
Yeah, it's definitely a cute one. And they did such a fantastic job putting it into a plushie form, and people love it, so it's fun.
Okay, so let's get onto your Current talk, because you're talking here tomorrow, right? Second day of the conference.
Yeah.
What are you going to be talking to people about?
I'm going to talk about how can we borrow best practices from chaos engineering to data engineering.
Cass engineering?
Chaos engineering.
Oh chaos engineering, sorry. Yeah. Okay, can we?
Yes.
How?
Yeah, so the quick answer is yes. The longer answer is our session tomorrow, so let's keep it brief. Basically, we have different principles, and the principles say: don't ever trust your testing environment. You have to start testing and doing automations in production so you can develop the confidence and remove the weaknesses from the system. Similar aspects apply in the data space. The only thing we need to tune is: what are our data requirements day-to-day? So what is the data product that we're producing, what do we expect to see in the data, what is a good, stable, healthy state for our data? Once we figure that out, we can leverage the different tools, like branching out of the data, really testing everything that we want to do, and then making decisions and qualifying that the data reached the desired state at the end of the day, together with the code that produced it. So we're actually removing weaknesses from the system, because we're finally able to see how the code is changing the data in the different stages.
Right. So hang on, let me see. I don't think I've entirely got that. So chaos engineering is like this idea that in production, you pull the plug on one of the production machines to test that your system actually recovers and can cope with a machine going down?
Mm-hmm.
And you're saying there's no point doing that just in test, you've got to be brave enough to do it in production to see if it works, right?
Yeah.
Otherwise you'll never know if it actually really works until the day things blow up. So I get that part. You're saying... how does that work with data? Is that like we try and insert bad data and see if our schema mechanism actually works the way we think it does, things like that?
Similar aspect. So chaos engineering was brought to the world because of distributed systems. We wanted to test our mechanism for recovering from failures, basically. And so if our system is resilient to failures, we should be able to recover no matter if you pull the plug. As long as there's one machine alive somewhere, we should be fine, at a very high level. So what happens with data is that as we're developing code, we need to make sure: oh, there are some schema changes, something happened in my code, I need to add validation, I need to put some quality gates in between the different stages so I will know the data is in a good state to move to the next step.
And many times we forget about it. We forget about interfaces between these different operations, we forget about making sure we are really getting to the end result of the data requirements and the data products downstream. And so lakeFS enables us to test it. And we know bad things are going to happen all the time in production, and the more we do that, the more we're able to figure it out and develop the code that can solve it automatically.
Give me a specific example, like what kind of data problem will you be testing?
Yeah, so one classic example is when we ingest data, let's say we're using Kafka. We're ingesting data into our data lake, and often we're going to save it in JSON. And JSON is a semi-structured format, basically, so it's very easy to introduce schema changes. It's very easy to make an [inaudible 00:19:13] turn into a string, to completely change the format, the type of the column. And so we want to be able to introduce the right mechanism for schema changes. We want to introduce the right mechanism to validate the different columns along the line.
So in order to do that, we need to be aware of these things happening to begin with. And many times, as engineers, we can't really cover everything, because it's very hard, especially when working with data. If you start covering all the potential cases that can happen with data, they might not necessarily be the cases for our system. So we want to be able to discover the weaknesses of our system, and we want to be able to describe the weaknesses of our system as the data changes through time. So it might be that I'm introducing another source, ingesting new data from a new source. It might be the beginning, so it'll be a rough start; we still haven't figured out all the things we need to connect it, and we need to make sure it really gets in at the right time. And this is where the system becomes unstable many times.
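As a tiny example of the kind of quality gate being described here, the following Python sketch checks a batch of ingested JSON records against an expected schema and catches the classic case where a numeric field arrives as a string. The schema and records are invented; in a branch-based workflow a check like this would run before an ingest branch is merged into main.

```python
import json

# The contract we expect from the ingested JSON -- an invented example.
EXPECTED_TYPES = {"user_id": int, "amount": float, "country": str}

def schema_violations(raw_records: list[str]) -> list[str]:
    """Return a description of every field whose type drifted."""
    problems = []
    for i, raw in enumerate(raw_records):
        record = json.loads(raw)
        for field, expected in EXPECTED_TYPES.items():
            value = record.get(field)
            if not isinstance(value, expected):
                problems.append(
                    f"record {i}: {field!r} is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    return problems

batch = [
    '{"user_id": 42, "amount": 9.99, "country": "MX"}',
    '{"user_id": "42", "amount": 9.99, "country": "MX"}',  # int became a string
]

violations = schema_violations(batch)
if violations:
    # In a branch-based workflow, this is where you would refuse to merge
    # the ingest branch into main until the upstream producer is fixed.
    print("\n".join(violations))
```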
So we don't have to pull the plug on data, although we can. Although we can. But my perception is, there are enough issues already happening in our production that we can start working on right now.
Yeah, that's true. Just the simple act of maintaining data quality over time is a job we could use more tools for, let's say that. We could definitely have more tools available in our toolkit to deal with this evolving mass of data we want to deal with, right? Okay. Yeah, I can see that.
So that's a good point. I don't think it's necessarily more tools, it's more about tools that enforce best practices. A lot of engineers today are copying production data. Either they're copying it to their local machines, and then it's a problem, or some of them will actually create another bucket, another S3 bucket, and copy it to a new bucket, and that's fine. And they're able to test their code, develop their confidence, they feel much more comfortable, they're able to do the whole process and really make sure they're removing the weaknesses from their system.
The challenge is, there's no one managing it. There's no one system. You hook together some scripts, and then every engineer can create their own branch. Imagine a data lake, it's petabytes of data being replicated by a hundred engineers...
Yeah, you can't do that. You can't have 100 engineers replicating, yeah. Even if you could, the time it would take to copy it for all of them would kill you as well.
So this is actually happening with some of our customers.
Really?
They used to copy all their data again and again, and creating a data environment to test their logic would take them a good couple of hours, just to create the environment.
Every time.
Yeah.
Oh, God. And then you're saying you can do it instantly.
Yeah.
Because you're essentially just a pointer.
Yeah.
Yeah, okay.
On demand.
Yeah, that's good. You could've led with that, that's a nice feature. Okay, so that leads me into, what kinds of customers have you got? You said like drug trials. I would assume a certain amount of banking.
Yeah.
They're dealing with large historical datasets. Do you have a typical kind of customer or industry?
Everyone who is producing data products at the end of the day. Most likely, if the company is selling a data product: analytics, dashboards, research for medicine, cancer, a lot of these, and they need regulations. Yeah, every company that produces a data product.
Okay, that's lots of people, an increasing number of people these days. So that should keep you busy. I have to ask this, is there any particular synergy between lakeFS and Kafka? Do you have particular support I should know about?
That's really interesting, we just had a lovely conversation with a bunch of people during lunch about that. So because Kafka is never on its own, there are always more moving parts along the line, the ability to create a testing environment that has Kafka there as well can really help smooth the whole process. So once you have that, you're able to have Kafka as part of the larger data architecture or the data platform, and then you'll have the testing.
And more specifically, what we've learned is it's on ingestion. But it's not connecting directly to Kafka, it's Kafka saving data into a data lake, and then on ingestion we're able to detect if anything got changed in the schema, or anything... well, it doesn't detect it on its own, but it gives you the data versioning capabilities so you can actually connect the dots and roll back, revert to the specific change that got into the system before we delete it basically, or even after we delete it, because there's still an older version that we can-
Yeah, and you can always roll back to that point in time and grab that and see what happens.
Yeah.
That leads me on to another thought I had, which is, so an interconnected web of different systems, and you mentioned having different databases and microservices. Is there a lakeFS thing that somehow coordinates all those different datasets as a single thing? Could I snapshot my S3 store and my MySQL database, and say I want a single point in time when they were both exactly the same?
That would be a really, really cool feature. At this point, it's only object stores. So everyone who implements the S3 API, which is a very basic API. Even Azure Blob is-
Yeah, [inaudible 00:25:18] de facto standard, the S3 API, right. Yeah.
Yeah. So everything that implements the S3 API, we can version it.
Okay, yeah. I imagine that's a larger and larger thing. And as you're saying, you can always dump things across to that and bring it in.
Yeah.
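As a sketch of what that S3 compatibility looks like in practice: lakeFS exposes an S3-compatible endpoint where, as I understand it, the bucket name is the repository and the first element of the key is the ref, which can be a branch, a tag, or a commit ID. The endpoint, credentials, repository, keys, and commit ID below are placeholders, so check the lakeFS docs for the exact addressing scheme.

```python
import boto3

# Point an ordinary S3 client at the lakeFS gateway instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",     # placeholder lakeFS gateway
    aws_access_key_id="ACCESS_KEY_ID",         # placeholder credentials
    aws_secret_access_key="SECRET_ACCESS_KEY",
)

# Read the current state of a dataset from the main branch...
main_listing = s3.list_objects_v2(Bucket="analytics", Prefix="main/events/")

# ...or pin a read to an immutable point in time by using a commit ID
# (or a tag) as the ref instead of a branch name.
pinned = s3.get_object(
    Bucket="analytics",
    Key="c2f1a9d/events/2022-10-01.parquet",   # placeholder commit ID as ref
)
print(main_listing.get("KeyCount"), len(pinned["Body"].read()))
```

That second read is the "roll back to that point in time and see what happened" idea from a client's point of view: the same tools that speak S3 today can read any historical version by changing nothing but the ref in the path.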
I wonder, you can get your product manager to pay me for this later. I wonder, because we've got this thing in Kafka with tiered storage, where the hot set stays on an SSD, and all the older data is now being archived off to S3. So anything that isn't the hot set is an S3 object. So there must be some kind of synergy waiting to happen between Kafka and lakeFS, right?
Yes. So looking at Kafka Connect, right? We can define the store, right? There can be a store URL in Kafka Connect, and my store URL can be a link to my lakeFS; basically, how I connect to it is just a URI. So I can connect to it, and if I need to save data for a later process or something along those lines using Kafka Connect, from a producer point of view, I can definitely do that.
That's a thing I can do today?
Yeah.
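For the curious, here is a hedged sketch of what that might look like: registering Confluent's S3 sink connector through the Kafka Connect REST API, with its store.url pointed at a lakeFS S3 gateway instead of AWS. The hostnames, repository, branch prefix, and topic are placeholders, the Connect worker is assumed to already hold credentials for the lakeFS endpoint, and the exact connector properties should be checked against the S3 sink documentation.

```python
import requests

CONNECT_URL = "http://localhost:8083/connectors"   # placeholder Connect worker

connector = {
    "name": "orders-to-lakefs",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "orders",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
        # The "bucket" is the lakeFS repository; the directory prefix selects
        # a branch, so ingestion lands on a versioned ref, not a raw bucket.
        "s3.bucket.name": "analytics",
        "topics.dir": "main/topics",
        # Point the connector's S3 client at the lakeFS gateway instead of AWS.
        "store.url": "http://lakefs:8000",
    },
}

resp = requests.post(CONNECT_URL, json=connector)
resp.raise_for_status()
print(resp.json())
```

The effect is that records land under a versioned ref (main/topics/... in this sketch) rather than in a plain bucket, so the ingested data can be branched, committed, and rolled back like everything else in the repository.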
I think more for the future, I could see that. Is there anything else that we haven't covered that I need to know about lakeFS-
Oh, there's so many things.
... or Adi Polak?
So many things. lakeFS is awesome.
You seem happy there.
I'm very happy.
You're enjoying your work.
Yes, it's a lot of fun. Some of the things I love are great people and great tech, and when you have those two come together, it's wonderful. It's like when magic happens. So this is fantastic. My book is-
You've got a book? I didn't know.
Yes.
Tell me.
Scaling Machine Learning with Apache Spark. It starts with Apache Spark and everything we know and [inaudible 00:27:20] around it, and it goes all the way to Spark ML. And then for people who want to do deep learning, it bridges into deep learning platforms. So it actually takes you through the process of building the data in the right way so you can actually leverage that with Spark, and then TensorFlow, and then PyTorch, all the way to deployment. So I'm very excited about that.
Oh, are you writing that now? It's available now? What?
I finished writing it; the O'Reilly team is still organizing the format so it will be ready for print soon.
Oh, cool. Is that your first book?
Yes.
You're about to be a published author. Are you going to send me a signed copy of the book?
Of course.
Okay, great.
Yeah.
I'm going to look forward to that.
Thank you.
Adi Polak, thanks for joining us on Streaming Audio. It's been cool.
Thank you so much for having me Kris, was wonderful.
Cheers.
That was Adi Polak. And Adi, if you're listening, I'm looking forward to my signed copy as soon as it's ready. I have every intention of paying for it of course, I just want it personalized. As I say, we recorded that podcast at Current, which was our big conference we put on in Texas at the start of October, and the idea for Current was we would take Kafka Summit US but increase the scope to talk about Kafka as usual of course, but also the whole of this emerging real-time, event-based big data space. And for me, I think it was a huge success. It really delivered on Kafka Summit++. We had some familiar faces talking about more familiar technologies. We also had plenty of new ones talking about new things around the ecosystem, and perspectives on it that I hadn't heard before. And that's always the point of conferences for me, the ideas you hear and the people you meet. So I have every intention of getting as many of those onto the podcast for forthcoming episodes as I can.
I hope we see you there at Current '23, and in the meantime, I shall as always remind you that Streaming Audio is brought to you by Confluent Developer, which is our site that teaches you everything you need to know about Kafka, whether you're getting started or you want to dig into some deeper architectural guides.
There was a new course launched recently by our own Adam Bellemare on event modeling. That's worth checking out. So take a look at developer.confluent.io. And if you're there and learning about Kafka, you may need to spin up a Kafka instance. So for that, head to Confluent Cloud, which is our managed service for Kafka. It is the easiest way to get started with Kafka, and it's also one of the easiest ways to scale with Kafka. If you sign up for an account, you can add the code PODCAST100, and that will get you $100 of extra free credit, so give it a look.
And with that, it remains for me to thank Adi Polak for joining us and you for listening. I've been your host Kris Jenkins, and I'll catch you next time.
Is it possible to manage and test data like code? lakeFS is an open-source data version control tool that transforms object storage into Git-like repositories, offering teams a way to use the same workflows for code and data. In this episode, Kris sits down with guest Adi Polak, VP of DevX at Treeverse, to discuss how lakeFS can be used to facilitate better management and testing of data.
At its core, lakeFS provides teams with better data management. A theoretical data engineer on a large team runs a script to delete some data, but a bug in the script accidentally deletes a lot more data than intended. Application engineers can check out the main branch, effectively erasing their mistakes, but without a tool like lakeFS, this data engineer would be in a lot of trouble.
Polak is quick to explain that lakeFS isn’t built on Git. The source code behind an application is usually a few dozen megabytes, while lakeFS is designed to handle petabytes of data; however, it does use Git-like semantics to create and access versions, so adoption is quick and simple.
Another big challenge that lakeFS helps teams tackle is reproducibility. Troubleshooting when and where a corruption in the data first appeared can be a tricky task for a data engineer, when data is constantly updating. With lakeFS, engineers can refer to snapshots to see where the product was corrupted, and rollback to that exact state.
lakeFS also assists teams with reprocessing of historical data. With lakeFS, data can be reprocessed on an isolated branch before merging, to ensure the reprocessed data is exposed atomically. It also makes it easier to access the different versions of reprocessed data using any tag or a historical commit ID.
Tune in to hear more about the benefits of lakeFS.