In this week's Streaming Audio, we're talking to two platform engineers, Doron Porat and Liran Yogev, about how their platform and their understanding of what they're supposed to be building have evolved. Because a few years back, the heart of their platform was Apache Spark. They very reasonably thought that their job was to make life easy for Spark developers. But as they started to succeed with that, they realized there's actually a bigger picture going on when you're developing a platform team. Because the people writing data into the system may be getting their job done, but they aren't really successful until the people reading it back out for whatever purpose are successful. The readers need to be enabled by the writers.
They need to be able to answer questions like, what data is available? What does it look like? Where did it come from when I need to debug it? What's the quality of this data? Readers need tools to be able to answer those kinds of questions, these lineage schema, governance questions without having to have a conversation with the writers every time. That takes a good platform team seeing the bigger picture.
That's what this week's episode is really about. Leveling up from being a first-order platform team that's enabling individual departments to get their work done to being that second-order platform team that keeps the whole company moving. It needs some technical changes. It needs some supporting tools. It needs some mindset changes around the organization. Doron and Liran are going to talk us through their journey on it. In some style, I have to say, these two have got a really good rapport. It was a great, fun conversation. Streaming Audio is brought to you by Confluent Developer, more about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
I'm joined today by two partners in crime, Doron and Liran of, is it Yotpo? Am I pronouncing that correctly?
Yeah, 50% of us are from Yotpo. Yeah.
Yes. The other 50 are from a company called ZipRecruiter.
ZipRecruiter, okay. We're going to have to go through that. But we're mostly talking about the data infrastructure stuff you've been doing at Yotpo, right?
Yeah. We're going to talk about both, I suppose.
This is going to be great. You've already hijacked the direction we were going in. It would be more fun, I can tell.
Yes, this is going to be an improvised session.
Okay. In that case, let's start off. In two sentences, let's start with you, Doron. Tell me what Yotpo do.
Okay. Yotpo is a B2B SaaS company. We build an e-commerce marketing platform that serves e-com businesses with different products that we offer as part of the platform, with very strong synergies among them. These products can be a reviews solution, communication channels such as SMS messages or emails, a customer data platform to do segmentation over referral programs, loyalty and so on and so on. Yeah, we're based in Tel Aviv, but we're a global company.
Okay. If I understand that from a man on the street perspective, you are doing, if I've got an e-commerce shop, you're the people who would do a mixture of customer analytics, remarketing, support messaging and all those things I need to make the business actually marketable when I'm selling stuff.
Yeah, but it's a SaaS solution that's like on-site widgets and all kinds of B2B interfaces that include analytics as well on all your data. But yeah, that's basically the point.
That's one of those things. I think from my point of view as a programmer, it's almost more interesting, the infrastructure. There's a huge amount of data flying around. From a programmer's perspective, it's a huge infrastructure business.
Yes, for sure.
As much as it is a marketing platform, right?
Yeah. We have all the B2C activity with the shoppers and visitors, that's a huge amount of data. Then again-
Yeah, Christmas traffic.
... we have all the orders, products. Yeah, we have November holidays coming in.
Oh, yeah. You've got Black Friday and Christmas coming up and all that fun.
Yes, Cyber Monday.
We hope we all have that, right?
Right. Good job we're recording this podcast before that tsunami arrives.
They're going to be super successful. It's fine. Nothing bad is going to happen, right?
Yeah. We're going to understand how your infrastructure can scale to that kind of load.
Then we move on to you, Liran. You were saying your ... Oh, God, I've already forgotten the name. I'm terrible.
Yeah, we used to work together up until three, four months ago. Then I left. I went to work for ZipRecruiter, where I've been working for the past couple of months. Yeah, but we're still working together. We have a podcast that we host together. It's in Hebrew. You can check it out if you like. It's called Data Swamp. But other than that, we're still buddies, I suppose.
You did a talk recently at Current about the infrastructure for Yotpo and moving into data modeling infrastructure choices you made along the way, right?
Yes. It's the modeling platform that we've dreamed of while we're doing things otherwise.
Maybe we should start there. Give me an idea of what state it was in before you started on this reinfrastructure project.
Okay. Basically, we started in 2016, I think, when we built this self-service modeling infrastructure based on Spark.
You can call it just ETL.
Yeah. It's an ETL framework. We based it on Spark. All people had to do was write their SQL into YAML files and configuration, spin up their Airflow DAG or whatever it is. They can run streaming jobs as well. We were very, very focused on enabling generalist developers in building their own data pipelines to feed our data lake and to offload all kinds of processes that were very heavy on the service side onto the data lake and analytics, of course. Because it also addressed BI developers, analysts and whoever can write SQL basically.
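To make the SQL-in-configuration idea concrete, here is a minimal sketch in Python. It is not Metorikku's actual file format or API: the `PIPELINE` structure, the step names, and `run_pipeline` are hypothetical, a plain dict stands in for the YAML file, and SQLite stands in for Spark so the example runs anywhere.

```python
import sqlite3

# Hypothetical pipeline definition. In a real setup this would live in a YAML
# file; a plain dict keeps the sketch free of third-party dependencies.
PIPELINE = {
    "steps": [
        {
            "name": "orders_per_store",
            "sql": "SELECT store_id, COUNT(*) AS order_count, SUM(amount) AS revenue "
                   "FROM orders GROUP BY store_id",
        },
        {
            "name": "top_stores",
            "sql": "SELECT store_id, revenue FROM orders_per_store "
                   "WHERE revenue >= 100 ORDER BY revenue DESC",
        },
    ],
    "output": "top_stores",
}


def run_pipeline(conn, pipeline):
    """Register each step as a temp view so later steps can select from it."""
    for step in pipeline["steps"]:
        conn.execute(f'CREATE TEMP VIEW {step["name"]} AS {step["sql"]}')
    return conn.execute(f'SELECT * FROM {pipeline["output"]}').fetchall()


# SQLite is only a stand-in engine for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 60.0), ("a", 70.0), ("b", 20.0), ("c", 150.0)],
)
result = run_pipeline(conn, PIPELINE)
print(result)
```

The person writing the pipeline only touches the SQL and the step names; everything about how it runs is hidden in the runner, which is both the appeal and, as discussed later, the source of the coupling.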
Back then and up until a few years later, we were really focused on just build, guys. Just build, give us more.
I think we were so proud of this. We actually open-sourced this project because we thought everyone can enjoy it. We actually did talks about it. We were really proud of this project. It was super successful, by the way. We had hundreds of data pipelines created this way.
What's it called?
It's called Metorikku.
It sounds like somewhere I could go on a holiday.
Or if you like a place called Metric in Japanese, fine, you can go on holiday there, yes.
It's actually metric in Jap ... It's true.
Oh, I see. Okay. Metorikku.
Yeah, it's catchy.
Is it catchy though?
I wasn't expecting to learn a little bit of Japanese in this podcast.
Yes. I see, I see. We don't know Japanese. It's important for us to know ...
Just this one.
Yeah, it was very, very successful. But I think that with time, I think it's a global data thing that we kind of understood that there's more to it than just enabling people to ...
Yeah. That's another thing. We were super, super focused on the producer side, which I think was juvenile. We just wanted to be popular and keep developers happy, building those pipelines. But we didn't really think of the whole downstream effects of whatever pipelines we were creating.
Yeah. You can simply call it a data swamp. Of course, it wasn't that extreme, but we lost control over what was being created and produced. All the different producers, they were writing these Spark jobs, they weren't really aware of what exists elsewhere. I could be building a certain pipeline and Liran would build the exact same pipeline only with slightly different names or slightly different metrics. We wouldn't be aware of each other. It's really a matter of governance, but also a matter of understanding the holistic picture of what happens to the data, where it comes from, where it goes, who's using it and consistency in that sense.
Yeah. There was no centralized place to create those. To see what there is, so not really a good ...
Actually decentralized it all the time.
Because again, we were so focused on our engineers, we wanted to give them freedom like microservices. They can create whatever they want in their environment. They don't really care about each other, so you don't have reusability. You don't have a centralized space where you see everything. You don't understand how to consume data. Again, we were missing a lot when we wrote this.
You've gone from that first order problem of we can't even write the data to, oh my God, we've written all this data and we don't actually know what we've got, right?
Yeah, exactly that.
The second order problem is actually managing it at scale.
Yeah. I think maybe two other things. One, we also started to think about developer experience, how fast it is to ship data to, I don't know, to production and have it ready and available all the way up to analytics or the application that's using it. We wanted to improve that.
The last thing I think is the coupling that we created between our business logic, like the organizational business logic and Spark, which is great, but I think that for a really long time we were positive Spark is the way to go and that we shouldn't look right or left. But as time went by, we started seeing that it's not the perfect solution for everything. Then we realized that everything we've written down in Metorikku is Spark. It's Spark SQL, but it's totally coupled to the underlying technology.
I think it's really apparent in the streaming world where Spark's Structured Streaming, which is the Spark way of doing streaming, is basic in a lot of ways. It's not as advanced as other streaming engines. That's where we thought, "We want something else here. We're going to have to write it outside of our common way of writing data pipelines, which we really hated because wait, so now we're going to have two methods of writing it out." That's the challenge where it triggered us. "Okay, there's something wrong here."
Right. Yeah. Just quickly, for those that don't know it well, how would you categorize Spark? What does it do? What's its strength and weakness?
Spark is a distributed compute engine. It is, I suppose helpful to perform batch operations on top of data, on top of small to big data. It does that pretty well. It has a lot of optimization along the way to help create very complicated data pipelines on top of very large data sets. It's infinitely scalable.
Batch and micro batch.
Batch and micro batch, yes.
Yes. Batch and micro batch, yep. We were recently thinking about trying to coin the term nano batch for five records at a time and really ...
I like it. It's actually a good solution for a lot of problems.
Yeah. I think there are a few things that are problematic with Spark. Some are solvable, others only with other technologies, some aren't. But first of all, it's not that easy to understand for people who do not know and understand Spark, which is part of the thing that we did. We made Spark available. But once things break, they go to the data team and say, "Oh, it broke. Can you help me read the logs? I don't understand what happened."
You need to be able to dig into the process to understand where things went wrong and what you should change and all those beautiful configurations that you put into your job and configure it, you have to ... We really tried to generalize this. I think we did a really good job in generalizing, but still you have those edge cases where you need more and it's not for the common developer.
Yeah. For example, when I moved to ZipRecruiter, it's very different here. Here, we have people called data engineers. It's different than what we used to have at Yotpo. These are people who are really experts in Spark and experts in data technologies, there's a lot of knowledge to have. When we try to make it simple, we actually reduce a lot of the knowledge you have to know, but also a lot of the really cool features that you can do if you know Spark and if you understand the technology better. This is different here where we work.
Yeah. You gained velocity across a large crowd of developers that can move forward now. But you meet it again in terms of cost and performance ...
Yeah, and scaling the separation in a responsible manner. The second thing I think is that sometimes we would feel that Spark acquires a lot of resources for something that would not require that many resources under a different technology. Sometimes it just feels really bulky and heavy.
Give me an example.
This huge hammer that you can put on a lot ... I think a lot of organizations have this. You can solve a lot of problems with Spark. It can do everything. You can call APIs and you can use it when you do streaming. You can do really, really weird things. By the way, on top of really small data sets in some cases, but it's also a very generic way to solve problems.
But as Doron said, it is extremely heavy. There always has to be a cluster. The cluster has this minimum size that it needs. It has a long startup time. If you look at technologies that try to solve things faster, you can take KSQL for example, it's very different from that. Spark is a more heavy lifting type of tool.
I remember working for a company years back where they would end up doing a Hadoop MapReduce job that took an hour to run to add up a thousand rows to get a total.
Spark is the next generation or previous, I don't know how to call it, but it's better than MapReduce, but still it's another type of tool that is ...
A sledgehammer to crack a nut.
Yeah. I think it's ...
Yeah. But they're improving in that area. I don't think it'll remain the same over the years. But right now ... Sorry, you were starting to say ... I'm so sorry.
I don't remember what I wanted to say. It's okay. But no, I agree with Liran on what he said. But I think we saw, even in terms of streaming, that we had a lot of problems with Spark as well because of the micro batching and scaling wasn't perfect. We were basically running our own self-managed Spark clusters, still do in most cases.
We always had problems with scaling well and reacting to changes in the amounts of data coming in and in the oncoming batches, which we didn't want to solve with just adding more machines to ignore that problem. Streaming started not to feel really like streaming. I remember years ago when we tried to address the problem of upserting or updating data files in the data lake, we wanted to update Parquet files with CDC data. We used Spark with Hudi. It just didn't feel right. Things have changed. That's three years ago, but the batches were so long and we were running streaming jobs. It took 40 minutes to process a batch.
Yeah, that was insane.
Yeah. That's where we felt that maybe different workloads require different solutions, which is different than what we used to think up until then.
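For readers unfamiliar with the upsert problem mentioned above, the following Python sketch shows only the semantics of applying CDC (change data capture) events to a keyed table. It says nothing about how Hudi, Spark, or Upsolver implement this; the event shape (`op`, `key`, `row`) and the `apply_cdc` function are assumptions made for illustration.

```python
def apply_cdc(table, events):
    """Apply change events in order; the latest event per key wins."""
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            # Deleting a key that is already gone is a no-op.
            table.pop(key, None)
        else:
            # "insert" and "update" are both treated as upserts.
            table[key] = event["row"]
    return table


# Current state of a hypothetical orders table, keyed by primary key.
table = {1: {"id": 1, "status": "new"}}

events = [
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 1},
]

table = apply_cdc(table, events)
print(table)
```

On an in-memory dict this is trivial. The difficulty described in the conversation comes from doing the same thing against immutable columnar files, where changing a single row can mean rewriting a whole file.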
Right. Yes. You were trying to one size fits all to solve this problem. What did you do? What was your first step to move away from that?
We didn't start by moving away from this specific problem about Spark. Spark is still major in both companies by the way. It will continue to be, because again, it's a really great tool to solve a lot of different issues, but we can use other tools that are more hyper-specialized for specific use cases.
By the way, Yotpo, one of the tools that we actually did use for streaming and one of the problems that Doron just mentioned about the updating in the data lake and updating really fast, arriving data from CDC streams, we used Upsolver for that because they did a really good job with the minimum amount of resources needed for that. That was a really good solution back then.
But again, we didn't come to this directly because we wanted to just switch to another technology other than Spark. We started looking at the entire problem that we talked about, governance and losing all that metadata, and about understanding and reusing more and more data sets and not recreating them by different teams.
Just a small remark. It's not that we wanted to replace Spark altogether or do it now. We wanted to have the ability to have this agility.
Yeah, the freedom, thank you. That we can move away from Spark and not lose all the beautiful stuff that we've created. We want all the infrastructure to be in place and not to be totally coupled to Spark.
The mission was actually decoupling and not switching to a different technology. Once we did a decoupling, now we have that freedom.
I think it's funny because one of the things that we used to say to ourselves all the time in the infrastructure group, "We're all about decoupling. Everything in the data platform is decoupled. Yeah, every solution is very dedicated around a specific need and a requirement." We didn't even notice that we had this crazy coupling in the modeling area. I think this is the biggest aha moment we had.
By coupling in the modeling area, do you mean that your models were tightly coupled to Spark's way of doing things?
And reading on Spark SQL and everything.
It's not just about Spark. It was coupled to the execution. It could have been a different technology. We could have used Snowflake, for example, to run our batch or streaming jobs, but at that time it was Spark. But again, part of the configuration of a model was, how are you going to execute it? What's going to be the output format? These are things that should probably be decoupled from your business logic. That was where we started from.
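One way to picture that separation is to keep the model definition and the execution settings as two independent objects that only meet at run time. The sketch below is an illustration under that assumption, not the design either company uses; the `Model`, `Execution`, and `plan` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Business logic only: what the dataset is, not how it is produced."""
    name: str
    sql: str


@dataclass(frozen=True)
class Execution:
    """Execution concerns only: engine, mode, and output format."""
    engine: str
    mode: str
    output_format: str


def plan(model, execution):
    """Combine a model with an execution profile at run time."""
    return {
        "model": model.name,
        "sql": model.sql,
        "engine": execution.engine,
        "mode": execution.mode,
        "output_format": execution.output_format,
    }


revenue = Model(
    name="daily_revenue",
    sql="SELECT day, SUM(amount) AS revenue FROM orders GROUP BY day",
)

# The same model paired with two hypothetical execution profiles.
batch_plan = plan(revenue, Execution("spark", "batch", "parquet"))
warehouse_plan = plan(revenue, Execution("warehouse", "batch", "table"))
print(batch_plan["engine"], warehouse_plan["engine"])
```

Swapping the engine changes only the `Execution` object; the model definition is untouched. In practice SQL dialect differences make this harder than the sketch suggests.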
You've ended up with a mixture of data and how to deal with it rather than just pure data. I always think that data is the one thing that decouples, right? If you can make the connection between two systems, just data, that's as decoupled as you can possibly get.
I think that we also thought about, "Okay, eventually, we're just creating Parquet files." Anyone can use these Parquet files. It's not coupled to a certain engine. But going a step back, the actual modeling itself, the actual process of ingesting and digesting the data, that part should be decoupled. That's the part that we wanted to keep across different implementations.
I want to also add, because I don't think we talked about it before, but the focus on consumers is something that was really important. It's not just about the decoupling, it's about also being able to describe things so the consumer can consume data better. You talked about data being the ultimate decoupling, but what is data? Data, by its basic properties, is just a bunch of files that have maybe a schema defined somewhere. Someone can consume it with some different technologies.
That was not enough for us. For example, one of the things you want to know is, "Okay, who is the owner of this data? Who owns this data set? How can I contact them?" Or for example, "What is really the contract for this dataset?" Another question I want to know, is this field aggregatable? Can I do a sum on top of it? Can I create my ... It's these questions about consumption that we really didn't answer when we just write this dataset somewhere and someone can consume it.
I feel like we were talking on a ... Just finish one more thing. Yeah. Data catalogs, they kind of solve this because it's much more observable and you have all the information there. You can consume it. But I think it's more of the notion for the data producer to be aware of the consumer and have this discussion while composing this data pipeline and not after the fact.
This is reminding me of data mesh ideas. This is one of the ideas of data mesh, where you treat your data as a product that you plan to make available.
Yeah, this is kind of the same, but I think our thoughts came in before. No, I don't know.
We invented it. We just had a different name, not as good.
We are very lazy, so we didn't really read data mesh, the entire thing. But once we saw data mesh was out and people talking about it, we were like, "Oh, that's kind of what we were doing."
Yeah. It really resonates with what we do. But I think there's another thing that's very important to note. It goes for data mesh and whatever it is that we are trying to do. Infrastructure is one thing. It should be opinionated and really well thought out, but then you have processes and organizational processes that you have to take care of to bring this to life. Infrastructure is not enough. I think that that's also a concept that data mesh is constantly talking about. This goes for every important thing that you want to do. Like a transformation for the organization that you want to drive.
Yeah. I actually find this reassuring about data mesh that you could easily dismiss data mesh as just a buzzword. But the fact that so many people seem to be independently rediscovering some of these principles makes me think there's genuinely something in it that we need to be paying attention to.
Oh, it's very real. It's problems ...
It's like the conclusion you'll come to if you keep searching.
Yeah. I think it happens when the organization increases in size and increases the amount of data. Not the actual size of data, but actually the amount of data assets that you have. It becomes so complicated and the ownership model just doesn't work anymore if you do that.
Ownership. We didn't talk about the ownership. It's a big thing also in data mesh and what we were trying to do.
Yeah, that's probably the ultimate goal for everything is that we want people to really own their data. Then we ask, what is actual ownership? Again, just to answer your question. Yeah, it definitely resonates and it's definitely reassuring. I think in our podcast as well, we hear so many organizations struggle with these problems. Because as it scales, it just becomes this really insane thing and very complicated thing to produce and consume data.
Yeah, it can be. If I'm understanding this correctly, in your case, you've got a particularly thorny issue in which your producers and consumers aren't actually developers in your company. They can sometimes be third parties writing jobs. Is that fair?
I think for both of our companies, it's not one of the problems. One of the problems is, and I think that again, it's the size of the organization that dictates this, is that so many people are able to produce data. We should be allowing them to do that. If we were an unhealthy organization, we would allow only a specific set of people to write data.
Then it will be probably easier to do everything because it's always like this bottleneck. But you go through everyone and they're a bunch of people that are really interconnected. But in organizations, I think so many different types of skilled people ...
Yes, personas can create data. We want them to create data because they're doing it for different levels of organization. Engineers produce data either to make their product data accessible, to create different features on top of data or do ML and data science things. While analytics wants the data to do analysis, to answer to reports. They also want to create these data sets. They may not be for many, many use cases, but they may fit your use case or other people from their department use cases.
Then you have BI developers or data engineers, which are also another beast in this area that produce data for different use cases.
I really like to call it, we're talking about data democratization all the time and people should have access to the data, but this is more than that. It's like data tooling democratization. Everyone has the right and the ability to build their own data pipeline and we should enable that. It's part of enabling growth.
To answer your question, we don't have third parties creating data inside our systems, but it's just a lot of different people writing data.
It's so many different teams that in a way it structurally almost behaves like external. A large enough company is actually several small companies quite often.
Okay. I understand that now. As you say, there's a lot of different personas that you have to get them to think about the data they're happily writing differently and provide them and the consumers more awareness. How do we even begin to tackle that?
I'm actually working right now on this framework at ZipRecruiter to help with this process. But I think that this word process is the beginning of everything. Hopefully, the infrastructure and the tooling can help with facilitating this. But the first thing you need to do is ask these questions. You have to talk to people. It's just like designing any type of product. You say data as a product. It's a real thing.
Product managers, analysts, BI developers, whoever is writing the data should own it. The first part is actually talking to people, understanding the consumption patterns and making sure that in the end, this dataset will be usable. What our infrastructure is actually doing is pushing into that direction in a way that you're going to have to describe a lot of different things when you write a data set. You're going to have to describe the owner. You're going to have to describe the schema. You're going to have to describe how you do aggregations.
If you're going to do all that, you should probably be aware of your consumers. One other part that's very important is that, since we did a really good job with this new framework that we built, there's lineage, so you know who consumes you and you know what kind of things you consume. The lineage part is super important for everyone to be aware of their surroundings. Again, it's a process where the infrastructure helps push that.
I find that super interesting. Because it's like, how can you begin to care about the quality of the data that you're writing until you know who cares about reading it?
Exactly. I also think about the concept of shift left, which is getting bigger and bigger in the industry. Eventually you come to a conclusion that unless you shift left, there's no way to overcome these issues. The further left you shift it, the better.
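The requirements just described (an owner, a schema, aggregation rules, and lineage) can be sketched as a small dataset descriptor. This is a hedged illustration only: the `Dataset` fields, `validate`, and `consumers_of` are hypothetical names, and neither the dbt setup at Yotpo nor the protobuf-based one at ZipRecruiter is claimed to look like this.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str                                       # who to contact about it
    schema: dict                                     # column name -> type
    aggregatable: set = field(default_factory=set)   # columns that are safe to sum
    inputs: list = field(default_factory=list)       # upstream dataset names


def validate(ds):
    """Return a list of problems that should block publishing the dataset."""
    problems = []
    if not ds.owner:
        problems.append("missing owner")
    if not ds.schema:
        problems.append("missing schema")
    unknown = ds.aggregatable - set(ds.schema)
    if unknown:
        problems.append(f"aggregatable columns not in schema: {sorted(unknown)}")
    return problems


def consumers_of(datasets):
    """Invert the declared inputs to get, per dataset, who consumes it."""
    downstream = {ds.name: [] for ds in datasets}
    for ds in datasets:
        for upstream in ds.inputs:
            downstream.setdefault(upstream, []).append(ds.name)
    return downstream


orders = Dataset("orders", "checkout-team", {"id": "int", "amount": "float"}, {"amount"})
revenue = Dataset("revenue_daily", "bi-team", {"day": "date", "revenue": "float"},
                  {"revenue"}, inputs=["orders"])
churn = Dataset("churn_report", "analytics", {"customer": "string"}, inputs=["orders"])
broken = Dataset("scratch", "", {"a": "int"}, {"b"})

lineage = consumers_of([orders, revenue, churn])
issues = validate(broken)
print(lineage)
print(issues)
```

Because every dataset declares its inputs, the owner of `orders` can see that two other teams depend on it before changing anything, which is the awareness the lineage discussion is about.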
I'm going to make you define shift left as a movement.
Oh, I'm going to, yeah. Actually, I borrowed it from DataHub. I think they were the ones that used this term first, I think. But I really like it. It actually means that you don't need to, or you can't, fix all the data issues and data corruption discrepancies on the analyst side, on the consuming application side. The leftmost place is where you fix this, which is on the producer side, on the service side, on the database side. The actual developer that now creates this new feature, which is going to create this new stream of data, has to be the one that is fully aware of the quality of the data that they're creating. If there are any data issues, they should be solved at the source.
Then I think this is the true key for building a healthy system. Otherwise, more and more people are wasting more and more time trying to figure out and fix the same problems that other people are going to solve because you're not solving the problem close enough to the source. But this is ...
Push the solution upstream, right?
Yeah, exactly. Then you have many solutions that people have to apply. They look different. You don't really solve the problem from the root cause. This is complicated. This is a big thing to say because it also brings much more responsibility to the application side where they don't necessarily care exactly about some data duplication that doesn't affect the application side. This can lead to another culture where you create a different clean set of data that you feed analytics with, which could be different than the data that is used for the application side.
But I think at the bottom of all of this, it's a matter of understanding how your data is being used and taking ownership over it and realizing this is an asset that you control and you produce.
Yeah. It's really a part of process and culture more than it's technology. The organization needs to want this and understand why it cannot work any other way.
Invest the time in it.
Yeah. Both in tooling and in all the different people that are actually producing data. They have to do a lot more work right now.
Because there's always the risk in a less healthy organization, that you go to the people writing the data and they say, "It's fine. I've been writing it that way for three years. It's absolutely fine. Why should I change?" What's their problem?
I won't call it risk, I'll call it reality.
No, but I think unfortunately, and I think we see it in both organizations is it's a top-down thing that has to happen. The organization needs to realize that it cannot work that way. For example, in our organization, we're going to start this process called certification of a data set. It has to be certified. Again, we can use the data catalog and the different tooling and help the producers make that a quick and really easy process.
But again, we now have a set of expectations from data consumers that they only consume certified data. That certified data has to uphold a bunch of different things. That's one way to solve this. We know there are many different ways, but again, it's a process way to solve this issue.
I think that where infrastructure comes in, in what you just described, is how you create these processes and change things in a way that would not hurt the velocity of the organization drastically.
Yep. That's our role, I think in the organization. Make it transparent.
Just making more work for ourselves. Yeah, it's just to make sure that we have something to do in the organization.
By the way, it's not just velocity. I think one of the things that we actually love to do, and I myself am really a fan of this, is making this a fun process.
A developer happiness.
Yes. Oh my god, I love it. No, but it's super important because ...
Did I make it up?
It's yours. It's yours now. Developer happiness.
You tried picking it.
Yeah, it's true. She's really looking forward to this.
Trying all the time.
But I think this is a really important thing: when we ask these developers, "Oh, now you have to document a bunch of things that you weren't documenting before," or that there was no expectation for, they'll be like, "Oh, but it's so boring. I'm not even a consumer of this data. Why should I care?" You have to make this into a fun process that's very easily implemented. They have these different visualizations to see if this process actually worked. If they understand what's in it for them, that can really help with the process.
This reminds me of something like Swagger. The quality of people's documented REST APIs went up after Swagger became mainstream. It's like, "Okay, so it's no longer a horrible chore and a maintenance burden forever." Is that something you've done with the tool? Something like that, have you done with the tool you've been working on?
I think we have different use cases. In Yotpo, what we did was utilize a tool called dbt to document all the different data sets, which is really nice. It has a lot of difference. There's a UI on top of it with a data catalog. It's really easy to write and test. We really invested and you can see our Current talk if you want, you can see a demo of what it can do.
Link in the show notes to your talk, you gave to Current.
Yes, will do. That would be Yotpo. At ZipRecruiter, for example, we're doing the same but using the protobuf format. That's actually very similar to Swagger. We're going to document all our data sets with protobuf and again, making it very easy to document things and making it very testable and CI/CD-oriented and everything. But again, protobuf is going to be the format to document all that.
Yeah. It's a concept of having an actual contract that both sides sign on and understand and respect.
Yeah. It's funny. Sometimes in this industry, we avoid saying, "This is static typing, isn't it?" These are strong types of your data. Yeah, it's the same idea.
Let's not open the subject.
It's kind of worms.
Can of worms. [inaudible 00:32:24] types, let's move on.
This is the thing during my career and Doron's as well. Are we starting? No, I've been moving back and forth on the topic of schema. Schemaless, kind of typing. I think by the way, the world is going back into strong typing both for just programming and also for data. Because I think it's just really messy if you don't have that. If the organization is really big, then it's even too messy in a way that it can really cause production issues if you don't have that in place.
Schemas are really important, but I'm going to add something to that. That's also important for me is that it's a question of where that control over the schema or the data types is.
I have an example. No, finish your sentence before I'll give an example.
It's something I haven't finished talking about.
No, I know.
No. In the infrastructure world, it depends on the organization. But at Yotpo, we used to have all of the data pipelines go through us. It's not us. It's a virtual us. But we were the ones in charge of what's behind the scenes, running all the different ETLs and data pipelines. Whenever something was bad, it came back to us. If you have a streaming job and something is like this, for example, someone made a non-backward compatible schema change ...
That was the example.
Oh, okay. Sorry.
Okay. Then it all comes back to us. "Oh, so fix it for us. It's your problem. You are the infrastructure. I made a mistake. Deal with it." I think this is where we want things to be different. That schema enforcement or backward compatibility and everything, needs to be at a level where everything that's below that level can still be schemaless, fluid and free and not having all these operational issues. It's still on the producer side or the shift left, whatever Doron said before, to make sure that they have their contract. But it's not on the infrastructure side.
Yeah, I think ...
You don't want to be an infrastructure service. You need to be a platform.
Exactly. I think exactly what Liran said, just to follow this through, what we did back then, we had a certain issue where we were streaming CDC data for our purposes, to feed live tables in the data lake. But suddenly, "Whoa, this is super interesting data." All these consumers said, "I want this too. I want in." Then they started consuming from the same topic. Then we started getting heat from the compatibility issues because changes were being made to the schema and we, as a CDC consumer into the data lake, did not really care about it.
So we said, "Okay, no compatibility. We don't care about this." Then we started breaking consumers. Then the architecture that we came up with is actually separating into different Kafka topics and using MirrorMaker to replicate the topics onto the consumers' brokers for them to manage their own schema compatibility levels, which solved the problem. You can manage whatever schema compatibility that you like. I think this is a really good example of how to distribute the responsibility for whatever changes happen.
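To make the "non-backward compatible schema change" concrete, here is a minimal Python sketch of the kind of check a producer could run before publishing a new schema version. It is an illustration only, not how Yotpo or any schema registry implements it: the `Field` class, the `backward_incompatibilities` function and the example fields are all hypothetical, and the rule is deliberately simplified (a reader on the new schema must be able to read old data, so an added field needs a default and an existing field must keep its type).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified field description (not a real registry type)."""
    name: str
    type: str
    has_default: bool = False


def backward_incompatibilities(old: list[Field], new: list[Field]) -> list[str]:
    """Return reasons why a reader using `new` could not read data written with `old`."""
    old_by_name = {f.name: f for f in old}
    problems = []
    for f in new:
        previous = old_by_name.get(f.name)
        if previous is None:
            # Old records do not carry this field, so the reader needs a default.
            if not f.has_default:
                problems.append(f"added field '{f.name}' has no default")
        elif previous.type != f.type:
            problems.append(
                f"field '{f.name}' changed type {previous.type} -> {f.type}"
            )
    return problems


# Illustrative schemas only.
OLD = [Field("id", "long"), Field("email", "string")]
NEW_OK = [Field("id", "long"), Field("email", "string"),
          Field("plan", "string", has_default=True)]
NEW_BAD = [Field("id", "string"), Field("email", "string"),
           Field("plan", "string")]

ok_problems = backward_incompatibilities(OLD, NEW_OK)
bad_problems = backward_incompatibilities(OLD, NEW_BAD)
```

In the split-topic setup described above, each consuming team would decide for its own replicated topic whether a non-empty result blocks the change or is simply ignored.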
Can I cut? A split ownership model. I have another example, but I don't think we have time. I don't know.
We've got time. I love an example. I love making it concrete.
Okay. I actually have an interesting example. That's actually from a production side, not from a classic data lake side. We have Elasticsearch for our CDP. We built our CDP, our customer data platform, at Yotpo. Think about it as a centralized place where all of the events from the organization are flowing in.
Then you can do aggregation and analysis on top of it. You can cluster different customers and send them, for example, this was mainly used for messaging. You want to find all the customers that have a specific set of properties. The database at this point was Elasticsearch. It was getting fed from many, many, many different data producers, the data.
We have producers that were creating data on their side, they're writing their contract, for example there. Then just feeding it to some kind of a very complicated Kafka-based system that was feeding Elasticsearch. Elasticsearch was actually a very problematic infrastructure at that point because it is not a schemaless tool by design. It looks schemaless, but it's not. It still has a schema behind the scenes.
You cannot change a type for a field after the fact. You're going to have to create a new index. That's just an example of something that I actually was interrupting. Then they were like, "Oh, okay, I have control over my schema. It's okay. I want to delete it. I want to create a new one, or I want to change something that's not compatible or even want to push something that's even from third parties, which I don't even have control over its schema." There's like a schemaless world happening.
Now, that's an example for our third party producer.
Oh, yeah. We wanted our customers to send their own custom events to the CDP and then Elasticsearch, let's think about it as a part of an infrastructure piece of the puzzle was not even allowing that. That was where I think, I don't know if they're changing into a different engine behind this.
No, because you left.
I left. I don't know what's going on right now. But again, this was an example of how an infrastructure can really break that model. You want the schema, but you wanted it at the consumption level. Elasticsearch was also in the producing level. We're really affecting the entire way the system can work and limiting it. That's like an example. Yeah.
Oh God, you untie knots like that though.
Yeah. This was too much for me. Is this too much?
I'm getting a sense that this is a slightly sore point that he's abandoned you.
No, I just miss him. I just miss him.
Yeah. I think for example, in this case, we were looking at technologies to allow us to feed schemaless information or change the entire data model. It's always like that. You always have to think about, "Maybe I've done this entire data design wrong. I have to change it." In this example, maybe we should have split into different indexes in Elastic based on the event type and just do a dynamic creation of indexes based on the schemas. Or maybe we should have used a very generic schema and everything is a string.
Then at the consumption layer, you're going to do something called schema on read. You can use that feature or use a technology like Rockset, which allows you to feed schemaless information and then do schema on read, whatever you want in a very streaming way. It's a very cool technology. There were a lot of different solutions back then. But again, it just shows how important it is to have an infrastructure that's not really opinionated on how to actually consume the data.
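The "everything is a string, then schema on read" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `read_with_schema` function, the event fields and the `ORDER_VIEW` schema are hypothetical, and it does not represent how Elasticsearch or Rockset work internally. The storage side accepts any string-valued event; each consumer applies its own types when it reads.

```python
from datetime import datetime


def read_with_schema(raw_event: dict, schema: dict) -> dict:
    """Apply a consumer-owned schema to a stored event whose values are all strings.

    `schema` maps a field name to a callable that parses the stored string.
    Fields missing from the event come back as None instead of failing.
    """
    typed = {}
    for field, cast in schema.items():
        value = raw_event.get(field)
        typed[field] = cast(value) if value is not None else None
    return typed


# Stored events: no types are enforced at write time (illustrative data).
RAW_EVENTS = [
    {"customer_id": "42", "order_total": "19.90", "created_at": "2022-11-01T10:00:00"},
    {"customer_id": "43", "points": "120"},  # a different producer, different fields
]

# One consumer's view of the data; another consumer could define its own.
ORDER_VIEW = {
    "customer_id": int,
    "order_total": float,
    "created_at": datetime.fromisoformat,
}

orders = [read_with_schema(event, ORDER_VIEW) for event in RAW_EVENTS]
```

The trade-off is that type errors surface at query time for the consumer who cares about that field, rather than at write time for every producer.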
Yeah. I think a lot of the time, it's a journey. A lot of the things we couldn't have known at the starting point: how things are going to grow and what the needs and the requirements are going to be. This is why we have to keep agility and flexibility in mind and actually work plans to fit the solution wherever it goes.
Yeah. If you make a decision like this, this can be a really, really expensive decision that you make. If the system is like we're building it right now and what we've worked on together at Yotpo is you have this decoupled system, then the decision is not as expensive as it can be. You have that decoupling in place. You can always switch to a different thing. There's always going to be a price for every change, right? It's never free. But how costly will it be? That's up to if you did create this really good infrastructure ...
It sounds like you've been through a bit of an evolution in the wars and you've got some ideas. But I know that you are trying to turn this into a system and open-source it. Tell me a little bit about your open-source plans so other people can benefit from this.
Interesting, interesting. I no longer work for Yotpo so, do we have open-source plans?
That's the rumor I hear.
We have plans. That's a rumor we're trying to spread. At the moment, we're really focused on winning with this new platform and making this. Working on feature parity with Metorikku in what we have today in order to replace it substantially. That's what we were super focused on now, but the feedback, what we are getting through the demo and we talked to people and we showed it in a few meetups and shared it with people, people are excited. What we do in Yoda, it really gives a more holistic view on what dbt we're trying to do, which we really liked and really loved it.
I want to reiterate. Yoda is a project that we started off with dbt. dbt, it's called Data Build Tool. It's a data modeling infrastructure open-sourced by a company called dbt Labs. They also have a managed solution. It's really the de facto solution for data modeling today. It supports so many different engines. It was really fitting to our use case.
With Yoda, what we did was, okay, so dbt is great and a very basic part of what we need, but doesn't solve everything. It's not really about the execution layer. It doesn't really do well for orchestration. You can see our talk about this. With Yoda, we're trying to solve this from one side, from the developer experience side. Again, as Doron said, we want to win, we want people to move to Yoda. We created this really nice way of interacting with it, which is better than what dbt proposes as its main interface.
Then the second part is to wrap it up with a CI/CD and orchestration layer. That helps do very complicated things automatically behind the scenes. This is what Yoda does.
Does it help at all with the whole governance and lineage angle?
Yeah. But lineage is basically something you get out of the box from dbt because everything works with inherent references between different data models. That's really nice. By the way, it's a different kind of lineage because it's not created at run time. It's just created from the actual code, from the SQL that you are creating.
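As a rough illustration of lineage derived from code rather than from run time, the Python sketch below scans dbt-style SQL for `{{ ref('...') }}` calls and builds a dependency map. It is a simplification and an assumption-laden one: dbt builds its graph by compiling the project, not with a regular expression, and this pattern only handles the single-argument form of `ref`. The model names and SQL are made up.

```python
import re

# Matches the single-argument form {{ ref('model_name') }} only (a simplification).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def static_lineage(models: dict[str, str]) -> dict[str, set[str]]:
    """Map each model name to the set of models its SQL references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


# Hypothetical models, keyed by name, with their SQL bodies.
MODELS = {
    "raw_orders": "select * from source_db.orders",
    "raw_customers": "select * from source_db.customers",
    "customer_orders": (
        "select c.id, count(*) as orders "
        "from {{ ref('raw_customers') }} c "
        "join {{ ref('raw_orders') }} o on o.customer_id = c.id "
        "group by c.id"
    ),
}

lineage = static_lineage(MODELS)
```

Because nothing has to run to produce this map, it is available at review time, before a pipeline is ever deployed.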
But that's also very good, solid lineage that you have. That part is done. I think the part of the data documentation that we really, really care about, we added our own automations in order to make this easier for the developers. Because it's really tedious to type in all these YAML files and all these descriptions. We wanted to make this easier for them and, as Liran said before, fun, to encourage them to do so.
Because we talked about it a lot. But I think that during the process of developing a new data pipeline, you're all into it. You know what it means, where it comes from, who to talk to. When time goes by, and especially when you leave, there's no one else to tell the story. There's this pipeline running there and no one really knows how to attend it, where it comes from and what it's used for. This is a very critical point, when a pipeline is being developed and created, to capture all this metadata.
It's just not letting that all go to waste.
You're just collecting it in a smart way.
They already have that information in their head. They just need to put it somewhere.
By the way, dbt gave us out of the box this beautiful documentation site where you can see all the lineage and documentation, which is awesome. Definitely this is where we're starting with. But it goes without saying that we need something bigger like a data catalog solution, for example, ZipRecruiter are using DataHub. We need something on top of that to serve the whole system and bring us end-to-end to map out the whole data platform from sources to the different applications consuming it. That's the full vision of where we're going.
If we were talking about open-sourcing it, we would open-source probably ...
We would open-source.
They would open-source.
You can be a contributor.
Yeah. No. Okay. Let's not talk about that.
You signed your rights away when you left. I'm sorry.
No, I'm building something much better here. We're going to call it Zoda. No, not really.
Yoda's Yotpo ....
Did you need a divorce lawyer or something?
Yeah, we did.
Anyways, there's a piece where it's the interactive CLI tool that's really useful that Doron just described, where it's auto-creating a lot of the different documentation pieces. Again, we're using a lot of the data that's already out there. It's knowledge that you already have in your head. It's knowledge that's already out there in the catalog or in the metastore. As we collect that, we can create that automatically and generate that for you. This is a lot of auto-generating pieces.
Then I think the second piece they'll probably open-source is the orchestration level, which is the way to automatically map how to run and when to run the different data sets. All we ask, for example, from our developers, from there, it's so confusing, from developers in general, is, "Tell us how fresh you need this data set to be." Is it going to be a streaming dataset? Is it going to be something you're going to run once a day, once an hour? What are your requirements?
Then behind the scenes, since we have the lineage and we have this knowledge, we can auto-generate the entire orchestration. We can create Airflow DAGs. We can create streaming pipelines. We can do whatever we want. All you have to do is just tell us what you need. We'll take care of that. Again, you see that there's a really nice decoupling of the execution and the actual business logic.
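A minimal Python sketch of that idea, generating an execution plan from lineage plus a declared freshness requirement, might look like the following. It is not Yoda's implementation and it does not call Airflow; the `plan` and `schedule_for` functions, the freshness buckets and the dataset names are all hypothetical. The one design choice it illustrates is that an upstream dataset has to run at least as often as its most demanding consumer.

```python
from graphlib import TopologicalSorter


def schedule_for(minutes: int) -> str:
    """Hypothetical mapping from a freshness requirement to an execution mode."""
    if minutes <= 1:
        return "streaming"
    if minutes <= 60:
        return "hourly"
    return "daily"


def plan(lineage: dict[str, set[str]],
         freshness_minutes: dict[str, int]) -> list[tuple[str, str]]:
    """Return (dataset, mode) pairs in an order where upstream datasets come first.

    `lineage` maps each dataset to the datasets it reads from. Every dataset
    must declare a freshness requirement in `freshness_minutes`.
    """
    order = list(TopologicalSorter(lineage).static_order())
    effective = dict(freshness_minutes)
    # Walk downstream-first so each requirement propagates to its parents.
    for name in reversed(order):
        for parent in lineage.get(name, ()):
            effective[parent] = min(effective[parent], effective[name])
    return [(name, schedule_for(effective[name])) for name in order]


# Illustrative inputs: developers only state how fresh each dataset must be.
LINEAGE = {
    "raw_orders": set(),
    "daily_revenue": {"raw_orders"},
    "live_dashboard": {"raw_orders"},
}
FRESHNESS = {"raw_orders": 1440, "daily_revenue": 1440, "live_dashboard": 1}

execution_plan = plan(LINEAGE, FRESHNESS)
```

A real platform would translate each entry into a concrete artifact such as a scheduled DAG or a streaming job; here the plan is just a list, which keeps the business logic separate from whatever executes it.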
Yeah. It's the actual abstraction. If we want to create this workflow management, this workflow for this specific pipeline or pipelines, then the developer doesn't necessarily need to know what is happening behind the scenes. It's another thing we can spare from them, which now they need to know how to build an Airflow DAG, but it's completely useless.
It's useless information. Yeah. Why would they need to know that? No, it's boring in a way that it doesn't really help them with ...
Yes, programmatically, very not exciting. It's another trademark.
Liran, would you like to commit Doron to a release date for that?
Yes. They were planning to release it in the beginning of January 2023 for the new year.
Excellent. Amazing Christmas.
I hope they'll invite me to the party. But I don't think they will.
2023, you said.
Yeah, that's in one and a half months.
Yeah, take your time.
Don't hold your breath. But it's coming.
Depending on when we broadcast this podcast. You may have already missed that release date.
No, it's a really cool project. I'm super proud of the work that we did together. They're doing it right now. Really, I'm rooting for the win because it's a win for all the people that actually do platform engineering. Because this is the ultimate data platform engineering effort. I would really love to see it win.
Yeah. Another thing that we basically invested a lot in is that basically, dbt is really built for data warehouses. It's really custom for it. We wanted to open it to something that's much more flexible and different and to fit it into our data lake and open data platform needs. I think this can address a lot of people in the industry. The use case is something that's very common.
Yeah. If I could put a cap on this, I'm going to tell you what I think I'm going to take away from this and you can tell me what I've missed for the conclusion. The overarching theme here is find a way to get that feedback loop from your consumers to your producers. Give the producers the ability to see the larger picture without distracting them with lower level details, like how systems will be executed. You've got to get them to focus on the data pipeline from a 10,000-foot view. The only way you're going to get that shift is you have to have management buy-in from the top.
Is that a fair summary of the picture, a three-minute sketch?
Yeah, it is. I think if you see our talk, it's a story about creating data platforms and the different decision making that you have along the way and how they can really affect the culture for the organization. These two things you just said are completely right. We didn't have our management support or the entire organization support in the beginning to all of the different things we wanted. Therefore, we had to work from the bottom up.
We had to work from our engineers. We created Metorikku, which is very successful. But again, it wasn't talking to analytics or to different data consumers. We were really focused on the engineering side. That part is really right.
I don't remember the other.
This feedback loop from how it's being used to how it's being produced.
I think that's the part that we're both relearning right now as we go into this and in both of our companies is the consumption of data is super, super important. Just putting data out there is not enough. You have to have more information about it. That feedback loop is really important. You have to talk to consumers to understand what they need and to really, really understand what are the expectations from data.
Even if we, let's say we did create some really good data set and it takes about 20 minutes to query, is that a good data set? I don't know. I don't think so. It's not about just the metadata. It's also about how people are actually going to use it. What kind of questions they're going to ask.
If you are not thinking about it or you're thinking only about yourselves and your team, then you're missing the point. Then they're going to have to create these duplicated data pipelines to summarize your data or to create some shadow data pipelines on top of your data to make it smaller, whatever.
I think ... Oh, sorry, go on.
I just want to say that I think it's also a story about maturity and how we matured as the data group and how Yotpo matures and ZipRecruiter matures in terms of how we treat our data. We also talked about it at our talk, that we talk a lot about governance and governance tooling. That's a big topic in data infrastructure, but we don't talk enough about our role as data governors. I think that this piece of infrastructure really had a fancy English word, which I forgot, but I'm going to say it's in a silly English word. But I think that this is ...
We have so many of those.
No, I had a really nice one in my head, but I think that this piece of infrastructure really shows what we believe in and what's important for us and our job as educators for our organization and how we can push this agenda through this.
That was your big word.
No, it was a much fancy word. I'll write it down for you later. Agenda's not a big word, I know. Shut up.
Email it through. We'll edit the show note. Just randomly.
No, just to put a text to ...
Just shut up.
Okay. I think it's a shame you two are no longer working directly together. You should find a way to fix that.
Yeah, we know.
But in the meantime, thank you so much for joining us. Liran and Doron.
It's been a pleasure.
See you guys.
Thank you, Doron and Liran. This is a bit of a tangent, but I'm going to go there. I have always wished that the Agile Manifesto explicitly mentioned feedback loops. Feedback loops are in there. If you read between the lines, they're there. But it'd be nice if it were explicit. Because so often we return to this idea. A large part of the job of being a programmer is to get faster and better feedback loops, to iterate better, to give people more of what they need quicker.
Sometimes we do that by leaving the desk and actually talking to different departments and talking to users and understanding them better. But sometimes it's about building tools that we can use to skip over that conversation and get everybody the answers they need fast. You need both. My point is, you might think that a platform team could just hide in the server room and keep things running. But to be good at it, they actually need to be involved in feedback loops. They need to be enabling feedback loops for others. That makes a really great platform.
Speaking of feedback, as we were, we love feedback here at Streaming Audio. Please do get in touch. There are like buttons to click, comment boxes, review boxes and share buttons if you like that kind of feedback. Or if you want to get in touch with me directly, you can find me on Twitter if it's still running there. At the time I'm recording this, it seems a bit uncertain how much longer Twitter will last. But assuming it's there, I'll be there too. Come and find me.
If you are ever in London, I've recently started running some monthly hack nights in London, so look us up and you can join us for a bit of programming. With that said, before we go, Streaming Audio is brought to you by Confluent Developer, which is our free site that teaches you everything we know about Apache Kafka and realtime event systems in general.
Check it out at developer.confluent.io if you want to get started with Kafka or if you want to level up your Kafka skills. To do that, you're probably going to need a Kafka cluster. The easiest way to spin up one of those is to go to Confluent Cloud and use our Cloud Kafka service. That's our platform. We're quite proud of it.
You can sign up in minutes. You can have an Apache Kafka cluster running in no time. If you add the code PODCAST100 to your account, you'll get $100 of extra free credit to run with. With that, it remains for me to thank Doron Porat and Liran Yogev for joining us and you for listening. I've been your host, Kris Jenkins. I will catch you next time.
In this episode, Kris interviews Doron Porat, Director of Infrastructure at Yotpo, and Liran Yogev, Director of Engineering at ZipRecruiter (formerly at Yotpo), about their experiences and strategies in dealing with data modeling at scale.
Yotpo has a vast and active data lake, comprising thousands of datasets that are processed by different engines, primarily Apache Spark™. They wanted to provide users with self-service tools for generating and utilizing data with maximum flexibility, but encountered difficulties, including poor standardization, low data reusability, limited data lineage, and unreliable datasets.
The team realized that Yotpo's modeling layer, which defines the structure and relationships of the data, needed to be separated from the execution layer, which defines and processes operations on the data.
This separation would give programmers better visibility into data pipelines across all execution engines, storage methods, and formats, as well as more governance control for exploration and automation.
To address these issues, they developed YODA, an internal tool that combines excellent developer experience, dbt, Databricks, Airflow, Looker and more, with a strong CI/CD and orchestration layer.
Yotpo is a B2B, SaaS e-commerce marketing platform that provides businesses with the necessary tools for accurate customer analytics, remarketing, support messaging, and more.
ZipRecruiter is a job site that utilizes AI matching to help businesses find the right candidates for their open roles.