When you're talking about data mesh, and talking about data mesh is a thing we in the Kafka community are doing a lot these days, you're probably talking about the work of Zhamak Dehghani. Or, if you're lucky, you're talking to Zhamak, which is what I get to do today. We talk about data mesh. She's the creator of the idea, and she gives a great introduction to it as a conceptual framework, grounded in enough implementation concepts to make it real. If data mesh is a thing you need to understand, listen to this episode. She gives a great explanation.
Also, I can't fail to mention that Streaming Audio is brought to you by Confluent Developer, and that is a website: developer.confluent.io. I just like saying that, so I'm going to say it again: developer.confluent.io. There are free video courses and a library of event-driven architecture design patterns. That library is actually pretty substantial right now, and it's something we're always building out more of. You can get a list of episodes of this podcast if you want. So, all kinds of great things; check out Confluent Developer. And if you listen to the end, I'll give you a discount code that gets you extra Confluent Cloud usage when you sign up and do the courses and exercises there. So without further ado, let's get to my conversation with Zhamak.
Hello, and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I am super happy to be joined in the virtual studio today by Zhamak Dehghani. Zhamak is the person best known for the topic of data mesh, and that's what we're going to talk about. Zhamak, welcome to the show.
Hi Tim. Thank you for having me.
I'm excited to have you on, and I neglected the whole "say where you work and what your title is" part. Now, you happen to work at Thoughtworks, and I know titles at Thoughtworks are kind of a thing, but what do you do there?
Oh, at the moment, I'm kind of focused on the emerging technologies in North America and really creating a process that allows other people to come up with new and novel ideas like I did with data mesh.
Oh, nice. Oh, okay. That's very cool. Actually, I didn't know that. That sounds like a thing Thoughtworks would do with a senior person. That's awesome. So let's talk about data mesh. Now, I'm not just going to ask you what it is; we'll get there. But I think the best way to attack this if you're new... I mean, everybody listening, if you haven't bumped into it as a buzzword before, you have now, but you probably have. I think the best way to understand it is in terms of the story. So tell us about analytics. Tell us about data. Where have we been? That is, maybe, the thing that we want to change. Tell us about the bad old days of analytics.
Sure. Maybe I'll start a little bit from the beginning. I think for almost half a century, we've had ambitions around putting data to work to make better decisions and to serve our customers better. But the progression that we see is that we kind of started with data warehousing, which was supposed to serve business intelligence: making strategic decisions based on reports that were being created on a quarterly basis, a monthly basis, and so on. And as time has gone by, we have found ourselves with much more audacious kinds of use cases for data, right? Today we naturally want to do all the things we did with business intelligence, but we also want to use data for designing our products, for optimizing our services, optimizing the workflow, giving superpowers to our employees. So we really want to embed [inaudible 00:04:08] and analytics in every aspect of our business.
And not only that, we want everyone in the organization to be able to use data. We really want to be data-driven. So if you have a product manager, they need to understand the behavior of their customers and users based on the data; they need to embed intelligence in their products, with personalization, recommendations, and all of those aspects. We want every engineer to be able to build a solution with it. So the complexity of our ambitious plans for using data has grown. The complexity of those use cases has grown. And on the other hand, I think what we've found is that to truly build those ML models or analytical kinds of functions, we need data that comes from many, many different sources, within the organization or outside of it. So we have been tackling the scale of the data over the last few decades, starting with the [inaudible 00:05:17] case of volume: we tackled volume with Hadoop, distributed file systems, and those sorts of systems.
Mobile came along, and then we had a problem with velocity, and streaming technologies (the Confluents and Kafkas of the world) came along. And now we have a problem with the diversity of the data, the origins of the data, and where those sources are. They can be anywhere on this planet or beyond. So how do we meet those ambitious goals, given the complexity and dynamic nature of organizations, and given the proliferation of this ubiquitous data? What solution do we need to build?
Right. Something different than what we've had. So, just thinking about what you said there: when contrasting the way we've been doing things with whatever revolution we're currently trying to unleash, I always want to say it's not that we were doing it wrong. It's that the concerns have changed. The environment has changed. Consider the classical ETL data warehouse world: put yourself a few decades ago, in the set of concerns that it grew up as its own revolution to address. You might picture an oak-paneled room and a person in a suit, maybe some gray hair, with a phone, giving orders to buy and sell and ship. And then somebody hands them a report, and they need to look at this report so they can make decisions about what to buy and sell that day, or what the business should do that day.
And I'm playing it up a little bit, but that was what data warehouses were for. There was a decision-maker, and she or he needed to know what was going on yesterday so they could give good orders today. And you said some things that I think are key here, which is that that's fine. Nobody's going to stop wanting that. That wasn't a bad thing to want. It wasn't a bad thing to build tools for. It was a very, very good thing. But data now is getting driven into operations, so that people maybe at the bottom of the org chart, who are just doing stuff, need the analyzed activity of the business put in front of them for the decisions they're making.
Yeah, absolutely. If you imagine this picture of data coming from so many different sources, and it needing to be used by so many different people in so many different contexts and use cases, what's the best way to connect the two? Is the best way to connect them through a lake, a warehouse, an intermediate data team? Or is it really best to connect them in a peer-to-peer fashion?
Mesh implies many peer nodes that are interconnected in arbitrary ways. It's a graph. We don't know a priori what those connections are going to be, but we're going to make it so those connections can form.
Exactly. I mean, any system (and we talk about systems in an abstract way here; the system could be an organizational structure, or it could be a technical architecture), any system that needs to scale out, and we're at a point in time where we really need to scale out our solutions, needs to remove the points of synchronization, the points of coordination. If you think about the past technological breakthroughs we've had in terms of scaling, I can go all the way from networking with async I/O, to reactive programming, to how even parallel databases work. All of these systems have been able to scale out by removing points of contention and synchronization.
Okay. Make it so that they don't have to synchronize.
They don't have to have this intermediary broker or someone who makes decisions for them. And yet organizationally, we have designed ourselves so that we have this data team, this BI team, someone in the middle that gets the data from all these different places, makes it nice and usable, whether in lake or warehouse format (and those are different), and nevertheless puts it somewhere that everybody can now come and use. So that system, organizationally and architecturally, simply doesn't scale out for where we are today and where we are heading.
You're right. You could just look at that from an organizational block-diagram perspective, very high level, and you ought to be able to smell something. And sure enough. So I want to get back to that. I want to try to... and we'll see if I remember this; I don't always remember. But I want to try to get back to the data team, and how we get them on board.
But what happens now? I've got the picture in my head; I want to make sure listeners also have the picture. There was this thing that was designed around daily reporting. Now we need it to be driven down to an operational level, which is going to change the kinds of systems we're going to have. And also, we don't want a big funnel of things that we extract, add value to, and then put somewhere.
We want a web or a mesh of things that are talking to each other. So, that makes all the sense in the world on an abstract basis. But how do you do that? So this is kind of the what is data mesh question. Those are the concerns that motivate it, and that's the problem that you're trying to solve, but what do you build?
Yeah. And they are hard problems to solve. I'm not pretending that we have actually solved all of the problems technologically, but I think we can start at a high level with how we can make it possible, and then go into the details of what practically needs to be done.
But at the level of principles, the things that need to happen are a shift in the culture, as well as in technology, so that the data can be shared for analytical purposes right at the point of origin. We've always had this idea that data, for it to be used for training machine learning models or [inaudible 00:11:55] querying reports, has to move somewhere else to be ready for consumption. So, first and foremost, there's this idea of domain ownership: that the domains of the business that are now producing data as a by-product of running the business are responsible for sharing that data in a way that can satisfy those analytical modes of access. Now, we can talk about what those modes of access are.
But once you do that, how do you prevent the siloing of data within these domains? The second principle that data mesh introduces is that we need an architectural unit, a construct, as well as organizational accountability, to share this data as a product. So data is no longer just bits and bytes on a stream or on disk; it's actually packaged in a way that delights the experience of a data analyst or data scientist. It can be found. It can be understood. It can be queried right at the point of origin.
And somebody is managing that as a product there. This is a thing that I export to the organization.
Exactly. There's no hot-potato game of, oh, I just run the e-commerce system, and I'm dumping my events on this beautiful stream that I have, and somebody else downstream will take care of turning that into, if it's a lake, lake [inaudible 00:13:25], or if it's a warehouse, some bottled water for other people to come and drink. Now I'm responsible, right at the source, to provide the data, whether it's in event-stream format, or tabular, or files. Whatever the format, I'm accountable for long-term ownership of this data: for it to reflect the reality and truthfulness of the business, with quality, and complying with some sort of global standard so that this product I produce, this data, is interoperable with the others.
Right. Right, okay. And that product thinking is orthogonal to the format of the data. Like you said, it could be tabular, it could be a stream.
Files... it could be an Excel spreadsheet on a SharePoint share. In fact, let's edit that out. That's terrible. (Narrator: it would not be edited out.) Okay, so the product thinking is really the sensibility change that happens in this. What do you call it? This domain-encapsulated [crosstalk 00:14:26]-
People building the thing.
Absolutely. And the moment you do that, you go, gosh, that's a lot of work. Like, how many days [inaudible 00:14:34]... how many data engineers do I need to go hire so that we can have these data products coming out of the domain? So from a kind of academic point of view, this is wonderful: you have these domains, they're sharing their data, they have data product owners. What could be better than that?
But this is the moment that the CIO or the CTO would say, how do I afford this? How is this going to be feasible? How many hundreds of data engineers do I need to go hire now? And I think this really gives space for innovation, for us to imagine the infrastructure and the technology that enable creating these data products without hiring hundreds of data product developers.
Those are your generalist programmers who learn how to use, of course, certain tooling, but we have created enough abstraction that they don't need to be specialized data engineers. So that brings us to the third principle of data mesh, which is reimagining our self-serve data infrastructure: using the technologies that exist today, but creating the abstractions that are missing so they can be used much more easily.
Right. Right. And this clearly creates a demand for those [inaudible 00:15:55], and that work is not yet done. That stuff needs to emerge now. But just like with any early adopters... there are a couple of things going on in my mind right now. Number one is, I spend a lot of my time trying to persuade the people who build things out of bits that event-driven architecture is a good idea.
And that's still a scary proposition, because nobody knows how to do it. Nobody has done it. Everybody has the sense that it's a good idea, but everybody's building their first system, and it's a scary time, right? So I'm, in a sense, trying to help mediate this burden that I am also putting on people, to let them know it's not that bad of a burden. Same thing here. Because, I mean, you did just describe it from a product perspective and, like you said, an academic, 10,000-foot architectural perspective. It is a glorious vision; who can deny it?
And the people who have to build things are like, well, crap, come on. But the early adopters now are the people who are building that infrastructure. Just like if you built your own Kafka Connect in 2014, or your own Kafka Streams in 2015. People did that, because they're cool, because they were there first, and that's fine. But now we've got these standard things, just looking at the Kafka ecosystem, that solve those problems. And we are in the early days here. We don't know those APIs yet. We don't [inaudible 00:17:21] those Spring mesh extensions to the Spring Framework, or whatever they have to be. I think the people who are catching this vision are the ones who are building their own buggy, partial implementations of the thing that, in five years, will be one of two or three competing standards for how you build this. At least that's what I see.
Exactly. I think that's why it's such an exciting time. How many times in a decade do we find ourselves at this point, with a blank canvas to reimagine and innovate? So I am super excited. And as you said, over the last few years, since this hypothesis came about within Thoughtworks (I kind of introduced it in 2018), we've been busy trying to implement it with our clients, using the technology that exists today. We're not a product company, so we just use the products that are out there, wire them together, and build some bespoke stuff on top.
And I can tell you, it's less a lack of technology than a need for new configurations: we still have to capitalize on what we have built. It's not that we're going to throw out all this research and technology; we have to configure it slightly differently and fill some gaps to build those platforms.
Yeah. Well, that's clearly what's coming next. I want to get as concrete as possible, just for a few minutes. In implementations that you've seen and participated in, what does it look like? I'll just ask some questions that come to my mind. To get very tactical about it, this seems like a schema problem, on the one hand. There needs to be some agreed-upon way for me to describe the stuff I'm exporting. Or do we not want that to be centralized? Does a data team or a product team or a business unit, whatever you call the thing that's building the application, exporting the data, taking a product mindset with respect to the data... should there be no organizational schema standards and format standards? Do they just publish that locally? How does the integration work?
Good question. Let's touch on the fourth aspect, which we didn't talk about yet and which is related to this: the idea that we need to have some form of governance. Interoperability, as you said, is now a big concern once you distribute the ownership. How do we allow these domains to go and define their own data model for their own domain? Because the e-commerce folks understand best how the user is interacting with their applications, so they can convey that information either as events or in other forms of structure.
Nevertheless, the semantics of that user's interaction with the e-commerce system still need to be communicated. So we can allow the domains to define their models independently, but in a consistent way, with the same modeling language. Because if I'm an analyst or data scientist and I want to consume data, not only about users' interactions with my e-commerce system, but also about the orders the customer has placed...
And also the demographics of that customer, which I'm getting from some other third-party source. If I want to correlate all of that data, I don't want to deal with three different semantics and three different syntaxes, right?
Exactly. So there are certain concerns that the governance needs to standardize on, and the platform needs to enable that standardization. That's why the role of the platform is so key here, and again, it's self-serve. So in our implementations, in fact, we created this thing called a data product quantum. I'm sorry for the name, but naming things is really, really hard.
It's hard and you might as well make it sound cool. So data product quantum.
Yes, it's like this cool DPQ thing. God, it even sounds worse.
No, no, it's better every time you say it. DPQ. Anyway, go on.
Essentially, it's a unit of your architecture that encapsulates the code that maintains, I guess, domain-oriented data. So it's the code that maintains that analytical data, plus access to the underlying data itself, over a long retention time, because many use cases need to go back and forth in time. And for that unit of architecture, what we have done for our clients is automate the scaffolding of it, the blueprint of what constitutes one of these things. And then you get a set of tooling from the platform to define your schema. You can standardize on different schema languages: your own JSON-based one, or, I don't know, Avro, whatever the standard is.
Specifically, whether it's Protobuf or JSON Schema, whatever.
Exactly. Whatever you decide. So you get those sorts of things that need to be consistent, those cross-functional concerns, from your platform. And as part of the scaffolding of your data product, you get access to those APIs and tooling from the platform, so that you can have a consistent schema language and a consistent way of making your API discoverable. So: here I am, the user-interactions-with-e-commerce data product. These are the datasets that I expose. These are the SLOs that I guarantee.
All of this kind of information makes that data really a data product. And here's the code that will maintain it. This is the code that gets the data from, let's say, an event stream from my e-commerce system, and then turns it into an analytical format. And this code runs in a CI/CD pipeline; it's deployed; it has testing; all of those good things.
So we encapsulate all of that into a unit of architecture. It looks ugly today. On the surface, you go, oh, I have this beautiful, single unit of architecture, but when you actually look under the hood, it's very ugly. It's not like microservices, where you have a Dockerfile and Kubernetes and you deploy this one thing. The logical encapsulation and the physical reality of the implementation are two very different-looking things.
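There's no standard implementation of a data product quantum today, so the following is purely an illustration: a minimal Python sketch of what such a self-describing unit might bundle together (the transformation code, the output ports it exposes, and their declared SLOs). Every class, field, and name here is hypothetical, invented for the sketch rather than taken from any real framework.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class OutputPort:
    """One mode of access to the product's data (event stream, table, files...)."""
    name: str          # e.g. "interactions-stream"
    fmt: str           # e.g. "avro-events", "parquet-table"
    schema_ref: str    # where consumers can find the schema
    slo: dict = field(default_factory=dict)  # guarantees, e.g. freshness

@dataclass
class DataProductQuantum:
    """Hypothetical unit bundling code, data access, and metadata for one domain."""
    domain: str
    name: str
    ports: list
    transform: Callable  # code that turns operational events into analytical data

# The transformation code is owned and versioned by the domain itself,
# not by a central data team.
def clean_interactions(raw_events):
    # keep only well-formed interaction events
    return [{"user": e["user"], "action": e["action"]}
            for e in raw_events if e.get("action")]

dpq = DataProductQuantum(
    domain="e-commerce",
    name="user-interactions",
    ports=[OutputPort("interactions-stream", "avro-events",
                      "registry://e-commerce/user-interactions/v1",
                      slo={"freshness_minutes": 5})],
    transform=clean_interactions,
)

analytical = dpq.transform([{"user": "a", "action": "click"},
                            {"user": "b", "action": None}])
print(analytical)  # [{'user': 'a', 'action': 'click'}]
```

The point of the sketch is the bundling: the domain deploys the transformation code, the exposed ports, and the guarantees as one unit, rather than handing raw events over the wall to a central team.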
Totally. To be fair, I think most microservices looked like garbage in 2014-ish, when we were three years into their life; you didn't have any kind of container standards, and you had services talking to each other through databases, and it was a disaster. And honestly, most people building services now would probably still say they look like garbage; that's how we as developers describe our work. But it doesn't sound bad. I really appreciate how concrete that is, and I'm going to make it more concrete. I mean, this is a Kafka podcast, so nobody will be surprised. I think ideas like data mesh are not successful if they hitch themselves to a single data infrastructure technology. You need to be agnostic with respect to that.
And you kind of need to be a layer above that. But if you're doing event-driven architecture and you're using Kafka, this would look like: here's a topic with a well-known name, and here's the AVSC file that describes the Avro schema, and it's published in a particular place. And there's probably some other kind of metadata, as a readme or a wiki page, just a little bit of "here are our guarantees about this." But what I really need to know is: how do I deserialize this stuff, and where is it?
And maybe I'm being totally optimistic, but just put that down on paper as an organization: you export your analytical data this way; you have to tell us the name of the topic; you have to tell us the schema. A little bit of product thinking, maybe you're bad at it, maybe you're good at it, but it's going to happen. That's an API that you've committed to now.
So that seems like that gets you there. And again, if you're doing something that's not Kafka, you could say all those same things.
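As a hedged illustration of that contract idea: an AVSC file is just JSON, so the published "API" of a data product's stream can be as small as a well-known topic name, the schema, and a pointer to its documentation and guarantees. The topic name, the wiki URL, and the toy `conforms()` checker below are all invented for illustration; in practice a real Avro, Protobuf, or JSON Schema library does this validation for you.

```python
# A hypothetical published contract for one data product's event stream.
contract = {
    "topic": "ecommerce.user-interactions.v1",
    "docs": "wiki://data-products/user-interactions",  # SLOs, ownership, etc.
    "schema": {  # an Avro record schema, as it would appear in an .avsc file
        "type": "record",
        "name": "UserInteraction",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "action", "type": "string"},
            {"name": "ts_ms", "type": "long"},
        ],
    },
}

# Map Avro primitive type names onto Python types for the crude check below.
AVRO_PRIMITIVES = {"string": str, "long": int, "int": int,
                   "double": float, "boolean": bool}

def conforms(record, schema):
    """Crude stand-in for real schema validation: every declared field must
    be present with a matching primitive Python type."""
    return all(
        f["name"] in record
        and isinstance(record[f["name"]], AVRO_PRIMITIVES[f["type"]])
        for f in schema["fields"]
    )

event = {"user_id": "u42", "action": "add-to-cart", "ts_ms": 1}
print(conforms(event, contract["schema"]))               # True
print(conforms({"user_id": "u42"}, contract["schema"]))  # False
```

Once that contract is written down, consumers know where the data is, how to deserialize it, and what the producer has committed to, which is exactly the "API you've committed to" point above.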
Yeah. And in fact, with data mesh, at least where we are today, there is no universal way of representing data and no universal way of querying data. So what we have built so far is that these data products may choose, especially as they get closer to the source (those source-aligned data products), to have a set of APIs that say: oh, I represent my data as time-series events, and here's where you can get to them; here's the topic address; here's my schema. And this is the access control. Like, you've got to actually...
Governance is a thing.
Authorize yourself to get access. And here's other information about it: here's my documentation, and so on. But at the same time, your data scientist somewhere down the line may not actually like that representation. So for the same data product, you may need a new projection of that data that is semantically the same (it still represents user interactions with your e-commerce system) but stores and provides access based on, I don't know, some sort of columnar [inaudible 00:27:28] format, a little more batched. Or the folks writing reports may not be happy with that either, and they might need some sort of relational database with SQL queries. And I know our technologies are trying to... even with Kafka itself, you guys are trying to have tables and SQL queries on top.
And hence this need for multimodal access to the underlying data. Data mesh doesn't dictate how we provide this multimodal access, but it acknowledges that for this to be a product, we have to satisfy a spectrum of data users. And that spectrum may like SQL, or may like to write Spark jobs, or may like to just subscribe to events and do the processing themselves. We've got to meet the users where they are today, with their native toolsets. So you find yourself with a single data product actually having separate ports, separate sets of APIs, for the different modes of access.
That makes a lot of sense. And that comports with the kinds of stories, I would say vaguely CQRS stories, that we tell in the event-driven architecture world, where you've got a log, and that's your system of record. And of course you can consume that into whatever, because you might need a relational view, you might need some transformed event view, you might need something text-searchable. That's not quite what you're saying, but it has the same basic shape on the whiteboard. So, that's pretty cool.
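That CQRS shape can be sketched without committing to any particular technology. In the toy Python below, a plain list stands in for the log (the system of record), and two functions materialize two different views from the same events. This is only an illustration of the pattern, not an implementation; the event names and views are made up.

```python
# The log is the system of record: an ordered sequence of immutable events.
log = [
    {"key": "cart-1", "type": "ItemAdded", "item": "book"},
    {"key": "cart-2", "type": "ItemAdded", "item": "pen"},
    {"key": "cart-1", "type": "ItemAdded", "item": "lamp"},
    {"key": "cart-1", "type": "ItemRemoved", "item": "book"},
]

def materialize_table(events):
    """A 'relational' view: current contents per cart, derived by replaying."""
    carts = {}
    for e in events:
        items = carts.setdefault(e["key"], [])
        if e["type"] == "ItemAdded":
            items.append(e["item"])
        elif e["type"] == "ItemRemoved" and e["item"] in items:
            items.remove(e["item"])
    return carts

def materialize_search_index(events):
    """A different projection of the same log: item -> carts that added it."""
    index = {}
    for e in events:
        if e["type"] == "ItemAdded":
            index.setdefault(e["item"], set()).add(e["key"])
    return index

print(materialize_table(log))         # {'cart-1': ['lamp'], 'cart-2': ['pen']}
print(materialize_search_index(log))  # book/pen/lamp mapped to their carts
```

Each view is disposable and rebuildable from the log, which is why one data product can offer a stream port, a table port, and a search port over the same underlying events.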
Yeah, with just a little bit of difference. I love event-driven architectures, and that's where a lot of the inspiration for microservices, and a lot of the inspiration for data mesh, came from. That's the world I came from. But we have to be cognizant of the analytical users' real needs and the types of queries they run.
They run queries over long periods of time, or over large volumes of data. So once this data is distributed, how can we still have a system that is optimized for people (organizationally, it doesn't have those synchronization points) but also optimized for systems and computers, so you're not moving data around between so many different places? So even though we want this distribution of data ownership, at the technical and physical level we've got to build solutions that lend themselves to some sort of federated query that is still optimized.
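One way to picture a federated query that stays optimized is scatter-gather over partial aggregates: push the computation to where each domain's data lives, and move only tiny partial results instead of raw rows. The rough sketch below uses in-memory lists to stand in for domain-owned data stores; the domain names and functions are invented for illustration only.

```python
# Raw rows stay where they live; each "domain" only ships a partial aggregate.
DOMAINS = {
    "orders-eu":   [12.0, 8.0, 10.0],
    "orders-us":   [7.0, 9.0],
    "orders-apac": [11.0],
}

def local_aggregate(values):
    """Runs next to the data: reduces N rows to one (sum, count) pair."""
    return sum(values), len(values)

def federated_mean(domains):
    """The coordinator combines small partial results, not the raw data."""
    partials = [local_aggregate(v) for v in domains.values()]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(federated_mean(DOMAINS))  # 9.5, computed from three tiny partials
```

The global mean is exact, yet only two numbers per domain crossed the "network," which is the optimization being asked for at the physical level.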
So as this gets productized over the next five years (and maybe there are things going on right now that I don't know about that make five years sound like a long time), it just kind of feels to me like the sort of thing that-
Five years from now there'll be some stuff with traction and it'll still be immature compared to the last 30 years of data warehousing, but it'll be a thing-
I hope so [inaudible 00:30:37].
It'll start answering those questions about how we accomplish federation in a way that respects the expense of moving bits around, and the, not the long tail, but the head of the kinds of access modes that systems and enterprises [inaudible 00:30:57].
And I think we are being forced in that direction anyway, Tim. Organizations are becoming [inaudible 00:31:02] and complex. They've got three different cloud providers. Data is in so many different places. And this idea of just keeping on moving your data into one centralized place so that you can use it... they don't like that. They don't want to do that. So I think we have just organically been pushed to the position where we have to admit that data should be used wherever it is, by whomever it's owned.
But we need to create that internet-like access for analytical purposes, for these large-scale query and training purposes, with this distribution embraced wholeheartedly rather than fought. Right now, enterprises on one hand realize that's the reality. I talk to a lot of CDOs and CTOs, and they put RFPs out there to the vendors saying, look, I just want to bring processing to where my data is.
Don't tell me to put my data into some sort of lake or warehouse. But we're still kind of fighting it. Of course there are some solutions that support that, but I don't think as an industry we've said: this is the starting point. If you are building a solution that's going to serve us for the next 10 or 20 years, the default place to start is that data is ubiquitous. It can be anywhere.
And the workloads are analytical workloads: large volume, high velocity, streaming or batch, it doesn't matter, some spectrum of that. And the world is complex and keeps changing. So let's embrace that as a reality and stop fighting it with canonical schemas or a warehouse or a lake. I think the solution emerges once we accept the reality.
My guest today has been Zhamak Dehghani. Zhamak, thanks so much for being a part of Streaming Audio.
Thank you [inaudible 00:33:09].
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer.confluent.io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and the core Kafka APIs.
There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage. Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter.
That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening or reach out in our community slack or forum, both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcast, be sure to leave us a review there. That helps other people discover us, which we think is a good thing. So thanks for your support. And we'll see you next time.
The data mesh architectural paradigm shift is all about moving analytical data away from a monolithic data warehouse or data lake into a distributed architecture—allowing data to be shared for analytical purposes in real time, right at the point of origin. The idea of data mesh was introduced by Zhamak Dehghani (Director of Emerging Technologies, Thoughtworks) in 2019. Here, she provides an introduction to data mesh and the fundamental problems that it’s trying to solve.
Zhamak describes how both the complexity of, and the ambition for, using data have grown in today's industry. But what is data mesh? For over half a century, we've been trying to democratize data to deliver value and provide better analytic insights. With the ever-growing number of distributed domain datasets, diverse information arrives in increasing volumes and at high velocity. To remove the friction and serve data consumers across a variety of operational and analytical use cases, the best way is to mesh the data: connecting data in a peer-to-peer fashion and liberating it for analytics, machine learning, serving up data-intensive applications across the organization, and more. Data mesh tackles the deficiencies of the traditional, centralized data lake and data warehouse platform architecture.
The data mesh paradigm is founded on four principles: domain-oriented ownership of data, data as a product, a self-serve data platform, and federated computational governance.
A decentralized, agnostic data structure enables you to synthesize data and innovate. The starting point is embracing the ideology that data can be anywhere. Source-aligned data should serve as a product available for people across the organization to combine, explore, and drive actionable insights. Zhamak and Tim also discuss the next steps we need to take in order to bring data mesh to life at the industry level.
To learn more about the topic, you can visit the all-new Confluent Developer course: Data Mesh 101. Confluent Developer is a single destination with resources to begin your Kafka journey.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.