On today's Streaming Audio, we're going to start with a buzzword and pick it apart to figure out what it really means. Our guest today is Ben Stopford, and we're talking about data meshes. What the heck are they? But before we start, let me tell you that the Streaming Audio podcast is brought to you by Confluent Developer, which is our site that teaches you everything about Kafka, from how to start it running and write your first app to architectural patterns, performance tuning, and maintenance. Check it out at developer.confluent.io. If you want to take one of the hands-on courses that you'll find there, you can easily get Kafka running using Confluent Cloud. If you sign up with the code PODCAST100, we'll give you an extra hundred dollars of free credit to get you started, which at my exchange rate is about 74 quid, which ain't bad. With that, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get started.
Our guest today on Streaming Audio is Ben Stopford, who is the Lead Technologist at Confluent, the author of Designing Event-Driven Systems, and a colleague of mine. So it's nice to have you back on the show, Ben.
Great to be here, Kris.
Our topic on the table today is data mesh, which sounds like a bit of a buzzword, but I'm promised there's a hard meaning to it, and you're going to take us through it. Let's start with a definition. What is a data mesh?
That's a very good question. It is a set of principles, and you can use them to guide a certain type of architecture, which in this case is a data architecture. It's a little bit like microservices in that regard. It's a...
How is that?
Well, in that it's a way of building a certain style of application. So much like you can use microservices, you can choose not to use microservices. You can use a whole bunch of other things: a three-tier architecture, an N-tier architecture. You can just build a simple, dumb monolith if you want to. Often, those are the right choice, right? Monoliths are a very valid pattern for building a system. Likewise, you might want to use a data mesh. So I guess a good way to think about a data mesh, before we try and define it, because the definition is quite long-winded, is really what it is not. A data mesh is not your standard data architecture where you have a bunch of systems at the front, and you have a big data warehouse at the back, and all the data feeds into the data warehouse. That's the pattern that we're trying to get away from.
So, in the microservices analogy, the data warehouse pattern is the monolith, right? There's nothing wrong with monoliths, and there's nothing wrong with data warehouses for certain use cases; they just come with a set of trade-offs. So, for certain types of organization, you will potentially get some preferable properties if you take the data mesh approach. Basically, in a nutshell, if you think about a data warehouse approach, it's lots and lots of systems at the front, all pointing to one big data warehouse at the back. In a data mesh, you've got a many-to-many relationship. So it's actually a lot more like services. Instead of having services, you have data products, each built like a service that gives you data. So you've got a bunch of data products around the organization, and they all provide self-service data.
Let's say you want to... I don't know, you're doing some kind of fraud detection. You have this kind of menu of different data products, all providing historical data. It's all curated by the people that know that data, and you can basically just subscribe to those datasets. You can rewind, replay, get the dataset, and maybe inside your database, you can do your fraud detection, or plug it into your application and do it in real time. So that's the concept of a data mesh. It's data on demand, directly from source. But this way, that source is lots of different places, and you are actually one of many consumers. So rather than the data warehouse, where it's a bunch of applications going to one data warehouse, this is actually a bunch of source data products, all feeding data to a bunch of other consuming data products, and there's, effectively, a web of interactions.
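To make the subscribe-rewind-replay idea concrete, here's a toy, in-memory sketch in Python. Nothing here is Kafka's or Confluent's actual API; the `DataProduct` class and its methods are invented purely for illustration.

```python
class DataProduct:
    """Toy data product: an append-only event log whose full history
    stays available for consumers to rewind and replay on demand."""

    def __init__(self, name):
        self.name = name
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def replay(self, from_offset=0):
        # A consumer can start from the beginning or from any offset.
        yield from self._events[from_offset:]

# A fraud-detection consumer subscribes and replays the whole history,
# keeping only the records it cares about.
payments = DataProduct("payments")
payments.publish({"id": 1, "amount": 30})
payments.publish({"id": 2, "amount": 9500})

suspicious = [e for e in payments.replay() if e["amount"] > 1000]
```

A real mesh would back this with a retained Kafka topic rather than a Python list; the shape of the interaction, subscribe, rewind, filter locally, is the point.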
You don't have to build these with event streams, but event streams are by far the most common way of doing it. I think the inception case, the original data mesh, was based on event streams, as I understand it. So there are other approaches, but definitely, the most common way of doing it is to use event streaming, because it obviously takes out all of the complexity of those many point-to-point connections that you would have otherwise and just gives you streams of data on demand.
Okay. Let me risk missing the point in the hope that helps clarify it.
Mm-hmm (affirmative). Yeah.
Imagine I've got a big data warehouse, and it's a bunch of tables in a big, whole relational database. I wake up one morning, and I put nearly every table in a separate database. Right?
So now I've got a whole bunch of different databases, each serving up one set of data.
Have I got a data mesh now? Am I happier?
No, no, no. Well, you might. You might call it a data mesh. It is a bit like microservices: if I break my application up into lots of different bits, is that microservices, or is microservices actually something more than that?
Why am I breaking things up? How is it helping me? That's what I'm trying to get at.
Yeah, that's a good point. So this is where the principles come in. The first two principles are effectively sociotechnical. They're about domain ownership: you should get counterparty information or customer information from the system that originates it. That way, it's most likely not to be broken, basically.
So if there's a data feed team in my bank, I should be getting my data feed data from them, not from the data warehouse team?
Yeah, or the finance team that happened to be your mates, and they've got the stuff, and they'll give it to you. So Chinese whispers, or the game of telephone as they say stateside; those kinds of problems. Yeah, so that's the first one. The second one is data as a product. That's very closely related, in my mind, to domain ownership. If you want the feeds team, or whoever it might be, to provide accurate data to the rest of the organization, one of the things they have to do is make sure that data is really consumable. So it's got to be of decent quality. It can't be that every 50th message has a random control character stuck in there to make your system blow up, et cetera, et cetera. It's really about making every system that curates data in any form and disseminates it to the organization feel responsible for that role. So it's like making it a primary role.
The best way to think about this is actually via the anti-pattern. In the anti-pattern, you're a trading system. Maybe you have some data, some trade data, which you give to the rest of the organization. Maybe that's some flat file feed that was written by a grad six years ago. He's moved on. No one really knows how the code works. You could change it, but you're probably not going to, unless you're absolutely made to do it. You just don't really see it as part of your job so much. Your primary reason for existing is to be a trading system and help these traders, not to disseminate information to the organization. So data as a product is about trying to reshape those priorities and say, "Look, being a data product in an organization these days is so important to the overall operation of the organization." Actually, what the organization spends on its data tends to be huge these days. So being able to disseminate high-quality data from source is really advantageous.
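As a sketch of what "feeling responsible for the data you disseminate" can mean in code, here's a hypothetical producer-side check that rejects records containing stray control characters before they ever leave the data product, so consumers never hit the "every 50th message blows up my system" problem. The function and regex are invented for illustration, not part of any real feed framework.

```python
import re

# Control characters (excluding tab/newline/CR) that could corrupt a feed.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def validate_at_source(record: str) -> str:
    """Validate a record before publishing it to the organization,
    rather than leaving every consumer to clean it up downstream."""
    if CONTROL_CHARS.search(record):
        raise ValueError("record contains control characters")
    return record

clean = validate_at_source("trade,GBPUSD,1.27")

try:
    validate_at_source("trade,GBPUSD,\x07oops")
    rejected = False
except ValueError:
    rejected = True  # the bad record never reaches consumers
```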
I think that's what differentiates this from the data warehousing model, where you tend to do the cleansing inside the data warehouse. That's one anti-pattern that you get out of it: sometimes the data warehousing team don't really understand the data that well. It depends how complicated the data is. If it's really simple data, it might be all right. If it's really complicated, nuanced trade data or something, it's often a complete nightmare, but the...
Then, the second part is you don't have this single point of coupling. With data warehouses, you tend to end up with lots of different people trying to mash around, trying to get data out of this single data schema. So you as a user of the data don't have much control, and that's a double-edged sword. It can be good in some cases, because you just go to the data warehouse, and you don't have to worry about it too much. But it can be bad if you do actually want that extra level of control, which you often want in larger organizations. So, again, it's horses for courses. Both models have pros and cons.
That does seem...
So those are the two core ones. There are another two principles as well, which make the whole thing work. Those first two, I'd call the two foundational principles. I won't talk about the...
Those two seem cultural, more about how you approach the problem rather than technology per se.
Yeah. So I think a term that is sometimes used is "sociotechnical concern" to say it's a mix of social and technical elements. Whereas the second two principles are pretty much just technical things effectively.
They're actually easier to explain as well. One is making data self-service. Really, that means if I want all of the customer information, I need to be able to get access to it immediately. I don't have to ask anyone. I don't have to raise a request; maybe you have to ask permission to see the data or something, but basically, it's all automated for me, and I can get hold of, or access to, the whole dataset. The second one is governance, which is a bit like adding unit testing in agile. You don't have to do it, but if you don't, you're probably going to make a mess of it. If you add it, you've got a much better chance of succeeding.
Right. Okay. So what is it? What's the governance piece?
Governance is actually... Yeah, you could do a whole podcast on that, but effectively, it's a number of tools which help you, in an often imperfect way, to better manage data in the organization. Things that governance tools tend to do include answering questions like, "How do I find this piece of data?" Or, "There's a problem with this piece of data; where did it come from? How can I trace it back to its original source?"
"Is data different in different places? Can you resolve those kinds of differences?" So a lot of it is really about discovery of information and figuring out where information came from. Those are probably the two primary concerns. There are some other things we do with master data management and stuff like that. But for me, those are the two that really make a big difference because they make, effectively, these data so it's visible. So, in a data mesh, we... In a data warehouse, you don't necessarily need these as much. They're still useful, but you don't need them as much.
In a data mesh, because you've got this web of interactions, suddenly, your life got much harder. It's a bit like agile: in agile, because you're moving faster, you need the testing. If you haven't got the testing, if you do XP and forget to do the testing, then you tend to make a mess, because you're moving really fast, and you break a lot of stuff, and you don't know you broke it, so your software never works. Whereas if you've got the testing, you're probably okay. It's a similar thing here. If you build a data mesh, because of the many different connections that you end up having, if you haven't got the governance, you're just much more likely to make a mess of things.
Is this part of like... I was thinking agile. What you're really doing is tightening feedback loops and installing good feedback loops.
Yeah, yeah. Exactly. Yeah.
Exactly. Is the governance part supposed to be self-service?
[crosstalk 00:13:16]. Yeah, so the self-service thing. Sorry, the self-service thing helps with the feedback loop, but then you often need other pieces. Change management is another part of it, which sounds boring, but again, it's really important. If I want to change this field in a non-backwardly-compatible way, how do I do it without it becoming some project where I have to hire five people to go and execute on it, because we don't know who consumes the data, how we're going to notify them, and ask them if it's okay? Are they actually going to change their side of things at any point soon? Who's going to chase that down? So, just from a practical sense, it's the operation of a large distributed data system, which is what a company ends up being. Doing that well is actually quite complicated, and you need tools and processes to make it work well.
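A minimal sketch of the kind of check that change management leans on: given two versions of a schema (modeled here as plain field-to-type dicts, purely for illustration), flag changes that would break existing consumers. This is a deliberately crude stand-in; a real schema registry, such as Confluent Schema Registry, applies much richer compatibility rules.

```python
def is_backward_compatible(old_schema, new_schema):
    """Crude backward-compatibility check: existing consumers break
    if a field they rely on is removed or changes type."""
    for field, ftype in old_schema.items():
        if field not in new_schema:
            return False, f"field '{field}' was removed"
        if new_schema[field] != ftype:
            return False, f"field '{field}' changed type"
    return True, "ok"

v1 = {"customer_id": "long", "email": "string"}
v2 = {"customer_id": "long", "email": "string", "region": "string"}
v3 = {"customer_id": "string", "email": "string"}

ok_add, _ = is_backward_compatible(v1, v2)        # additive change: fine
ok_retype, reason = is_backward_compatible(v1, v3)  # retyped field: breaks
```

Running a check like this before publishing a new schema version is what lets you evolve a field without the "hire five people to chase down every consumer" project.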
Yeah. Yeah, that makes sense. You know what this reminds me of? It's like the early days of the web, right? You have all these startups starting up, building their products, and a lot of them started to realize that if they wanted to grow adoption and drive new use cases, they needed to make their API more than an afterthought, but actually a central part of what they're delivering to people.
The API is as important, if not more important than the front-end that the users use.
Is it fair to say it's that kind of driver, but within a large organization?
No, I think it's just that kind of driver for data, is the way I'd describe it. It's exactly the same thing. I mean, what you're providing is an API for data. As a system that produces data, if you're disseminating it to the rest of the organization, then you are providing an API, and it's one of the most important interfaces that you have, because arguably, your role of disseminating that data to the rest of the organization could actually become more important than whatever it is your system does, if you think about it holistically. Maybe not just in terms of what your boss thinks you should do, as it were, but in a holistic, organizational sense. Often, your role is more important in that regard.
There must also be a kind of discovery piece if you want, say, a completely... Ideally, a department you haven't thought about will come and find your data and use it.
In a new way that you hadn't thought about.
Yeah. Matter of fact, you might not even know that they exist. Ideally, you won't even know that they exist. They just come and use the data, and you don't know anything about it because you're just providing good data and lots of people use it.
So is there an inherent catalog idea or something? Is there anything in data mesh that specifies discovery?
Yeah, absolutely. Yeah. I mean, that sits under the governance piece. So if you build a data mesh using Confluent technologies, Confluent Cloud is obviously the best place to do this. It has a bunch of features that make it easy. It's got infinite storage, which means that you can do the self-service data, so you can cache data in Kafka. You don't have to do it that way; you can cache it in a database in each service if you like and have some replay mechanism. But the easiest way is to cache it in Kafka, which Confluent Cloud will do efficiently, storing it at a low cost for you.
Then, you've got the discovery piece, which is what you asked about, which is covered by, basically, our data governance features. A concrete example of what those allow you to do: if you use event streams to build a data mesh, the simplest approach is that each data product is represented as an event stream. It's just the simplest way to do it. So if I have customer data, I would have a customer event stream in Confluent Cloud, with maybe infinite storage enabled, so you can store the whole dataset, and then you can provide that data to anyone. So how do they get hold of it?
Well, you would tag that data, that topic, that event stream in Confluent Cloud as being customer data, and you actually also might tag it with other things, like several sub-elements, other datasets that it might contain, so like secondary dimensions. If you're tagging data in the cloud product, you can do it along different dimensions, and you also have the ability to search that pretty easily. You just put it into the search box and search away, and you'll be able to find this tagged information. Then, you can basically pull up a lineage graph, which will show you how that data flows inside the mesh. So you can see, for example, if it's recombining with something in a derivative data product. But the easiest way to do a data mesh is actually not to have derivative data products; you keep the architecture as simple as you can. It's like the Law of Demeter...
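As a toy illustration of the tagging-and-search idea, assuming nothing about Confluent Cloud's real implementation, here's a tiny in-memory catalog that maps tags to event streams and supports lookup by tag. The class and stream names are all invented:

```python
class TagCatalog:
    """Toy discovery catalog: maps governance tags to event streams."""

    def __init__(self):
        self._streams_by_tag = {}

    def tag(self, stream, *labels):
        # One stream can carry several tags (primary and secondary dimensions).
        for label in labels:
            self._streams_by_tag.setdefault(label, set()).add(stream)

    def search(self, label):
        return sorted(self._streams_by_tag.get(label, set()))

catalog = TagCatalog()
catalog.tag("customers-v1", "customer-data", "pii")
catalog.tag("orders-v1", "order-data", "customer-data")

# A consumer searches by tag instead of knowing topic names up front.
customer_streams = catalog.search("customer-data")
```

The design choice this mirrors is that discovery is metadata-driven: a brand-new department can find every stream carrying customer data without asking the owning teams first.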
So you stick with the primary sources?
Yeah. It's like the Law of Demeter for the data, if you've ever come across the Law of Demeter.
I can't remember it. Remind me.
Basically, it's about chained calls. You don't have "foo.bar.whatever", those very long call chains, because if you have a very long chain, like eight different object calls, what you're really doing is creating a very tight coupling down to that method. You're better off having fewer, and it's the same thing with data coupling. So, yeah, decouple things more: the fewer interconnections you have, the less coupling you have overall, and the less complex that coupling is.
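For anyone who hasn't met the Law of Demeter, here's the classic shape of the problem in a few lines of Python. The classes are invented for illustration; the point is that the long chain couples the caller to every object along the way, while delegation gives it a single point of contact.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance

class Customer:
    def __init__(self, account):
        self.account = account

class Order:
    def __init__(self, customer):
        self.customer = customer

    def customer_balance(self):
        # Delegation: callers depend only on Order's interface.
        return self.customer.account.balance

order = Order(Customer(Account(100)))

# Demeter violation: the caller reaches three objects deep, coupling
# itself to the internals of Order, Customer, and Account at once.
coupled = order.customer.account.balance

# Demeter-friendly: one call on the object you actually hold.
decoupled = order.customer_balance()
```

The data mesh analogue is the same trade: a consumer that reads from a chain of derivative data products is coupled to every hop, whereas reading from the source product keeps the coupling to one contract.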
Okay. So I think we should pin down exactly how this relates to event streaming, because you've hinted at it. Is there something about event streaming that is well-suited to data meshes, or something about data meshes that is well-suited to event streaming? Or is this an any-database, any-architecture kind of approach?
No. I think there aren't that many ways to do a data mesh, and event streaming is definitely the primary one. No doubt about that. I mean, you've got a many-to-many architecture, so you need... The primary functional principle that a data mesh has to provide is that you need to be able to get access to a particular dataset, but you also need to be able to mash that up with other datasets, and those will all, by definition, be in different parts of the organization, because they'll be in different data products across the mesh.
So the easiest way to do that is just to subscribe to those event streams, pull them together into your database, and join them there, or use a stream processor and join them in the stream processor. In which case, you can then use that data directly to trigger an application to respond, or obviously, you could put it in a database if you wanted to. In the data mesh world, it tends to be more about collecting datasets together, and the reason it's different from a data warehouse is that most of the consuming applications, the databases making use of the datasets in the mesh, don't need all the data that the organization has. They have some small subset of data that they're actually interested in, and because the mesh provides them data on demand, they don't feel responsible for taking all of the data.
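Here's a minimal sketch, in plain Python rather than an actual stream processor, of the key-based join being described. The stream contents and field names are invented for illustration; ksqlDB or Kafka Streams would express the same thing declaratively and keep it continuously updated.

```python
def join_streams(left, right, key):
    """A naive inner join of two event streams on a shared key,
    roughly what a stream processor materializes for you."""
    right_by_key = {event[key]: event for event in right}
    for event in left:
        match = right_by_key.get(event[key])
        if match is not None:
            # Merge the two records; right-hand fields win on clashes.
            yield {**event, **match}

# Two datasets pulled from different data products, mashed up locally.
payments = [{"customer_id": 7, "amount": 120},
            {"customer_id": 8, "amount": 45}]
customers = [{"customer_id": 7, "name": "Ada"}]

enriched = list(join_streams(payments, customers, "customer_id"))
# Only customer 7 has a match, so one enriched record comes out.
```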
So that point is worth drilling into, I think, because it's a little bit subtle, but if you think about it in the traditional world of flat file transfers or enterprise messaging, the first thing you do when you get hold of a bit of data is you write it down in a database because you might never see it again because it's ephemeral. So you write it down, and then you have to maintain that dataset. So, basically, what happens is everyone builds up these big historical copies of data, which they have to maintain over time. It's actually pretty tricky when you've seen this change and so forth. So it actually has a fair amount of cost to it.
Often, there'll be a whole bunch of that data. Maybe you're interested in, I don't know, customer information, but really, you only use the name and the ID. You'll take all of the information anyway, though, because you may never see it again, and it'll be a pain to get. But in a data mesh, you don't have that problem. It's quite easy to get hold of the data. You can actually repopulate it. So if you just need the customer name and the ID, you're just going to take the customer name and the ID, which actually saves you a lot of pain in the long run as datasets change: you don't have to manage the life span, the life cycle, of all that extra data and so forth.
Yeah. You know what it reminds me of? We had Gerard Klijs on the show a couple of weeks back talking about GraphQL.
And how you can build up queries that just take the fields you need.
As a supplier of that, you can audit which fields are actually in use and which fields matter to your users. So I can see that fitting in well.
Yeah. I think it's a similar concept, and there isn't really... As it stands today, there's no way that you can audit that. It's a good product idea though. Maybe we should build that or somebody else out there should build that. I think that sounds...
Streaming Audio, the podcast, where we do product design before your eyes.
Make your million or build the open source product that's going to change the world.
Okay. So I can definitely see a synergy there, but just one last thing on the whole event streaming thing. Do you think it's tightly bound to the idea of real-time data streams?
Yeah. I mean...
Does it help?
Yeah. I mean, you could do it in batch, but is it really self-service if you've got to wait for it until the end of the day? I don't think that's really self-service. That's a bit like you walk into McDonald's, and you go up, and you press the thing, and you see the burger and so forth, and then you have to come back at dinnertime. It's not really going to fly, is it? Not if you're ordering a, whatever, Chicken McMuffin. What do they call them?
It's been a while since you've been to McDonald's, clearly.
Yeah. I probably haven't been to McDonald's for a while. It's true, but that's more to do with COVID than anything else.
Oh, that's fair. That's fair. Okay then. So you're in an organization. You decide that there are going to be advantages to setting up this mesh and taking responsibility for your feeds. How on earth do you get started if you're just a bank?
Yeah. That's a good question. So, I mean, firstly, you work out whether or not you really want one, but I think you should always do that. You should always evaluate the pros and cons of these things. You don't necessarily have to take all of it.
But a great place to start is, yeah, obviously, reading the materials that are available today. Zhamak, who's the lady from Thoughtworks who proposed this set of principles, is writing a book on it, which I've read a little bit of. It seems pretty good. We've also got some materials that we've written. Adam Bellemare, who's the guy who wrote the Event-Driven Microservices book, did a blog post recently on it, and that has a pretty cool demo which gives you a sample. It sits above Confluent Cloud, running on a server on the internet. So you can literally just go and play with it, and that provides you with a basic, simplistic data mesh, and it demonstrates some of those core principles.
So, in that, you can do things like you can expose... You can pretend to be a source system and expose a data product as an event stream. You can look at the lineage of data. You can basically combine different data products together to create a derivative data product using a stream processor using KSQL running on Confluent Cloud, and you can manage data product life cycle, and then obviously, search. If you're tagging different data products, you can search all of those kind of things. So it provides a wrapper over Confluent Cloud. It has some additional functionality, and that's something that you can... It's all open source, so you can take that. You can fork it. If you can figure out the programming elements, then you can maybe skin it to your own organization and use that to drive a project. But hopefully, it gives you that first step up. I think that's a really good starting point.
Then, yeah, other than that, a lot of materials are coming out. We also have a course on Confluent Developer. So if you haven't come across Confluent Developer, it's a fantastic resource for everything to do with event streaming. It's all free. It's got, I think, nine courses on there and a bunch of other things, like core resources, animations, event streaming design patterns, and lots more. So there's a course on there on data mesh, which goes through a lot of the principles and also has an exercise which uses this demo, so you're building a little data mesh for your organization right there online in Confluent Cloud, which is pretty cool.
There is a hosted version of that demo if you just want to quickly go and kick the tires.
That's right. Yeah.
Yeah. I've looked at that, and one thing I really liked about it is this idea of... Okay. So you've talked about not building chains of derivative data products across the organization, right? You don't want this Law of Demeter problem. But within your own domain, there's nothing wrong with taking your raw data feeds and packaging them up for publishing.
If you are a price feed system, you might take a price feed stream and some analysis stream, and publish them as one thing for the rest of the organization, right?
Yeah. Yeah, I think.
So you're publishing your own complex queries for the sake of the organization.
Yeah. Yeah. I think that's a pretty good way of... a good approach to doing it. I will say, in practice, data mesh is quite new. So when we say "in practice," what we're really thinking of is systems which are data meshes, or have a lot of the data mesh properties, applied retroactively, because the name is newer than a lot of the systems, but...
First, you get working, and then you name it, right?
Yeah, and the same with microservices, right? We were building microservices in the mid-2000s. We just didn't call them that. At Thoughtworks, we called them fine-grained SOA. That's what we were calling them, and it was a good five years before they became microservices. By the time they became microservices, we'd also worked out that there were a few things you didn't want to do, like you didn't want to share a database, because it causes lots of problems. We actually went through all those problems, and it's the same thing, I think, with data mesh. We've been building event streaming systems, and actually messaging systems before that, for a long time, and struggling with a lot of the difficulties that come with them.
A lot of these principles, like self-service data, and data lineage, and treating data as a product, these are things that we've explored. The one that relates to the point you were mentioning there, the derivative data products, the interesting trade-off here is this dumb-pipes, smart-endpoints thing, or to put it another way, the ESB approach. So, again, this is all quite subjective, but... Yeah. One of the problems that we saw with ESBs is they evolved to a position where they started to embed a lot of logic inside the ESB itself, which is very attractive. Right? It's attractive from a software vendor perspective. It's attractive from the perspective of pointy-headed architects inside some big corporation: "I'm going to control everything using these validation rules inside the central system."
In practice, it actually makes it really hard to do stuff, and data mesh is trying not to do that. Right? It's actually trying to say, "Look, data should be freely available." So when we're talking about adding that kind of functionality, we want to be careful that it's the sort of event streaming stuff that we use to bind datasets together. That tends to be pretty simplistic, for example, joining on a primary key. It's not too business-centric. If you did start applying proper business logic inside those data products, that's where you'd tend to have problems. So anything that's business-specific should, ideally, be owned at source, in an additional data product owned by that source domain. That's the typical approach.
Right. Okay. You said something in there I can't let slide, which was, "In microservices, we found out that sharing a single database is a bad idea."
But why is it that in a data mesh, sharing a single Kafka cluster would be okay, or would it?
Yeah. Yeah. Can you build a data mesh with one cluster? Well, I think probably... It is a bit of an aside, but one of the really nice things about cloud, one of my favorite things about cloud, is that this idea of how many clusters you have goes away, or is certainly going away. So, right now, you can create a cluster instantaneously, basically for nothing, and it's pay-per-use. In that model, you actually think less about clusters; you just think about topics, and a topic could be in one cluster or another cluster, and you don't really care so much. With things like Cluster Linking, which is a technology that joins these actual clusters together in a way that doesn't involve a proxy, so they're just talking to each other, it actually makes that pretty efficient.
So whilst there's still actually work to do there, there's a notable move away from thinking in clusters and towards thinking just literally in topics: "I'll have as many of them as I want." That's really important because if you think about... Another really important principle is this data-on-the-inside, data-on-the-outside idea, which Pat Helland introduced a really, really long time ago. The original paper was from the early 2000s, so it was written in the time of XML, but basically, it was saying that data that you have in your database is very different to data that you share with someone else, and the reason is really simple. If you have data in your database, you can do whatever you want with it. It's really easy. If you share data with somebody else, suddenly, it's locked down, and it's really hard to change, and that...
Minimize the contract.
Yeah, the dynamic of that contract is really different. So that's why, and that is at the heart of your question. I know it's been a while since your question, but I think your question was, "Why do I have this problem with microservices?" Well, the reason is because you're wrapping them in an API, and the API constricts what you can do hugely. Right? It hugely constricts what you can do and says, "Well, all you can do is these various different methods." That's very, very different to exposing a database and saying, "Hey, look. There's my Postgres instance. Off you go," because a database has an amplifying contract. It has a declarative language, which you use to talk to it, which actually amplifies the coupling to that thing. You can not only ask for the data inside that database, but you can ask for it in all sorts of different ways, including maybe even creating your own data, right? It's an amplifying contract.
Whereas a service provides a restrictive contract. It's being very, very specific about what datasets, and the protocol in which, you're going to interact with that data. So even if it's a data service, they're two very different extremes. The data mesh approach takes a third approach, which is to say there is no protocol. You're just getting raw data. You can have all the data, and you can have that data updated live. It has a schema. You get it back in your database, so you have full control. That's the crux of the whole thing. It's all about who has control over the data, and there's a tension in these systems between retaining control over your data on one side and making sure that it doesn't diverge or become difficult to manage long-term on the other. You're always balancing the trade-off between those two elements.
Right. So I'm finding a way, as the department publishing some data, to retain control over the shape of that data whilst also giving you a lot of freedom about what you do with it later?
Yeah, and you're doing that through a contract that looks a lot like an API, but it doesn't really have any interaction complexity. It's just a data stream. So a data stream is really a type of API where the contract of the API is just the schema of the data stream.
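To make that idea concrete, here's a minimal sketch of a schema acting as the whole contract of a data stream. The "orders" data product, its field names, and the validation helper are all hypothetical, invented for illustration; in practice this role is typically played by an Avro or JSON schema held in a schema registry.

```python
# Sketch: a data stream's "API contract" is just the schema of its records.
# Hypothetical Avro-style schema for an imagined "orders" data product.
ORDERS_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string"},
    ],
}

# Map schema type names to Python types for the check below.
TYPE_CHECKS = {"string": str, "double": float, "long": int}

def conforms(record: dict, schema: dict) -> bool:
    """Return True if a record satisfies the stream's schema contract."""
    for field in schema["fields"]:
        value = record.get(field["name"])
        if not isinstance(value, TYPE_CHECKS[field["type"]]):
            return False
    return True

# A consumer can do anything it likes with conforming records; the
# producer promises only the shape, not an interaction protocol.
print(conforms({"order_id": "o-1", "amount": 9.99, "currency": "GBP"},
               ORDERS_SCHEMA))
```

The point of the sketch is the asymmetry Ben describes: there are no methods to call, only a shape to agree on, so the consumer retains full control over what it does with the data.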
Yeah. Okay, so this leads to... Once you're there, once you're in this place where you're streaming out data for other people to consume in their own ways, maybe you've made that organizational change, how do operations change when you're living in that kind of world?
Yeah. I mean, operationally, the nice thing about that approach is you have a huge amount of freedom. You can choose to be really simplistic. You can treat a data mesh as a mechanism for getting batched data if you want to; you'd have to convert it, but you can literally do old-school patterns and more contemporary patterns at the same time. But operationally, if you're just moving data between different datasets, it's actually pretty similar. The main problem is that you can't go out and buy a data mesh, and there are two reasons for that. One is, as we discussed earlier, a bunch of it is sociotechnical stuff, right? So you just can't buy that, period. You could probably buy some consultants from Thoughtworks, I guess, who would come in and tell you how to become data-mesh-y, but you can't go and buy that...
You can't buy two packets of culture.
Yeah. Well, genuinely, people would go in and sell a block of agile. It was amazing, and that's a completely different story. But you do have this element of the sociotechnical stuff, which you can't really do much about, and you do also need the technical part. As I said, Confluent Cloud is a great example. You can build a data mesh with Confluent Cloud. It has all the constituent parts, all of the parts of that meal, and what we're providing in those blog posts and so forth is a recipe that helps you pull a lot of them together, but it's not a shrink-wrapped data mesh product. There is a difference between those two things. Today, you still need to do a bit of work to piece these things together. We did some of that in that prototype application, which I think does a really good job of laying that foundation. But still, today, it requires a certain amount of custom coding.
Now, the reason that that is not so bad is that, generally, most organizations' mileage is going to vary a bit in some way. Right? Spending some time investing in it, and building that layer on top, lets you customize it to your specific requirements, because you're going to have different datasets. You're going to have weird legacy systems, which are going to be harder to integrate. Whilst the nice, perfect picture that we paint when we talk about these things looks very appealing and very simple, just like with microservices, the reality is always much messier. As a result, it's the things that break the rules that tend to be the hardest parts to integrate, rather than the happy path.
I'm just wondering. The hardest part about shifting that is shifting culture, I think. Technology is hard to shift, but people are even harder. Right? So what's the minimum you need? It feels like you need one department that says, "We're going to start publishing our data as a product."
And another department that's agreed to be the first people consuming it.
Yeah. I mean...
That feels like the bare minimum to get the ball rolling.
Yeah. I mean, that's the bare minimum. You can add another department and get the ball rolling from there. But I actually think it has to be top-down. I think that's the most successful route. It might not be trendy, but it's definitely the best way to do it, and that's actually cultural. So data as a product is the best example, right? The data-as-a-product thing is cultural. If I was going to give an organization advice on how to do this, it's like, "You need to get your CEO standing up in front of the organization and talking about data products and how important it is to be data-centric, and what it means."
I don't know, maybe you make them go on training courses and so forth. Whatever works for you, but you do have to get it into the culture in some way because it is a cultural change. Maybe it's in your values or whatever, but that is probably the only way it's actually going to work in practice. Yeah. I mean, you see this in a bunch of different environments: if you want the associated technical property to work, whether it's agility or this type of approach to data, you have to have that cultural shift sitting behind it.
Okay. Well, hopefully, we've done a bit to spread the mindset that will help that adoption and shift.
Yeah. No, it's good. I mean, I think...
So a final point I would add: whether or not data mesh is the perfect solution, I think that's still up for grabs. We haven't decided that yet. It's still too nascent. There's definitely some good stuff in there, but maybe the most valuable thing that I've seen come out of the data mesh conversation is actually the conversation itself.
So if you think about software engineering, we have spent so much time analyzing the way that we build code: the relationships between different modules, the differences between functional programming, imperative programming, and object orientation. There's a wealth of material out there. As an industry, we've just thought really hard about that problem. I feel that we haven't done anywhere near as much thinking or useful experimentation on the data side of things. I think that's actually much more nascent.
It is probably the first thing that I've seen that really challenges the standard data warehouse architecture and the organizational constructs that grow up around it, even though we've known for many years that it has problems. I mean, the data lake thing had a go at it, but that was maybe not that productive; well, it kind of worked. So, for me, a big part of it is actually the conversation and the thinking. There are a lot of people thinking about this and trying to work out better ways to tackle the problem. I think it's a big unsolved problem in our industry, so I very much welcome that conversation. That's one of the best bits, I think.
Wow. Data, the final frontier.
Yeah. That's one way to put it.
Ben, thank you very much for your time. Thanks for talking to us today.
Yeah. Great seeing you, Kris. Thank you.
No doubt we'll have you on the show again. See you later.
Yeah. Lovely to be here.
Well, that was data mesh. There's a lot to think about there, and clearly, some undiscovered country to explore. We'll have links to the blog post and the demo that Ben mentioned in the show notes if you want to start exploring. Before we go, let me remind you that if you want to learn more about event-driven architectures, we'll teach you everything we know at Confluent Developer. That's developer.confluent.io. If you're a beginner, we have getting started guides. If you're more advanced, there are blog posts, recipes, in-depth courses, architectural patterns, all sorts.
If you take one of our courses, you can follow along easily by setting up a Confluent Cloud account. If you do that, make sure you register with the code PODCAST100, and we'll give you a hundred dollars of extra free credit. If you like today's episode, then please give it a like, or a star, or a share, or whatever buttons you have on that user interface, and let us know you care. If you have a comment or want to get in touch, please do. If you're listening, you'll find our contact details in the show notes. If you're watching, there are links in the description and probably a comment box just down there. With that, it just remains for me to thank Ben Stopford for joining us and you for listening. I've been your host, Kris Jenkins. Catch you next time.
Ben Stopford (Lead Technologist, Office of the CTO, Confluent), author of the book “Designing Event-Driven Systems” and experienced in data infrastructure and distributed data technologies, explains the data mesh paradigm, how it differs from traditional data warehouses and microservices, and how you can get started with data mesh.
Unlike a standard data architecture, data mesh is about moving data away from a monolithic data warehouse and into distributed data systems. Doing so allows data to be made available as a product, which is also one of the four principles of data mesh:

- Domain-oriented, decentralized data ownership
- Data as a product
- Self-serve data platform
- Federated computational governance
These four principles are technology agnostic: they don’t tie you to a particular programming language, to Apache Kafka®, or to any specific database. Data mesh is about building a decentralized architecture that lets you evolve and accommodate real-time data needs, supported by governance tools.
Fundamentally, data mesh is more than a technological shift. It’s a mindset shift that requires cultural adoption of product thinking—treating data as a product instead of as an asset or resource. Data mesh vests ownership of data in the people who create it, with requirements that ensure quality and governance. Because a data mesh consists of a map of interconnections, it’s important to have governance tools in place to identify data sources and provide data discovery capabilities.
There are many ways to implement data mesh, event streaming being one of them. You can ingest data sets from across organizations and sources into your own data system. Then you can use stream processing to trigger an application response to the data set. By representing each data product as a data stream, you can tag it with sub-elements and secondary dimensions to enable data searchability. If you are using a managed service like Confluent Cloud for data mesh, you can visualize how data flows inside the mesh through a stream lineage graph.
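As a rough illustration of the pattern described above, the sketch below consumes a stream of events from a data product and triggers an application response. It is deliberately simplified: in a real deployment this would be a Kafka consumer or a stream-processing job, and the event shape, field names, and threshold here are illustrative assumptions rather than any real data product's schema.

```python
# Simplified in-memory sketch: react to events flowing from a data product.
from typing import Iterable

def process(events: Iterable[dict], threshold: float) -> list[str]:
    """Emit an alert for each payment event over the given threshold."""
    alerts = []
    for event in events:
        # The application response here is just building an alert string;
        # in practice it might call a service or write to another stream.
        if event["type"] == "payment" and event["amount"] > threshold:
            alerts.append(f"review payment {event['id']} ({event['amount']})")
    return alerts

# Hypothetical slice of a payments event stream.
stream = [
    {"id": "p-1", "type": "payment", "amount": 40.0},
    {"id": "p-2", "type": "refund", "amount": 900.0},
    {"id": "p-3", "type": "payment", "amount": 500.0},
]
print(process(stream, threshold=100.0))  # ['review payment p-3 (500.0)']
```

The design point is that the consumer owns this logic entirely; the producing domain only publishes the stream, and any number of consumers can apply their own processing without coordinating with it.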
Ben also discusses the importance of keeping your data architecture as simple as you can, to avoid a proliferation of derivative data products.