Learn how to access your data anywhere you need it, as quickly as possible.
The third principle of data mesh is building a self-service data platform. It concerns a problem you may have faced in your work in the past: how to access the data you need in a reasonable amount of time.
A new trade surveillance system was being built in a large financial institution. This system needed data from thirteen different systems. The company already had a messaging system in place, and many of those source systems published events to it, so getting hold of the real-time event streams was straightforward. But there was no simple way to also get historical data: the messaging system was a queue, so events disappeared once consumed, and a brand-new system needs a full initial load of historical state before it can keep itself up to date from the stream. A few source systems offered a request for back-population, republishing historical events on demand, but most did not. Instead they sent snapshots dumped directly from their databases, often as CSV files, and those snapshots used different schemas and field names than the event streams, which made it very hard to line up the real-time and historical data.
The end result was that just getting data into the database in a logical way took around nine months.
The third principle, self-service data available everywhere, specifically tackles the problems the bank faced: getting data into your product nearly instantaneously, and giving the team a tight feedback loop so they can iterate quickly on any issues that come up.
Unlike the other data mesh principles, the third principle is achieved through a central infrastructure. As a rule, decentralization is preferable in data mesh, but in this case it is quite beneficial to implement a centralized service that lists available event streams and the data that they contain.
This isn't centralizing data ownership, data product ownership, domain expertise, and so on; we know that those things are decentralized. This is just an infrastructure layer. Basically, the system needs to manage both real-time and historical data, and make that data available everywhere within the company.
The process of accessing this company data and getting it into your individual database should be autonomous, or at least automatable. Just as with microservices, the most efficient architectures let you get the data you need and work with it at your own cadence, through this kind of rapid data provisioning.
Consuming data from the data mesh means potentially consuming both real-time and historical data. There are two ways to do this. The first, described in the bank scenario above, is called the Lambda Architecture. It requires you to build two separate systems, one for historical data and one for real-time data.
A better solution is known as the Kappa Architecture. It begins with an event streaming platform that stores the streams indefinitely. Destination data ports, which consume the data, can choose to start with the latest events from the platform, or to consume all historical events first. In Apache Kafka®, you achieve this by doing two things: first, you configure the topic for infinite retention so that it holds the full history; then you reset your consumers back to offset 0 to start at the beginning of the topic. In some cases, you can also use compacted topics as an optimization, to reduce the amount of time required to load the initial snapshot. Compaction keeps only the latest event per key and discards the earlier history, so it's only for cases where you truly don't need that history.
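As a minimal sketch of that pattern using the plain Java clients (the `orders` topic, broker address, and consumer group here are hypothetical), the topic is first configured for infinite retention, and the consumer then rewinds to the start of each assigned partition so it reads the full history before carrying on with live events:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class KappaReplay {

  public static void main(String[] args) throws Exception {
    // 1. Create the topic with infinite retention so it keeps the full history.
    Properties adminProps = new Properties();
    adminProps.put("bootstrap.servers", "localhost:9092");
    try (AdminClient admin = AdminClient.create(adminProps)) {
      NewTopic orders = new NewTopic("orders", 3, (short) 1) // sized for a local test broker
          .configs(Map.of(
              TopicConfig.RETENTION_MS_CONFIG, "-1",       // never expire by time
              TopicConfig.RETENTION_BYTES_CONFIG, "-1"));  // never expire by size
      admin.createTopics(Collections.singleton(orders)).all().get();
    }

    // 2. A destination data port that wants the history plus live events.
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-destination-port");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singleton("orders"), new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
          // Rewind to offset 0 so the initial load covers all historical events.
          consumer.seekToBeginning(partitions);
        }
      });

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // Write each event into the destination store; once the backlog is
          // consumed, the same loop simply keeps up with new real-time events.
          System.out.printf("offset=%d key=%s value=%s%n",
              record.offset(), record.key(), record.value());
        }
      }
    }
  }
}
```

For a brand-new consumer group, setting `auto.offset.reset=earliest` achieves the same initial load without the explicit seek.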
Note that if you use something like Confluent Cloud, which has a more efficient mechanism for infinite topic retention, you can store very large data sets indefinitely, in a relatively cost-efficient manner.
As of the summer of 2021, there is no tool that will automatically implement such a centralized platform for data mesh. This is to be expected, because data mesh is so new. Similar to microservices, the whole point of a data mesh is to be decentralized, so there really shouldn't be only one tool. We would expect many tools to emerge.
But there does need to be a single point you can go to in order to discover the information in the mesh—similar to a service discovery system in a microservice architecture. It could be a UI, a wiki, an API, or something else.
Assume it's a UI. A data mesh UI might look something like this:
You could type search terms and discover schemas. You could introspect the data flowing through the mesh, looking at the various data types and values, and requesting that event streams be joined and translated.
In the image above, orders and customers are selected. One of them is a historical data set and the other is solely real-time events. A stream processing query, in this case delivered by ksqlDB, joins these together, along with a preprocessing step that filters out the events you're not interested in (you're only looking for orders for Platinum clients). The results are deposited in Elasticsearch for you to query.
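A sketch of that pipeline, issued here through the ksqlDB Java client, is shown below. The stream, table, and column names, the connection details, and the `PLATINUM` tier value are all hypothetical; in practice you might type the same SQL into the ksqlDB UI or CLI.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class PlatinumOrdersPipeline {

  public static void main(String[] args) throws Exception {
    Client client = Client.create(
        ClientOptions.create().setHost("localhost").setPort(8088)); // hypothetical ksqlDB server

    // Register the real-time orders stream and the customers table over
    // their underlying Kafka topics.
    client.executeStatement(
        "CREATE STREAM orders (order_id VARCHAR KEY, customer_id VARCHAR, amount DOUBLE) "
      + "WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');").get();
    client.executeStatement(
        "CREATE TABLE customers (customer_id VARCHAR PRIMARY KEY, name VARCHAR, tier VARCHAR) "
      + "WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');").get();

    // Join the stream to the table and keep only orders from Platinum clients.
    client.executeStatement(
        "CREATE STREAM platinum_orders AS "
      + "SELECT o.order_id AS order_id, o.customer_id AS customer_id, "
      + "       c.name AS customer_name, o.amount AS amount "
      + "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
      + "WHERE c.tier = 'PLATINUM' "
      + "EMIT CHANGES;").get();

    // The results land in the PLATINUM_ORDERS topic; a sink connector such as
    // the Elasticsearch sink (attached with a CREATE SINK CONNECTOR statement)
    // would then push them into Elasticsearch for querying.

    client.close();
  }
}
```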
The data mesh is related to another data implementation pattern: the database inside out. If you're not familiar with the term, search for it, as there are a lot of materials covering it.
In the database inside out pattern shown in the image above, Kafka provides the storage layer and ksqlDB provides the query processing: transformations, joins, and the ability to create passive views that you can then send queries to. So you have storage and query processing, and thus a database.
To be clear, this doesn't mean that there isn't anything called a database in your system. Those can still exist. This is just construing the system as one giant custom distributed database.
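To make that framing a little more tangible, here is a sketch that continues with the hypothetical `platinum_orders` stream from the earlier example: Kafka stores the events, and ksqlDB maintains a derived, passive view over them.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class OrdersByCustomerView {

  public static void main(String[] args) throws Exception {
    Client client = Client.create(
        ClientOptions.create().setHost("localhost").setPort(8088));

    // Kafka provides the storage (the events in the platinum_orders topic);
    // ksqlDB provides the query processing, keeping this aggregate view up to
    // date as new events arrive.
    client.executeStatement(
        "CREATE TABLE orders_by_customer AS "
      + "SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spend "
      + "FROM platinum_orders "
      + "GROUP BY customer_id "
      + "EMIT CHANGES;").get();

    client.close();
  }
}
```

Together, the topic and the continuously maintained table behave like the storage and query layers of a database turned inside out.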
In a data mesh, events can be processed either as pure events or as a queryable view of current state. This point really becomes important as the mesh grows. Imagine that the data mesh becomes global, with nodes that span continents, where delays are significant and connectivity is less reliable in some cases: probably still pretty good, but not what you get inside a data center.
In the image below, you have a destination data port on the left, which has two types of interfaces that make the data in the mesh globally accessible. The first interface, at the top, uses events. The data port decides what data it's interested in and submits that request to the mesh. The mesh then pushes the events into the data port as they become available, and the data port can persist them in whatever way suits it best (a database, S3, etc.).
The second implementation is query-based, rather than event-based. Here, you tell the data mesh to materialize a view, because that's what you need. You don't want a sequence of events; you want something you can look up using a key. For example, you might materialize a view of orders that are for a particular customer, which may be queryable by order ID or by customer. You can now interact with the mesh using a request-response paradigm.
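As a sketch of that request-response interaction, here is a pull query against the hypothetical `orders_by_customer` view created above. It looks up the current state for a single key and returns an answer immediately, rather than delivering a stream of events.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.Row;

import java.util.List;

public class OrdersByCustomerLookup {

  public static void main(String[] args) throws Exception {
    Client client = Client.create(
        ClientOptions.create().setHost("localhost").setPort(8088));

    // Pull query: ask the materialized view for the current state of one key
    // and get a single answer back, request-response style.
    List<Row> rows = client.executeQuery(
        "SELECT customer_id, order_count, total_spend "
      + "FROM orders_by_customer WHERE customer_id = 'C42';").get(); // 'C42' is a made-up key

    for (Row row : rows) {
      System.out.printf("customer=%s orders=%s total=%s%n",
          row.getString("CUSTOMER_ID"),
          row.getLong("ORDER_COUNT"),
          row.getDouble("TOTAL_SPEND"));
    }

    client.close();
  }
}
```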
When a global data mesh is implemented with Apache Kafka, the mesh is really one logical cluster made from many physical ones spread around the globe. While a data product can send data to and receive data from the mesh, when it plays the role of a destination port it typically can't use the mesh's underlying Kafka clusters directly, since those sit under central control. Instead, it should use its own Kafka instance. This pattern is simple to implement in the cloud using systems such as Confluent Cloud, where additional standard clusters introduce very small marginal costs.
So that's the third principle of data mesh: making data available everywhere in a self-serve manner. Hopefully you've seen enough implementation detail to start to make the idea concrete: these are just messages, produced into Kafka topics and consumed from Kafka topics. In the next module, we'll dive into the fourth and final principle as we explore federated governance.