Sr. Director, Developer Advocacy (Presenter)
Lead Technologist, Office of the CTO (Author)
Principal Technologist (Author)
The third principle of data mesh is building a self-service platform. This concerns a problem that you may have faced in your work in the past: how to access the data you need in a reasonable amount of time.
A new trade surveillance system was being built in a large financial institution. This system needed data from thirteen different systems:
The end result was that just getting data into the database in a logical way took around nine months.
The third principle, data available everywhere, as a self-service, specifically tackles the problems that the bank faced.
Unlike the other data mesh principles, the third principle is achieved through a central infrastructure. As a rule, decentralization is preferable in data mesh, but in this case it is quite beneficial to implement a centralized service that lists available event streams and the data that they contain.
This isn't centralizing data ownership, data product ownership, domain expertise, and so on; we know that those things are decentralized. This is just an infrastructure layer. Basically, the system needs to manage both real-time and historical data, and make that data available everywhere within the company.
The process of accessing this company data and getting it into your individual database should be autonomous, or at least automatable. Just like with microservices, the most efficient architectures allow you to get the data you need and execute it at your own cadence through rapid data provisioning.
Consuming data from the data mesh means potentially consuming both real-time and historical data. There are two ways to do this. The first, described in the bank scenario above, is called the Lambda Architecture. It requires you to build two separate systems, one for historical data and one for real-time data.
A better solution is known as the Kappa Architecture. It begins with an event streaming platform that stores the streams indefinitely. Destination data ports, which consume the data, can choose to start with the latest events from the platform, or to consume all historical events first. In Apache Kafka®, you achieve this by doing two things: First you set a topic for infinite retention so that it has all of the history, then you reset your consumers back to
offset: 0 to start at the beginning of the topic. In some cases, you can also use compacted topics as an optimization, to reduce the amount of time required to load the initial snapshot. This eliminates historical data, so it’s only for cases where you truly don’t need that data.
Note that if you use something like Confluent Cloud, which has a more efficient mechanism for infinite topic retention, you can store very large data sets indefinitely, in a relatively cost-efficient manner.
As of the summer of 2021, there is no tool that will automatically implement such a centralized platform for data mesh. This is to be expected, because data mesh is so new. Similar to microservices, the whole point of a data mesh is to be decentralized, so there really shouldn't be only one tool. We would expect many tools to emerge.
But you need a single tool you can use in order to discover the information in the mesh—similar to a service discovery system in a microservice architecture. It could be a UI, a wiki, an API, or something else.
Assume it's a UI. A data mesh UI might look something like this:
You could type search terms and discover schemas. You could introspect the data flowing through the mesh, looking at the various data types and values, and requesting that event streams be joined and translated.
In the image above, orders and customers are selected. One of them is a historical data set and the other is solely real-time events. A stream processing query, in this case delivered by ksqlDB, joins these together, along with a preprocessing step that filters out the events you're not interested in (you're only looking for orders for Platinum clients). The results are deposited in Elasticsearch for you to query.
The data mesh is related to another data implementation pattern: the database inside out. If you're not familiar with the term, search for it, as there are a lot of materials covering it.
In the database inside out pattern shown in the image above, Kafka provides the storage layer and ksqlDB provides the query processing (transformations, joins, the ability to create passive views which you can then send to other queries). So you have storage and query processing, and thus a database.
To be clear, this doesn't mean that there isn't anything called a database in your system. Those can still exist. This is just construing the system as one giant custom distributed database.
In a data mesh, events can be processed either as pure events or as a queryable view of current state. This point really becomes important as the mesh grows. Imagine that the data mesh becomes global, with different nodes that span continents, where delays are significant and connectivity is less than reliable in some cases—probably still pretty good, but not as fast as inside a data center.
In the image below, you have a destination data port on the left, which has two types of interfaces that make the data in the mesh globally accessible. The first interface, at the top, uses events. The data port decides what data it's interested in, and submits it to the mesh. Then the mesh pushes the events into the data port as they become available. Those events can then be persisted by the data port in whatever way suits them best (into a database, S3, etc.).
The second implementation is query-based, rather than event-based. Here, you tell the data mesh to materialize a view, because that's what you need. You don't want a sequence of events; you want something you can look up using a key. For example, you might materialize a view of orders that are for a particular customer, which may be queryable by order ID or by customer. You can now interact with the mesh using a request-response paradigm.
When a global data mesh is implemented with Apache Kafka, the mesh is really one logical cluster made from many physical ones spread around the globe. While the data product can send data to or receive data from the mesh, if it's playing the role of a destination port, it typically can't use the underlying Kafka implementation, since that sits under central control. Instead, it should use its own Kafka instance. This pattern is simple to implement on the cloud using systems such as Confluent Cloud, where using additional standard clusters introduces very small marginal costs.
So that's the third principle of data mesh: making data available everywhere in a self-serve manner. Hopefully you’ve seen enough implementation detail to start to make the idea concrete: These are basically messages, produced into Kafka topics and consumed from Kafka topics.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.