As you begin to learn about data mesh, the second principle, data as a product, is possibly the most important concept to grasp.
In a data mesh, your domain's shared data is managed as a true product, and your objective is to provide this data in a clean and proper way so that other teams in your organization can use it seamlessly. In other words, you treat other teams as internal customers of your data. Your data isn't just a commodity that you insert, update, and delete in a database until your job is done, but is rather something you're publishing for other people—something you're proud of. Product thinking prevents data chauvinism, i.e. "this is our data and no one else's."
Somewhat counterintuitively, decentralization and granting local autonomy over domains can risk causing even more data silos, which is exactly what you want to avoid. Domains must consider themselves to be servants of the organization in terms of the data they publish.
A simple way to think of a data product is as the data equivalent of a microservice: it's a node on the data mesh, situated within a particular domain of the business. The role of a data product is to produce and consume high-quality data within the mesh, and like a microservice, it encapsulates everything it needs for its function. It has the actual data to share with the outside world; the code that creates, manipulates, and shares the data; and the infrastructure to store the data and run the code. It is similar to a microservice, but it publishes data rather than delivering compute.
Several technologies can distribute data between products in a mesh. Apache Kafka is a natural fit, because it makes data easy to share through streams (or topics). Kafka allows any product to consume from the high-quality stream of any other product.
In fact, the data mesh itself is actually quite similar to Confluent's concept of a central nervous system of data, whereby data is continuously flowing, all the while being analyzed and processed. (Keep in mind that the data mesh in the image above is a logical view, not a physical view.)
You may recall the spaghetti-like diagram in "What is Data Mesh?", which illustrated a system with many point-to-point data connections. An architecture employing Kafka is much simpler, because event streaming is a pub/sub pattern.
Data producers are decoupled from consumers in a scalable, fault-tolerant, and persistent way. A producer writes a message to a topic once, and any number of consumers can read it: write once, read many. All that a data producer has to do to participate in the data mesh is to publish its public data as a stream, so that other data products can consume it. There are a few naming and schema issues to work out, but overall it's relatively simple.
Kafka's event streams are also real-time, so you can propagate data throughout the mesh immediately. They are persistent and replayable: they capture both real-time and historical data in one step. Finally, their immutability makes them an ideal source of record and well suited to governance purposes.
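The "write once, read many" and replay behavior described above can be illustrated without a broker. The following is a minimal in-memory sketch, not Kafka itself: the `Stream` class, its method names, and the `orders` example are hypothetical stand-ins for a topic, a producer, and consumers that each track their own offset.

```python
from collections import defaultdict


class Stream:
    """In-memory stand-in for a topic: an append-only log of events."""

    def __init__(self, name):
        self.name = name
        self._log = []                      # events are only ever appended
        self._offsets = defaultdict(int)    # next offset to read, per consumer

    def publish(self, event):
        """Producer side: write once, return the offset assigned to the event."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, consumer):
        """Consumer side: return unread events and advance this consumer's offset."""
        start = self._offsets[consumer]
        events = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return events

    def rewind(self, consumer, offset=0):
        """Replay: move a consumer back so it re-reads history."""
        self._offsets[consumer] = offset


orders = Stream("orders")
orders.publish({"order_id": 1, "total": 20})
orders.publish({"order_id": 2, "total": 35})

billing = orders.poll("billing")              # reads both events
analytics = orders.poll("analytics")          # independently reads the same events
orders.publish({"order_id": 3, "total": 5})
billing_next = orders.poll("billing")         # only the new event
orders.rewind("analytics")
analytics_replay = orders.poll("analytics")   # full history, from the start
```

The producer never knows who is reading. Each consumer's position is its own state, which is what lets a new data product join later and still read the full history.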
There are three main options for getting data into and out of a data product:
You can quickly onboard data from existing systems through Kafka Connect and its library of connectors, allowing you to stream data from your systems into your data products in the mesh, in real time. There are about 200 connectors currently available, so you will rarely have to write anything yourself.
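As a rough illustration of what onboarding through a connector involves, the sketch below builds the kind of JSON definition that is submitted to Kafka Connect to create a connector. The connector name, database URL, table, and topic prefix are hypothetical. The property names follow the Confluent JDBC source connector as commonly documented, but you should verify them against the documentation for the connector version you use.

```python
import json

# Hypothetical connector name, database URL, table, and topic prefix.
connector = {
    "name": "orders-db-source",
    "config": {
        # Assumed connector: the Confluent JDBC source connector.
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://orders-db.example.internal:5432/orders",
        # Pick up new rows by watching a strictly increasing column.
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        # Rows from table "orders" land in topic "orders-domain.orders".
        "topic.prefix": "orders-domain.",
        "tasks.max": "1",
    },
}

# This document would be sent to the Connect REST API; here it is only printed.
payload = json.dumps(connector, indent=2)
print(payload)
```

The point is that onboarding is mostly configuration: no producer code is written, and the existing system does not need to change.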
Generally, only the team responsible for a data set needs to be concerned with its internal processes. Everyone else only cares about the public interface. How that public data is created (implementation details, language, data infrastructure, and so on) doesn't matter as long as it meets the criteria of being addressable, trustworthy, and of good quality.
The team responsible for processing the data, however, has two options wholly within the Kafka ecosystem: ksqlDB and Kafka Streams. To learn more about these, see the courses ksqlDB 101, Inside ksqlDB, and Kafka Streams 101.
We’ve seen in this course that data prepared for sharing has far more value to an organization than data treated chauvinistically. To properly share your data, however, you must use schemas to put data contracts in place. This requires some know-how.
When you publish a stream, you should use schemas with appropriate backwards compatibility and forwards compatibility. This ensures, for example, that when you add a new data field to a stream, any existing consumers of that stream can continue to operate without having to make any changes.
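To make the compatibility idea concrete, here is a simplified sketch of how a reader might resolve a record against its own schema. It loosely mirrors Avro-style resolution rules but is not a real schema library: the `read` function, the `MISSING` sentinel, and the `order`/`currency` fields are all assumptions made for illustration.

```python
MISSING = object()  # sentinel meaning "this field has no default"


def read(record, reader_schema):
    """Resolve a record against the reader's schema (simplified rules).

    - fields the reader does not know about are ignored
    - fields absent from the record fall back to the reader's default
    - a field that is absent and has no default is an error
    """
    result = {}
    for field, default in reader_schema.items():
        if field in record:
            result[field] = record[field]
        elif default is not MISSING:
            result[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return result


# Schemas map field name -> default value.
schema_v1 = {"order_id": MISSING, "total": MISSING}
schema_v2 = {"order_id": MISSING, "total": MISSING, "currency": "USD"}  # new field

old_record = {"order_id": 1, "total": 20}
new_record = {"order_id": 2, "total": 35, "currency": "EUR"}

# Forward compatibility: an existing (v1) consumer reads data written with v2.
old_consumer_view = read(new_record, schema_v1)

# Backward compatibility: an upgraded (v2) consumer reads data written with v1.
new_consumer_view = read(old_record, schema_v2)
```

Because the new field carries a default, both directions work: the existing consumer simply ignores `currency`, and the upgraded consumer fills it in when reading older events.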
However, if you do need to introduce a breaking change to the schema of the stream, use versioned streams. For example, if you're required, moving forward, to obfuscate certain data fields, and this will break existing consumers, publish a new version of your stream and deprecate the original stream once it isn't being used any more.
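A versioned stream can be sketched as publishing the same events under two names until the old one is retired. In this hypothetical example, the stream names `orders.v1` and `orders.v2`, the `email` field, and the use of a SHA-256 digest as the obfuscation are assumptions chosen only to show the shape of the approach.

```python
import hashlib


def to_v2(event_v1):
    """Build the v2 event: same order, but the email is replaced by a digest."""
    event_v2 = {k: v for k, v in event_v1.items() if k != "email"}
    digest = hashlib.sha256(event_v1["email"].encode("utf-8")).hexdigest()
    event_v2["email_hash"] = digest
    return event_v2


# Stream name -> list of events; a stand-in for real streams.
streams = {"orders.v1": [], "orders.v2": []}
deprecated = set()


def publish(event_v1):
    """Publish to every stream version that is still live."""
    if "orders.v1" not in deprecated:
        streams["orders.v1"].append(event_v1)
    streams["orders.v2"].append(to_v2(event_v1))


publish({"order_id": 1, "email": "ada@example.com"})
deprecated.add("orders.v1")   # once no consumer reads v1 any more
publish({"order_id": 2, "email": "bob@example.com"})
```

Existing consumers keep reading `orders.v1` unchanged while they migrate, and the breaking change is confined to `orders.v2`. A plain hash is used here only for brevity; real obfuscation requirements usually call for something stronger.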
Note that in practice, there will be situations where a data product is forced to make incompatible changes to its schemas. You can address this using the dual schema upgrade window pattern.
You may find that some data sources are difficult to onboard into a data mesh as first-class data products. This can be for many reasons, both organizational and technical. For example, batch-based sources often lose event-level resolution, rendering events non-reproducible. In such a situation, it is best to use event streaming along with change data capture (CDC), with or without the outbox pattern.
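The outbox pattern mentioned above can be sketched with an in-memory SQLite database. The table names, the `place_order` and `relay` functions, and the polling relay are assumptions for illustration; in a real deployment a CDC tool reading the database's change log would typically take the place of the relay.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
db.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)


def place_order(order_id, total):
    """Write the business row and its event in one transaction."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        event = {"type": "order_placed", "order_id": order_id, "total": total}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay(publish):
    """Forward unpublished outbox rows, in order, and mark them as sent."""
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


stream = []                          # stand-in for the event stream
place_order(1, 20)
place_order(2, 35)
first_run = relay(stream.append)
second_run = relay(stream.append)    # nothing left to publish
```

Because the order row and the outbox row are committed together, every state change produces exactly one recorded event, which preserves the event-level resolution that batch exports tend to lose.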
Finally, as mentioned when we discussed the first principle, learning more about Domain-Driven Design will help you better understand data mesh.