course: Data Mesh 101

Data as a Product

12 min

Tim Berglund
VP Developer Relations

Ben Stopford
Lead Technologist, Office of the CTO (Author)

Michael Noll
Principal Technologist (Author)

Data as a Product

As you begin to get your head around data mesh, the second principle, data as a product, is possibly the most important concept to grasp.

Product Thinking

In a data mesh, your domain's shared data is managed as a true product, and your objective is to provide this data in a clean and proper way so that other teams in your organization can use it seamlessly. In other words, you treat other teams as internal customers of your data. Your data isn't just a commodity that you insert, update, and delete in a database until your job is done, but is rather something you're publishing for other people—something you're proud of. Product thinking prevents data chauvinism, i.e. "this is our data and no one else's."

[Image: product thinking applied to a domain's shared data]

Somewhat counterintuitively, decentralization and granting local autonomy over domains risk creating even more data silos, which is exactly what you want to avoid. Domains must therefore consider themselves servants of the organization with respect to the data they publish.

A Microservice of the Data World

A simple way to think of a data product is as the data equivalent of a microservice: it's a node on the data mesh, situated within a particular domain of the business. The role of a data product is to produce, and possibly also consume, high-quality data within the mesh. Like a microservice, it encapsulates everything it needs to function: the actual data to share with the outside world; the code that creates, manipulates, and shares the data; and the infrastructure to store the data and run the code. It is similar to a microservice, but it publishes data rather than delivering compute.

[Image: a data product as the microservice of the data world]

The Mesh, Apache Kafka, and Event Streaming

There are several technologies that can distribute data between products in a mesh. Apache Kafka is a natural fit, given the way that it enables data to be easily shared using streams (or topics). Kafka allows any product to consume from the high-quality stream of any other product:

[Image: logical view of data products and domains connected in the mesh]

In fact, the data mesh is quite similar to Confluent's concept of a central nervous system of data, in which data flows continuously while being analyzed and processed. (Keep in mind that the data mesh in the image above is a logical view, not a physical view.)
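At the physical level, a data product's public interface is simply one or more Kafka topics that the owning team creates and manages. As a minimal sketch of that step, you might use Kafka's AdminClient; the topic name, partition count, and retention setting below are illustrative assumptions, not prescriptions from this course.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreatePublicTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical public stream of an "orders" domain's data product
            NewTopic ordersPublic = new NewTopic("orders.public.v1", 6, (short) 3)
                    // Retain events indefinitely so the stream stays replayable
                    .configs(Map.of("retention.ms", "-1"));

            admin.createTopics(Set.of(ordersPublic)).all().get();
        }
    }
}
```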

You may recall the spaghetti-like diagram in What is Data Mesh?, which illustrated a system with many point-to-point data connections. An architecture employing Kafka is much simpler, because event streaming is a pub/sub pattern:

[Image: data producers scalably decoupled from consumers]

Data producers are decoupled from consumers in a scalable, fault-tolerant, and persistent way. One produces a message to a topic, and one consumes—it's write once, read many. All that a data producer has to do to participate in the data mesh is to publish its public data as a stream, so that other data products can consume it. There are a few naming and schema issues to work out, but overall it's relatively simple.
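To make "publish your public data as a stream" concrete, here is a minimal producer sketch that writes a domain event to the public topic created above. The topic name and JSON payload are hypothetical; in practice you would use a schema-governed format (more on schemas below).

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrdersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Write once: publish a domain event to the product's public stream
            producer.send(new ProducerRecord<>("orders.public.v1",
                    "order-1001", "{\"orderId\":\"1001\",\"status\":\"CREATED\"}"));
            producer.flush();
        }
    }
}
```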

And Kafka's event streams are real time: you can propagate data throughout the mesh immediately. They are also persistent and replayable, capturing both real-time and historical data in one step. Finally, their immutability makes them an ideal source of record, which also serves governance purposes.

[Image: data product streams in the data mesh]
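Because the stream is persistent, a new data product can replay the entire history of another product's stream simply by reading from the earliest offset. Here is a minimal consumer sketch, again with hypothetical topic and group names:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-team");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Read many: a brand-new consumer group starts from the earliest event
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.public.v1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```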

Moving Data In and Out of a Data Product

There are three main options for getting data into and out of a data product:

  1. Event Streams. See the previous section for an explanation of why these work well. An additional advantage of event streams is that they continuously import and export data as soon as new business information is available.
  2. Request/Response APIs. You can use these synchronously to request data snapshots on demand (similar to a file export).
  3. Traditional ETL for Batch Import/Export of Nightly Snapshots. Traditional ETL gives you a fixed cadence of importing and exporting nightly snapshots, but compared to option 2, it provides less control over the snapshot data you'll be working with.

[Image: three main options for getting data into and out of a data product]

Kafka Connect for Quick Integrations

You can quickly onboard data from existing systems through Kafka Connect and its library of connectors, allowing you to stream data from your systems into your data products in the mesh, in real time. There are about 200 connectors currently available, so you will rarely have to write anything yourself.

[Image: Kafka Connect streaming data from existing systems into a data product]
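As an illustration of how little glue is involved, the sketch below registers a hypothetical JDBC source connector with the Kafka Connect REST API using Java's built-in HTTP client. The connector class, connection details, and topic prefix are assumptions you would adapt to your own source system.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSourceConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC source connector that streams new rows into the mesh
        String connectorConfig = """
            {
              "name": "orders-db-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://orders-db:5432/orders",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "orders-db-"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```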

Processing a Data Product's Contents: ksqlDB and Kafka Streams

Generally, only the team responsible for a data product needs to be concerned with its internals. Everyone else cares only about the public interface. How that public data is created (implementation details, language, data infrastructure, and so on) doesn't matter as long as it meets the criteria of being addressable, trustworthy, and of good quality.

The team responsible for processing the data, however, has two options wholly within the Kafka ecosystem: ksqlDB and Kafka Streams. To learn more about these, see the courses ksqlDB 101, Inside ksqlDB, and Kafka Streams 101.

[Image: ksqlDB and Kafka Streams processing data in motion]
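To give a flavor of what the inside of a data product might look like, here is a small Kafka Streams sketch that reads a hypothetical internal topic, filters out incomplete records, and publishes the cleaned result as the product's public stream. The topic names and the filtering rule are illustrative assumptions; the same transformation could be written declaratively in ksqlDB as a persistent CREATE STREAM ... AS SELECT query.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrdersDataProductTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-data-product");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Internal, private topic of the orders domain (hypothetical name)
        KStream<String, String> internalOrders = builder.stream("orders.internal");

        internalOrders
                // Drop records that are not yet fit to share with other domains
                .filter((orderId, payload) -> payload != null && !payload.isBlank())
                // Publish the cleaned stream as the product's public interface
                .to("orders.public.v1");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```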

Sharing Streams in Your Data Mesh: Schemas and Versioning

We’ve seen in this course that data prepared for sharing has far more value to an organization than data treated chauvinistically. To properly share your data, however, you must use schemas to put data contracts in place. This requires some know-how.

When you publish a stream, you should use schemas with appropriate backwards compatibility and forwards compatibility. This ensures, for example, that when you add a new data field to a stream, any existing consumers of that stream can continue to operate without having to make any changes.
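For example, with Avro you can add a field without breaking anyone: consumers still on the old schema simply ignore it, and giving it a default value means consumers on the new schema can still read older records. Here is a minimal sketch using the Confluent Avro serializer; the topic, field names, and registry address are hypothetical.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CompatibleSchemaProducer {
    // Version 2 of a hypothetical order schema: the new "channel" field has a
    // default, so consumers still reading with version 1 are not broken.
    private static final String ORDER_SCHEMA_V2 = """
        {
          "type": "record",
          "name": "Order",
          "fields": [
            {"name": "orderId", "type": "string"},
            {"name": "amount",  "type": "double"},
            {"name": "channel", "type": "string", "default": "WEB"}
          ]
        }
        """;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(ORDER_SCHEMA_V2);
        GenericRecord order = new GenericData.Record(schema);
        order.put("orderId", "1001");
        order.put("amount", 42.0);
        order.put("channel", "MOBILE");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders.public.v1", "order-1001", order));
            producer.flush();
        }
    }
}
```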

However, if you do need to introduce a breaking change to the schema of the stream, use versioned streams. For example, if you're required, moving forward, to obfuscate certain data fields, and this will break existing consumers, publish a new version of your stream and deprecate the original stream once it isn't being used any more.

[Image: publish streams with backward- and forward-compatible schemas]

Note that in practice, there will be situations where a data product is forced to make incompatible changes to its schemas. You can address this using the dual schema upgrade window pattern.
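One way to implement such a migration, sketched below under the assumption of simple versioned topic names, is to produce to both the old and the new stream for the duration of the upgrade window, then retire the old stream once its consumers have moved over.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DualWriteDuringUpgradeWindow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "order-1001";
            String v1Payload = "{\"orderId\":\"1001\",\"customerEmail\":\"a@example.com\"}";
            // v2 obfuscates the email field, which would break v1 consumers
            String v2Payload = "{\"orderId\":\"1001\",\"customerEmailHash\":\"9f86d081\"}";

            // During the upgrade window, publish both versions side by side
            producer.send(new ProducerRecord<>("orders.public.v1", key, v1Payload));
            producer.send(new ProducerRecord<>("orders.public.v2", key, v2Payload));
            producer.flush();
        }
        // Once all consumers have migrated to v2, stop producing to v1 and deprecate it
    }
}
```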

Some Data Sources Will Be Difficult to Onboard

You may find that some data sources are difficult to onboard into a data mesh as first-class data products. This can be for many reasons, both organizational and technical. For example, batch-based sources often lose event-level resolution, rendering events non-reproducible. In such a situation, it is best to use event streaming along with change data capture (CDC), with or without the outbox pattern.
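As a sketch of the outbox pattern mentioned above (the table names, columns, and connection details are hypothetical), the business change and its corresponding event are written in the same database transaction; a CDC connector watching the outbox table then turns those rows into events on the mesh.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class OutboxWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://orders-db:5432/orders", "app", "secret")) {
            conn.setAutoCommit(false);
            try {
                // 1. The business write
                try (PreparedStatement upd = conn.prepareStatement(
                        "UPDATE orders SET status = ? WHERE order_id = ?")) {
                    upd.setString(1, "SHIPPED");
                    upd.setString(2, "1001");
                    upd.executeUpdate();
                }

                // 2. The event, written atomically to the outbox table; a CDC
                //    connector streams this row into a Kafka topic for the mesh
                try (PreparedStatement outbox = conn.prepareStatement(
                        "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
                    outbox.setString(1, "1001");
                    outbox.setString(2, "OrderShipped");
                    outbox.setString(3, "{\"orderId\":\"1001\",\"status\":\"SHIPPED\"}");
                    outbox.executeUpdate();
                }

                conn.commit(); // both rows become visible together, or not at all
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```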

Finally, as mentioned when we discussed the first principle, to better understand data mesh it is helpful to learn more about Domain Driven Design.



Data as a Product

Hi, I'm Tim Berglund with Confluent. Welcome to Data Mesh module three: data as a product. This is the second principle of data mesh, treating data as a first-class product. Honestly, when you're first trying to get data mesh into your head, I think this is the most important principle to focus on. You've got to really own this, and then some of the rest starts to make sense.

So this is about product thinking. Your domain's shared data is managed as a true product, and your objective is to provide this data in a clean and proper way, so other teams can make use of it. In other words, you treat the other teams as internal customers of your data. Your data isn't just stuff you insert and update and delete in a database to get your job done; it's something you're publishing to people, something that you're proud of. This product thinking is so important, and it prevents what we call data chauvinism. That's a risk of principle one, which we covered in the previous module: thinking too much in closed terms. "This is just our data, it's the good stuff, nobody else's data matters, nobody needs to see our data." We don't want that. Decentralization and granting local autonomy over domains can cause the creation of even more data silos, and serious data quality problems, if the origin domains only care about themselves. So they really have to look at themselves as servants of the rest of the organization in terms of the data they publish.

So what's a data product? Well, let's keep going. Think of a data product as a microservice for analytics, or a microservice for the data world. It's a node on the data mesh, situated within a particular domain of the business. The role of the data product is to produce, and maybe also consume, high-quality data within the mesh. And like a microservice, it encapsulates everything it needs for its functioning; it's all right there. You've got the actual data we want to share with the outside world, the code that creates it and that manipulates it and shares it, and the infrastructure that we need to store the data and to run the code. So it is really pretty similar to a microservice; it's just that its purpose is publishing data and not delivering compute.

Before we take a closer look at a data product, let's zoom out again and look at the bigger picture of the data mesh, because a key question here is: how is data shared across data products in the mesh? That is, how do we implement the links and connections between the nodes in the network? Even visually, in this hypothetical diagram I'm showing you, we can see that this interconnectivity lends itself very naturally to (you saw this coming) event streaming with Kafka. This seems like a natural fit. Here, data is provided to other data products through streams in Kafka, or topics, if you prefer to call them that. Any other data product can consume via Kafka from the high-quality data streams of other data products. This is really a hand-in-glove match, as I look at it. And this idea of a data mesh is very similar to the Confluent idea of a central nervous system, the way we talk about event streaming, where data is continuously flowing, being processed, analyzed, and acted upon. These things go together well. Of course, we have to remember that the data mesh shown here is a logical view. We're trying to get data mesh into our heads here.
And these nodes are local pieces of domain-specific data functionality, publishing products to other nodes. This isn't a physical view; the physical view would be Kafka topics exchanging messages. So if you know Kafka, you know that the reality there looks a bit different, and potentially even simpler, which is a good thing. That's because event streaming isn't a point-to-point architecture that leads to the spaghetti mess we saw in module one, and that a lot of us have experienced painfully in our own lives. Instead, it's an architecture that decouples data producers from consumers in a scalable way, a fault-tolerant way, and a persistent way, because streams are durably stored.

So here a data product talks to other data products only indirectly; it's not point to point. Really, one is producing messages to a topic and another is consuming from a topic, or a stream, and that layer of indirection is those topics, those event streams, in Kafka. This setup is write once, read many, and all a data producer, one of these nodes, has to do to participate in the data mesh is publish its public data as streams. Then other data products can independently consume the data the way they see fit. Now, look, we've got some naming things to work out, some schema things to work out, some implementation details, but you can see how well Kafka, as a basic piece of data infrastructure, and data mesh go together.

As a quick refresher, let's recap the properties of event streams and why they're such a great fit for this. For example, streams are real time: you can propagate data throughout the mesh pretty much immediately, as soon as new information is available. Streams are persistent and replayable: they capture real-time and historical data in one step. And they're immutable, so they make for a great source of record, which is great for governance purposes. And this approach of loosely coupling things together through event streams and Kafka is already a thing for microservices; again, we see that parallel.

So let's get back to data products and the question of how we get data into and out of a data product. There are three main options. One is to just use event streams. We just discussed why this is a good fit for the data mesh, and another strong advantage is that they continuously import and export data as soon as new information is available to the business. Two, you could use request/response APIs synchronously to request a data snapshot on demand, like a file export; that could still have a place in the system. Three, you could use traditional ETL for batch import/export of nightly snapshots. This will give you a fixed cadence of importing and exporting these snapshots, but compared to option two, less control over exactly what snapshot data you'll be working with.

With event streaming, you can quickly onboard data from existing systems through Kafka Connect and connectors that stream data from those systems into the mesh in real time. There are about 200 of those in existence at the time of this recording, in the early summer of 2021. Those are readily available, so in most cases you don't have to write anything yourself; that integration is probably done for you. It's a great way to set data in motion when you're bootstrapping a data mesh. Now that there's some data flowing into our product, what happens then?
Well, really, what's happening inside a data product is of no concern to the data mesh; I shouldn't worry about that at all. Apart from the team responsible for the data product, who are quite concerned about it, everybody else only cares about the public interface, the public data that's shared by the domain. How that public data is created, the implementation details, language, data infrastructure, whatever's going on in there, doesn't matter as long as it meets criteria such as being addressable, trustworthy, and of good quality. That being said, the local team does care about how the data is being made, and among other interesting technologies, event streaming is again useful for implementing the insides of a data product. Naturally, you'd use streaming tools like ksqlDB and, in Java, the Kafka Streams client library to process streaming data that enters a product or is created within it, and, in the same way, streaming data that leaves it and is shared with the outside world and the data mesh. If you're interested in learning more, we've got separate courses available for you. They cover ksqlDB and Kafka Streams in detail, and of course introductory Kafka material as well.

This brings us to the last point: in which way should we share our data product streams within the mesh? The two main considerations here are the use of schemas and versioning. When you publish a stream, you should use schemas with appropriate backward and forward compatibility. This ensures, for example, that when you add a new data field to a stream, any existing consumers of that stream can continue to operate without having to make any changes. However, if you do need to introduce a breaking change to the schema of the stream, then use versioned streams. For example, if you're being required to obfuscate certain data fields moving forward, which likely does break existing consumers, then publish a new version of your stream and deprecate the original once no one is using it anymore. So we really see here that data on the outside is harder to change, but it has more value for the organization at large. Because of this, you should use schemas to put data contracts in place. There will be situations in practice where a data product is forced to make incompatible changes to its schema. This happens; the world changes, and we have to deal with it. You can address that using the dual schema upgrade window pattern, which I was just hinting at a moment ago.

The next recommendation is that in a data mesh, you should always favor getting the data from the source rather than from intermediaries. Otherwise, data quality will deteriorate. It's like a game of telephone: the more edges you've got in that graph, the worse it's going to get, no matter what. Thankfully, it's easy to circumvent this problem through event streaming, because you can subscribe to data directly from its authoritative source. The third thing we see is to change data at the source whenever there's a need for such a change, like the case of fixing an error. You shouldn't fix data up locally as you consume it, for reasons we mentioned earlier in this module with the Alison and Joe example, with orders and inventory. Finally, you'll soon see that some data sources will be difficult to onboard into a data mesh as first-class data products. This can be for many reasons, either organizational or technical. For example, batch-based sources often lose event-level resolution.
Events can be non-reproducible in those sources. In such a situation, we use event streaming together with change data capture, maybe the outbox pattern, things like that, to integrate those difficult sources into the mesh. A big take-home from this principle is to learn domain-driven design. And remember, to learn more, you can check out the other courses on Confluent Developer. That wraps it up for this lesson.