
Data Ownership

Data ownership is the first step to ensuring data protection, security, and governance. Here are examples of data ownership and how it differs from centralized data warehouses.

7 min
course: Data Mesh 101

Data Ownership by Domain


Tim Berglund

VP Developer Relations

Ben Stopford

Lead Technologist, Office of the CTO (Author)

Michael Noll

Principal Technologist (Author)

Data Ownership by Domain

The first principle of data mesh is data ownership by domain, or domain-driven decentralization. This principle simply requires that data be owned by those who understand it best. It's a reaction to the years that most companies have spent doing things the other way: putting everything into a centralized data warehouse.

The Antipattern

To illustrate the data mesh pattern, it's most effective to begin with its antipattern: the data warehouse. A typical data warehouse implementation has many data sources spread across a company, with varying levels of quality. There will be many ETL jobs, which are possibly running in different systems and pulling data sets back to the central data warehouse. The data warehouse teams will often be required to clean up and fix a lot of the data. If you've ever done this kind of work, you know that 95% of the work is cleaning. Extracting and loading takes up the little bit of remaining time.

This centralized approach cuts across domains, or units of business, by optimizing for a common set of technological skills, rather than a set of business skills: The team running the data warehouse usually doesn't understand the data sets very well because their focus is running the data warehouse as expertly as possible. Even if they do have some familiarity with the data, they are never going to know it as well as the domain teams where the data originates.

[Figure: data mesh antipatterns]

Another challenge is that the source systems that are feeding data into the data warehouse don't always behave particularly well. They haven't been built with responsible data sharing in mind, since the builders are being pressed to build application features, not thinking about the best possible way to make the data shareable.

So instead of optimizing horizontally around a shared set of technical skills, as with a data warehouse, you should optimize vertically, around business domains.

The Decentralized Data Mesh

In a data mesh, ownership of an asset is given to the local team that's most familiar with it—the ones who are intimately familiar with its structure, its purpose, and its value. In this decentralized approach, many parties work together to ensure excellent data. The parties that own the data have the responsibility to be good stewards of that data, and they are explicitly identified and known to the rest of the organization.
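One way to picture owners being "explicitly identified and known to the rest of the organization" is a small ownership registry. The sketch below is only an illustration under stated assumptions: the `DataProduct` and `OwnershipRegistry` names, the fields, and the example team and contact values are all hypothetical and do not come from the course or from any Confluent API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical record describing one data asset published to the mesh."""
    name: str        # the data set or stream, e.g. "orders"
    domain: str      # business domain where the data originates
    owner_team: str  # team responsible for stewarding the data
    contact: str     # how consumers can reach the owning team


class OwnershipRegistry:
    """Hypothetical in-memory catalog: every data product has exactly one owner."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Refuse a second owner so that responsibility stays unambiguous.
        if product.name in self._products:
            raise ValueError(f"{product.name!r} already has an owner")
        self._products[product.name] = product

    def owner_of(self, name: str) -> DataProduct:
        # Raises KeyError if nobody has claimed the data set.
        return self._products[name]


registry = OwnershipRegistry()
registry.register(
    DataProduct("orders", "orders", "orders-team", "#orders-data")
)
```

A consumer who receives bad data can then call `registry.owner_of("orders")` to find out whom to contact, rather than patching the data on their own side.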

A Practical Example of Data Mesh

[Diagram: data mesh practical example]

Imagine Alice, at the top of this diagram, is working on an orders domain. Joe, at the bottom of the diagram, is working on inventory management, and he gets some bad order data from Alice.

  • The first thing Joe might do is try to fix the data locally for himself. While he may be successful, this only fixes the data for his service. By fixing the data locally, he also couples his inventory product to his local fixes, which are themselves brittle: they risk breaking if Alice later decides to fix the problem at the source.

  • A better solution is for Alice to fix the data on her end, which will eliminate Joe's problem. This is for the benefit of not just Joe, but anyone else who might want to consume Alice's data.
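The two options above can be sketched in a few lines of code. This is a toy model, not anything from the course: the `OrderEvent` type, the sign-flip bug, and the function names are assumptions invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class OrderEvent:
    order_id: str
    sku: str
    quantity: int  # units ordered; expected to be positive


def publish_orders(source_fixed: bool) -> List[OrderEvent]:
    """Alice's orders domain. Hypothetical bug: quantities are emitted negated."""
    true_orders = [
        OrderEvent("o-1", "widget", 3),
        OrderEvent("o-2", "gadget", 5),
    ]
    if source_fixed:
        return true_orders
    return [replace(o, quantity=-o.quantity) for o in true_orders]


def joe_local_fix(events: List[OrderEvent]) -> List[OrderEvent]:
    """Joe's consumer-side patch: flip the sign back before using the data."""
    return [replace(e, quantity=-e.quantity) for e in events]


def units_ordered(events: List[OrderEvent]) -> int:
    """What any consumer might compute from the order events."""
    return sum(e.quantity for e in events)


buggy = publish_orders(source_fixed=False)
fixed = publish_orders(source_fixed=True)

joe_before = units_ordered(joe_local_fix(buggy))  # Joe's patch hides the bug for Joe
others_before = units_ordered(buggy)              # every other consumer still sees it
others_after = units_ordered(fixed)               # source fix helps all consumers
joe_after = units_ordered(joe_local_fix(fixed))   # Joe's stale patch now corrupts good data
```

In this toy model, Joe's patch only helps Joe while the bug exists, and it turns correct data into wrong data as soon as Alice fixes the source. That is the coupling the first bullet warns about.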

So for the data mesh to work, there are a couple of requirements:

  1. There must be a process that allows Joe and Alice to communicate quickly.

  2. There must be a sense of responsibility amongst members of the mesh: Alice needs to be both willing and able to make her changes quickly, which will ensure that her data in the mesh is always of good quality (there should be incentives in place to ensure that this happens).

Respect Language and Nomenclature (DDD Style)

Similar to domain-driven design practices, a standard language and nomenclature should be used for all of the data in your decentralized data mesh. This ensures that your streams of events will create a shared narrative that all business users can understand. (All of the various flows through the mesh should be comprehensible to non-technical people.)

Data Ownership by Domain

Hi, I'm Tim Berglund with Confluent. Welcome to Data Mesh Module 2: Data Ownership By Domain. So, this is the first principle of Data Mesh, domain-driven decentralization or data ownership by domain. The objective of this principle is to ensure that data is owned by those who understand it best. So, this pattern is a reaction to what we've come to realize over years of practice of doing things the other way, that a centralized approach doesn't work at non-trivial scale. Okay. What that centralized approach does is cuts across domains, cuts across units of operational or business expertise, domain expertise by optimizing for a common set of technological skills. So people who know how to run the data warehouse, they're the data team. They might not know much about how particular parts of the business work because they're too busy trying to be good at data warehouses. So that's just the wrong direction. We think, you know, that we're cutting the wrong way. We should be cutting vertically, just as we've been saying for a few decades now, the way we ought to cut application functionality when it comes to application work. So that horizontal or centralized data management might work well for a small startup, but not for an enterprise. And probably the best way to understand this pattern is to think not about the pattern itself, but its anti-pattern, the central data warehouse like I was just saying. So let's go back to that. Typical data warehouse implementation, here, is that we've got lots of data sources spread across the company with, you know, varying levels of quality between them. There'll be many ETL jobs which are maybe running in different systems and pulling these data sets back to the central data warehouse and the data warehouse teams are often required to clean up and fix a lot of that data. 
If you've ever done this work or if you love someone who has, you know that like 95% of the work is cleaning and the actual extracting and loading is, you know, the remaining 5%, maybe it's the remaining 40% as these jokes go. But this central team doesn't really understand the data sets that well. Certainly less well than the domain teams where the data originates, the people who are actually, you know, inserting those rows in databases or producing those messages to Kafka topics. Another challenge is that these source systems which are feeding data into the data warehouse don't behave particularly well. They haven't been built with responsible data sharing in mind. These are people, you know, imagine sort of getting beaten on to build application features. They're not thinking, gee, how can I make this data shareable? They're like, no, I just, I need to make, I have these, these story cards. I need to pull them off the wall and implement them. I need to get these things done and usually supporting analytics isn't a concern there. That's the thing that this horizontal data team will come in and translate your strange, extremely normalized application schema into something they can use. That just hasn't worked out well, is kind of the lesson. And the centralized approach is in contrast to the de-centralized approach of the Data Mesh. In a data mesh, ownership of a data asset is given to the local team that's most familiar with it and its structure and its purpose and its value. In this decentralized approach, there are many parties who are working together in a big enterprise, right? There's lots of teams building stuff and they all exchange data through the Mesh. And we'll talk about the nature of that exchange here. If you could just kind of, you know, imagine data moves from place to place for right now. 
Here, the parties that own the data and the responsibility to be a good steward of that data, those are both explicitly defined and known to the rest of the organization. And that owner is typically co-located in the part of the organization where that data originates. That's what we call the data's domain. Rather than being centralized as we did with a central data warehouse team. This is really what we mean when we describe Data Mesh as being decentralized. So a practical example: imagine Alice is working on the orders domain at the top here. We have Joe down in the inventory management at the bottom. Joe gets some bad order data from Alice. What does he do? Well, his first option is to try and fix that data locally for himself. And maybe he would do that. He, you know, he might be able to, he might not, but let's just say he does. The problem is that doing that just fixes the data for him. Also, he's coupling his inventory product to his local fixes that he's applied, which are themselves brittle. They have a risk of breaking if Alice decides to fix the problem at the source and his own fix might be brittle with respect to hers. You know, she corrects the error at the order service where it came from. So a better approach is for Alice to fix Joe's problem at the source, for the benefit, not just of Joe, but anybody else who might want to consume her data product. So in this way, everybody in the organization can benefit from this change. For that to work, though, there has to be a couple things. One, a process where Joe and Alice can communicate quickly. And two, Alice has to be a responsible member of the Data Mesh. So that means she's both willing and able to make changes like this quickly so her data in the Mesh is always of good quality. There have to be incentives in place for Alice to think that way. For her to think that this is a product that she's maintaining. 
So in summary, we can say that this approach ensures that the data doesn't diverge in different ways across the organization, Alice is always publishing, in a sense, the right thing. The main recommendation we can take from this principle, domain-driven decentralization, is to learn from domain-driven design, or DDD. You should use a standard language and nomenclature for data. You know, that stream of events really should create a shared narrative that business users can understand. And their various flows through the Mesh are comprehensible by those non-technical people. All right, this wraps up the first principle of the Data Mesh, which is called, again, domain-driven decentralization. In the next module, we'll dive into the second principle, treating data as a first-class product.
