Data governance is a set of processes and standards that allows businesses to make effective use of the information they collect. It is built on four principles: availability, usability, integrity, and security. In this video, we will outline each of these principles and discuss how they relate to Stream Governance.
Businesses undergoing a digital transformation have been forced to contend with the massive amounts of data they are now collecting. This has led to the growth of an industry known as data governance. Data governance is a collection of processes and tools that help a business manage its information effectively. It is supported by four main principles: availability, usability, integrity, and security. When we move into the realm of data streaming, we need to look at data governance through that lens. The same principles apply to stream governance, but how they impact the system can be different. Let's take a look at each of them.

Availability of data is critical to the operation of any system. If data becomes unavailable for any length of time, functionality degrades. Losing access to data often translates into a loss of customer confidence and revenue. For example, if a company is trying to keep up with the launch of the next big smartphone, it is important that the storefront remains operational; a momentary outage could cost millions or even billions of dollars. Therefore, it is critical that we ensure our data remains available when it is needed.

There is no single magic solution to maintaining availability in your system. Instead, availability tends to come from a combination of techniques, and data streaming is one of them. A common technique when working with streams is to use them to replicate data to multiple locations, which can help mitigate the impact of an outage. And if a downstream consumer of the data fails, that doesn't mean the data stops flowing. We can continue to collect data and push it into a streaming platform such as Kafka, and when the downstream consumer recovers, it can pick up where it left off. This won't prevent the outage, but it can help mitigate its impact. So simply by using data streaming, we've already started to tackle the availability problem.

Usability becomes harder to manage as your system scales up. When the system is small, finding the data you need tends to be relatively easy. A single developer might be able to keep track of where the data lives in their head, or they might record it in a document or diagram somewhere. But as the system grows, this becomes difficult. When building new features, developers may struggle to find the data they need, and even if they manage to find it, how do they know who is responsible for it? At a small scale, these problems aren't obvious, but eventually, as the business grows, we're going to need tools to help us manage all of this data.

To maintain the usability of our data, we need a way to catalog it. We need to be able to tag our data streams in a meaningful way and then search through those tags. For example, we may want to tag a particular stream to indicate that it contains personally identifiable information. That way, if a GDPR request comes through, we can find where the personal information has been stored. We may also want to annotate the data with details, such as who is responsible for it. Then, when building a new feature that leverages the stream, we would know who to reach out to if we have any questions.
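To make this more concrete, here is a minimal sketch of what tagging a topic as PII might look like against the Stream Catalog REST API in Confluent Cloud. The endpoint path, the entity name format, and the payload shape shown here are assumptions for illustration, and the cluster IDs and credentials are placeholders; consult the Stream Catalog API reference for the exact contract.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class TagPiiTopic {
    public static void main(String[] args) throws Exception {
        // Placeholders: Schema Registry endpoint and API key/secret for your environment.
        String catalogUrl = "https://psrc-xxxxx.us-east-2.aws.confluent.cloud";
        String credentials = Base64.getEncoder()
                .encodeToString("SR_API_KEY:SR_API_SECRET".getBytes());

        // Assumed payload shape: apply a previously defined "PII" tag to a Kafka topic entity.
        String payload = """
                [{
                  "entityType": "kafka_topic",
                  "entityName": "lsrc-xxxxx:lkc-xxxxx:customer-addresses",
                  "typeName": "PII"
                }]""";

        // Assumed endpoint path -- verify against the Stream Catalog API documentation.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(catalogUrl + "/catalog/v1/entity/tags"))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once streams are tagged this way, a search for the PII tag can surface every topic that needs attention when, for example, a GDPR request arrives.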
Unfortunately, even if we could guarantee the availability and usability of our data, that wouldn't be enough. Data is rarely static, especially in a modern digital world. The data we collect and how we use it changes all the time. If we aren't careful, the data may evolve to the point where it is no longer compatible with downstream consumers. It is therefore critical that we put standards in place to ensure the integrity of our data, even as it evolves. These standards can be enforced using data schemas. A schema defines what form the data will take, and as long as both the producer and consumer support the current schema, everything should work fine. However, that can't be a fixed definition; it needs to be able to evolve. We also need to keep track of how it has evolved over time and ensure that we remain compatible with older versions. This means we'll need tools to manage and enforce our schemas.

This brings us to the final principle of data governance: security. In many ways, security is perhaps the most important of the four principles. A failure in any of the other three will potentially result in a loss of revenue, development velocity, or customer confidence. A failure in security could result in all of the above, plus legal or even governmental intervention. The consequences of such a failure can be far-reaching, so security is not something to be taken lightly. Here at Confluent, we have built multiple courses on security because we want to make sure you are prepared for anything. Because we have other courses focused on security, we aren't going to go into a lot of depth here. What you need to know is that everything we will discuss has important layers of security behind it. As we talk about streaming and cataloging all of this data, remember that it can all be protected behind layers of authentication, authorization, and even encryption. We want to ensure that the data is available when we need it, where we need it, but we also need to make sure it is only accessed by the right people.

We've discussed the four principles of data governance, but unfortunately, building the tools to support these principles is not easy. Any work put into supporting them takes time away from building concrete business value. Nobody makes money by providing availability or data integrity, yet they are critical to the ongoing evolution of the business. Thankfully, Confluent Cloud has built tools to support our stream governance needs. The three pillars of Confluent Stream Governance are stream quality, stream catalog, and stream lineage. Together, these three pillars help ensure the availability, usability, and integrity of your data streams. Meanwhile, as with everything in Confluent Cloud, security is fundamental to all three. We can leverage these tools knowing that they are backed by the layers of security necessary to keep our data safe. Understanding these tools and the roles they play will be the focus of the rest of this course.

If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
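As a small, optional preview of the schema evolution discussed above, here is a minimal sketch of a backward-compatible schema change, checked locally with Apache Avro's compatibility utilities. The "Order" schema and its fields are invented for illustration; in Confluent Cloud, Schema Registry performs this kind of compatibility check for you when a new schema version is registered, according to the compatibility mode you configure.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    // Version 1 of a hypothetical "Order" schema.
    private static final String V1 = """
            {"type": "record", "name": "Order", "fields": [
              {"name": "id", "type": "string"},
              {"name": "amount", "type": "double"}
            ]}""";

    // Version 2 adds a field with a default, so records written with v1 can still be read.
    private static final String V2 = """
            {"type": "record", "name": "Order", "fields": [
              {"name": "id", "type": "string"},
              {"name": "amount", "type": "double"},
              {"name": "currency", "type": "string", "default": "USD"}
            ]}""";

    public static void main(String[] args) {
        Schema writer = new Schema.Parser().parse(V1); // schema the existing data was written with
        Schema reader = new Schema.Parser().parse(V2); // schema the updated consumer uses

        // Backward compatibility: can a v2 reader consume data produced with v1?
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
        System.out.println(result.getType()); // COMPATIBLE, because the new field has a default
    }
}
```

Dropping the default from the new field, or removing a field the reader still requires, would flip the result to INCOMPATIBLE, which is exactly the kind of break a managed schema registry is there to catch before it reaches your consumers.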