Course: Streaming Data Governance

Discovering Streams

5 min
Wade Waldron

Staff Software Practice Lead


Overview

As our applications scale, one of the biggest challenges we face is finding the data streams we need. The experts on any given stream may be distributed across the business, and potentially across the globe. Even if we manage to locate the stream we need, knowing who is responsible for it can be almost impossible. Yet it is critical that we be able to locate the streams containing specific data, especially when regulations are involved: there is no excuse for being unable to explain where we have used Personally Identifiable Information (PII). In this video, we dive into the problem of stream discovery and introduce the Confluent Stream Catalog, which provides self-service discovery tools for finding the streams we need and locating the people responsible for them.

Topics:

  • Stream Expertise
  • Regulations
  • Personally Identifiable Information (PII)
  • Stream Metadata
  • Confluent Stream Catalog

Resources

Use the promo codes GOVERNINGSTREAMS101 & CONFLUENTDEV1 to get $25 of free Confluent Cloud usage and skip credit card entry.


Discovering Streams

When teams first start building streaming data pipelines, things are relatively simple. In the early days, the number of streams they need is small. There might be a lot of data, but it can be described by a few basic schemas. Often, a few members of the team have a strong understanding of the flow of data. Those few members could stand in front of a whiteboard and quickly draw out most of the streaming architecture.

This begins to change as the system scales up. Data streaming architectures tend to thrive as more streams become available; the act of combining those streams in interesting ways is what makes these systems powerful. Meanwhile, customer demand for real-time experiences is on the rise. As a result, if our system is successful, other teams are going to want to use it. They may start consuming our streams in their own applications, or they may create streams of their own.

Suddenly, a single expert on a single team isn't enough. Now we have multiple experts spread across multiple teams. Eventually, having multiple experts becomes problematic. Those experts may be spread across multiple areas of the company, or even the world. Coordinating with them can be difficult. Meanwhile, they have their own responsibilities, and outside teams constantly asking for help puts a strain on their time.

That assumes we even know who the expert is. Within the immediate team, or a closely related one, the expert might be well known, but as we expand those boundaries, it becomes less clear who the expert is for a particular stream. And it's not just the expertise that becomes hard to find. We may not even know the stream exists, which can lead us to duplicate the data, creating a whole new set of problems.

This becomes even worse when regulations are involved. Take the example of Personally Identifiable Information, or PII. Modern regulations, such as GDPR, require this information to be tightly controlled. The right to be forgotten requires us to find and delete a user's information at their request. If that data is spread over a collection of streams, how will we know we got it all? Worse, if teams didn't know where to find the data and duplicated it, how many copies might be out there? And if all of that information lives only in the heads of experts, how hard will it be to track them down and coordinate with them?

Of course, leaving this kind of knowledge only in the heads of experts is risky for many reasons. What if those experts leave the company? What we probably want is a list of all of the streams that contain personal information, with notes about who the expert on the data is and what the data looks like. That way, when we need to find the information, we can consult the list. This takes the burden off our experts, decreases the risk, and makes our data more discoverable.

That list only covers streams with personal information in them, though. We might want to perform similar searches for device data, location data, inventory data, et cetera. We want to know what data is available, where it can be found, and who is responsible for it. Essentially, we need a searchable catalog of metadata that contains the details of each stream. But where does all of this metadata live?
We could create a series of shared documents or spreadsheets and keep them in some kind of shared repository, but those documents are attached to the streams only by convention, and there's nothing to enforce it. This means it's easy for them to fall out of sync or get lost. We could build custom tools that enforce some of the standards for us, but that requires development resources. It sounds expensive, creates additional burdens of expertise, and isn't contributing to our business model.

Thankfully, Confluent Cloud provides a stream catalog out of the box. The catalog allows us to categorize our data using tags that are self-discoverable by anyone with access, and it is directly tied to the Confluent Schema Registry, so we can maintain and comment on the structural details of the data. With the Advanced Governance package, we can even attach business metadata, such as who is responsible for the stream. This allows us to search for the categories of information we need and obtain a variety of details about the streams. It puts the power of our streams at our fingertips.

If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
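To make this concrete, here is a minimal sketch of how tagging and searching might look against the Stream Catalog REST API, which Confluent Cloud exposes alongside the Schema Registry endpoint. The endpoint URL, API key, cluster ID, and topic name below are hypothetical placeholders, and the exact paths and payload shapes should be verified against the current Confluent Cloud documentation.

```python
import requests

# Hypothetical placeholders: substitute your environment's Schema Registry
# endpoint, a Schema Registry API key/secret, your Kafka cluster ID, and a
# real topic name.
CATALOG = "https://psrc-xxxxx.us-east-1.aws.confluent.cloud"
AUTH = ("SR_API_KEY", "SR_API_SECRET")

# 1. Define a PII tag for the environment (done once).
requests.post(
    f"{CATALOG}/catalog/v1/types/tagdefs",
    auth=AUTH,
    json=[{"name": "PII", "description": "Contains personally identifiable information"}],
).raise_for_status()

# 2. Tag a topic. Topics are addressed as "<kafka-cluster-id>:<topic-name>".
requests.post(
    f"{CATALOG}/catalog/v1/entity/tags",
    auth=AUTH,
    json=[{
        "entityType": "kafka_topic",
        "entityName": "lkc-abc123:customer-profiles",
        "typeName": "PII",
    }],
).raise_for_status()

# 3. Find every topic tagged PII -- for example, when servicing a
#    right-to-be-forgotten request.
results = requests.get(
    f"{CATALOG}/catalog/v1/search/basic",
    auth=AUTH,
    params={"types": "kafka_topic", "tag": "PII"},
).json()
for entity in results.get("entities", []):
    print(entity["attributes"]["qualifiedName"])
```

With something like this in place, servicing a right-to-be-forgotten request starts with a single query for the PII tag rather than a hunt for experts.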