Data streams rarely live in isolation. Instead, they are connected to form pipelines of related streams. As a system grows in complexity, understanding the relationships between those streams becomes difficult. And yet this understanding is critical to data governance: we need to know where our data comes from and where it is going, whether to meet regulatory requirements, protect the data, or simply understand it better. In this video, we'll discuss some of the challenges we encounter when trying to understand our data streams, and we'll introduce Confluent Stream Lineage, a tool for visualizing your data pipelines.
When we discuss streaming architectures, we focus a lot on individual streams. We talk about the data in a stream, who's producing it, and who's consuming it, but streams rarely exist in isolation. Instead, systems tend to be built as a collection of interconnected streams.

Consider a concrete example. We have an e-commerce site with many moving parts. A Customer microservice manages information for the customers. An Order service tracks the details of a customer's order. The Payment service knows whether or not the order has been paid for. And the Shipment service makes sure the product gets to the customer. But this is only scratching the surface. We haven't talked about inventory, pricing, fulfillment, or a variety of other critical systems. Most robust and scalable systems are built on a foundation of multiple services communicating through asynchronous data streams.

Individual data streams are built up into larger networks, often called pipelines. Data flows through the pipelines, often passing through multiple services and being transformed at each step. Operations such as filters, merges, and branches can drastically alter not just the shape of the data, but the volume as well (the sketch below shows a small example).

As the complexity of a pipeline increases, it can become difficult to understand the lineage of the data. Think of lineage as knowing where the data comes from, where it is going, and how it changes along the way. On a small scale, the problem isn't obvious. We can easily draw the entire system on a whiteboard and track the lineage of the data through that understanding. This tells us not just who created the immediate event, but also any parent or child events, and so on. We can trace any event back to its original source or forward to its final destination.

But when working with streams at scale, the problem becomes difficult or impossible for a human to manage. It's definitely not going to fit on a standard whiteboard, and yet we do need this kind of information. It can be critical when we need to diagnose and fix bottlenecks. Knowing how much data is flowing through each part of the system can help us plan where we need more resources and where we need fewer. It's also fundamental to our ability to ensure the safety and integrity of our data. When an auditor comes knocking to see if we are complying with regulations, we are going to need to know exactly where the data comes from and where it is going. The auditor won't want to hear that the system is too complex and we don't know where the data ended up. And in the worst case, if a portion of our data is breached by an attacker, understanding the lineage of the data can help us determine the blast radius of the attack. This allows us to quickly isolate the affected systems while leaving others intact.

Perhaps most importantly, knowing the lineage can grant us a deeper understanding of our data. This is useful for all of the reasons outlined above and more. For example, if we wanted to reuse that data for some purpose, it helps to understand where the data came from and how it evolved. This can help us establish confidence in the data and its integrity over time.

What we need is a way to easily visualize the lineage of our data, not just today, but potentially at critical points in the past. Ideally, this visualization would show not just the connections between streams, but also the relative size of those streams.
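Before turning to the visualization itself, here is a minimal Kafka Streams sketch of the kind of filter/branch/merge topology described above. It is illustrative only: the topic names, the `OrderPipeline` class, and the string-matching predicates are assumptions standing in for whatever schemas and logic a real order system would use.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderPipeline {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Read the raw order events (topic name is a placeholder).
        KStream<String, String> orders = builder.stream("orders");

        // Filter: keep only paid orders. The string check is a stand-in
        // for a real predicate against the order schema.
        KStream<String, String> paidOrders =
                orders.filter((orderId, payload) -> payload.contains("\"status\":\"PAID\""));

        // Branch: split paid orders into domestic and international shipments.
        KStream<String, String> domestic =
                paidOrders.filter((orderId, payload) -> payload.contains("\"country\":\"US\""));
        KStream<String, String> international =
                paidOrders.filterNot((orderId, payload) -> payload.contains("\"country\":\"US\""));

        // Merge: both branches feed a single shipments topic, so the shape
        // and volume of "shipments" differ from the original "orders" stream.
        domestic.merge(international).to("shipments");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Even in a topology this small, the data that lands in the downstream topic no longer looks like the data that entered the pipeline, which is exactly the kind of relationship a lineage view needs to capture.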
From such a visualization, we'd want to be able to drill deeper into any individual point for additional information. If we are auditing our Personally Identifiable Information, for example, we'd want to view the schemas of the data at each point so we can tell what data is being propagated through the stream. If we were trying to locate a bottleneck, we'd want to see metrics such as throughput or consumer lag. We may also be interested in what the system looked like in the past, in case the bottleneck is a recent development.

In Confluent Cloud, we can gain these types of insights from the Confluent Stream Lineage. The Stream Lineage provides a map of the data as it flows through the system, with a detailed visualization of each producer, consumer, and topic. We can see not just the endpoints of the data, but also the relative amount of data flowing through each of those endpoints. We can then drill down on any point to get detailed information and metrics such as throughput or consumer lag (a programmatic way to compute the same lag figure is sketched at the end of this section), and we can even rewind time to see what the lineage looked like in the past. As a result, the Stream Lineage can be critical in fulfilling our Stream Governance needs.

If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
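As referenced above, here is a minimal sketch of how a consumer-lag figure like the one shown in the Stream Lineage UI could be cross-checked programmatically with Kafka's Admin client. The bootstrap server address and the `shipment-service` consumer group id are placeholder assumptions, not values from the example system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Broker address is a placeholder for your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the consumer group we are inspecting
            // ("shipment-service" is a hypothetical group id).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("shipment-service")
                         .partitionsToOffsetAndMetadata()
                         .get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

            // Lag per partition = end offset minus committed offset.
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```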