Stream Lineage

6 min

Wade Waldron

Principal Software Practice Lead

Overview

The Confluent Stream Lineage is a tool that allows us to see, at a glance, how our data streams fit together to form larger pipelines. It combines details from our producers, consumers, and topics to give an overall picture of the system. It also integrates with various metrics to allow us to dive deeper into any individual component. In this video, we'll introduce you to the Stream Lineage. We'll show you how to set up a simple lineage and teach you to interpret some of the data it provides.

Topics:

Confluent Stream Lineage
Creating a Simple Lineage
Connections and Branch Points
Schemas and Metrics
Point-In-Time Lineage
Lineage Search

Resources

Confluent Stream Lineage

Do you have questions or comments? Join us in the #confluent-developer community Slack channel to engage in discussions with the creators of this content.

Use the promo codes GOVERNINGSTREAMS101 & CONFLUENTDEV1 to get $25 of free Confluent Cloud usage and skip credit card entry.

Stream Lineage

Once our system reaches a certain scale, understanding how the pieces fit together becomes challenging. It's too big a problem for a human to solve, and even if we could, there's no guarantee we'd be operating with current data. Thankfully, the Confluent Stream Lineage is always working to solve the puzzle by mapping the data flow as it happens.

Because the Stream Lineage is always on, there's no need to enable it or configure it. From the moment we start pushing data into the first topic, the lineage is working. At any point, we can select the Stream Lineage in the Confluent Cloud Console and see how the data is flowing through the streams. Often when working with components in Confluent Cloud there is a link for "See in Stream Lineage". Selecting that link will take us directly to the relevant component in the lineage.

However, this is a lineage of the data. If there's no data, then the system can't build a lineage. That means if we want to see the lineage we need to start producing data. Once data is flowing, the lineage will begin to populate. It doesn't happen instantly, so be patient.

Here we see a simple lineage containing a Producer, Topic and Consumer. Each element of the lineage is given the appropriate labels. The client ID for the producer and consumer, as well as the name of any topics, will also appear. Because these names are what we see in the lineage, it's important that we name our components wisely.

Each element is connected by arrows representing the flow of data. However, the arrows are different sizes. The size represents the relative number of messages flowing through each stage in the pipeline. The difference in size suggests there may be an imbalance. This may not be anything to worry about. The imbalance may be temporary as our system reacts to changes in load, or we may have operations that alter the number of messages at a particular stage. But sometimes it can signal a real problem. In this case, our consumer should process every message. This is a message in, message out situation.
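
The lineage shows this balance visually, through arrow size. As a rough sketch of the same idea in code, a message in, message out expectation could be expressed as a ratio check over some time window. The function names, the message counts, and the 5% tolerance below are all hypothetical and are not part of the Stream Lineage itself.

```python
def flow_ratio(messages_in: int, messages_out: int) -> float:
    """Fraction of incoming messages that left the stage in the same window."""
    if messages_in == 0:
        return 1.0  # nothing came in, so nothing can be missing
    return messages_out / messages_in


def looks_imbalanced(messages_in: int, messages_out: int,
                     tolerance: float = 0.05) -> bool:
    """True when the stage deviates from 1:1 by more than the tolerance.

    Only meaningful for stages where every message in should produce
    exactly one message out (an assumption about the stage, not a rule).
    """
    return abs(1.0 - flow_ratio(messages_in, messages_out)) > tolerance


# Hypothetical counts for a single window.
print(looks_imbalanced(600, 598))  # False: within tolerance
print(looks_imbalanced(600, 513))  # True: worth a deeper look
```

A stage that filters, aggregates, or splits messages would need a different expected ratio, which is why an imbalance on its own is only a prompt to investigate.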
The fact that there is an imbalance warrants a deeper look. In situations like this it's often a good idea to look at the consumer lag. Thankfully, we can jump straight to the consumer by clicking it. From there, we can select the "Consumers" tab to see the lag. Our producer is creating an average of one message per second. The recorded lag of 87 messages implies our consumer is 87 seconds behind. That's a lot. This suggests we have found a bottleneck in our system. From here, we'd want to investigate deeper and either try to improve the efficiency of the consumer, or scale out with more instances.

Here we see an expanded pipeline with the bottleneck removed. We can see that our payment service is now pushing data to new topics: PaymentSucceeded and PaymentFailed. Our lineage shows this as a branch in the flow. We can have both fan-in branches and fan-out branches. In this case, we are seeing a fan-out branch. Our consumer has now been relabeled as a custom app, because it is both a producer and a consumer.

Once again we see arrows of different sizes, in this case between the two new topics. However, we aren't going to worry about this imbalance. We expect our system to produce more successes than failures, therefore the imbalance is what we want. Of course, if the imbalance suddenly reversed so that we saw more failures than successes, it would be unexpected and we'd want to look into it.

Clicking on a topic gives us a variety of useful information, including the schema. This can help us understand how information flows from one topic to the next. For example, if we were auditing the use of personal information in our system we might look at the order created topic, select the schema and see that it contains the tag PII for Personally Identifiable Information. However, downstream schemas don't contain that tag. Therefore, in this case our PII is not being propagated downstream.

Each element in the lineage contains a variety of metrics.
This includes things like the number of producers and consumers, throughput, and consumer lag. This information is critical to understanding the system. It's worth taking some time to become familiar with what is available.

One powerful feature of the lineage is the ability to rewind and see what it looked like in the past. Point-in-time lineage allows us to customize the time window we are viewing. By default, the lineage shows data for the last 10 minutes, but with point-in-time lineage, we can adjust that to a variety of different periods going back minutes, hours, or even days. We can use this to look back at the lineage and see how it has changed over time. For example, if we notice that one of the streams is unbalanced, we can rewind the lineage to see if that is normal or a new development.

Of course, all of this assumes we can locate the portion of the lineage we are interested in. The lineage search feature allows us to quickly locate specific elements and jump directly to them. This can be incredibly valuable in a large lineage with many moving parts. For example, in a system with hundreds, or even thousands, of producers, how would you find the specific one you were looking for? With lineage search enabled, it becomes trivial.

The human brain is capable of finding and recognizing visual cues, but we need to give it data to work on. Taking a few moments out of every day to look at the Stream Lineage can build a baseline in our minds. The next time we open the lineage, if something is abnormal we may be able to recognize a change in the pattern. When we see these types of unexpected changes it's well worth our time to investigate. We may have just discovered an unreported issue, or even a security concern. Using the lineage to develop this kind of deeper understanding can help us avoid future data governance issues.
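
The lag arithmetic from the bottleneck example earlier (87 messages of lag at about one message per second) can be written out as a small sketch. It assumes a steady production rate. The function names and the consumer rate of four messages per second are hypothetical; only the figures 87 and one message per second come from the example.

```python
from typing import Optional


def seconds_behind(lag_messages: int, produce_rate_per_sec: float) -> float:
    """Estimate how far behind the consumer is, in seconds.

    Assumes messages arrive at a steady rate, so each message of lag
    corresponds to 1 / produce_rate_per_sec seconds of delay.
    """
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return lag_messages / produce_rate_per_sec


def seconds_to_catch_up(lag_messages: int,
                        produce_rate_per_sec: float,
                        consume_rate_per_sec: float) -> Optional[float]:
    """Estimate how long the consumer needs to clear the lag.

    Returns None when the consumer is no faster than the producer,
    because the lag would then never shrink.
    """
    surplus = consume_rate_per_sec - produce_rate_per_sec
    if surplus <= 0:
        return None
    return lag_messages / surplus


print(seconds_behind(87, 1.0))            # 87.0
print(seconds_to_catch_up(87, 1.0, 4.0))  # 29.0
print(seconds_to_catch_up(87, 1.0, 1.0))  # None
```

The second function illustrates why the two remedies mentioned earlier help: a more efficient consumer or additional instances both raise the consume rate above the produce rate, which is what lets the lag drain.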
If you aren't already on Confluent Developer, head there now using the link in the video description to access the rest of this course and its hands-on exercises.
