Wade Waldron

Staff Software Practice Lead

Stream Quality

Overview

Large systems of streaming data involve multiple teams working on different components. To ensure effective communication between these components, we need an API for our streams. The API often takes the form of a schema. However, if we want these teams to be able to use each other's APIs, we need a central authority where those APIs can be managed and explored. In this video, we will explore the challenge of using APIs across multiple teams and introduce the Confluent Schema Registry, which can be used to manage and enforce schemas.

Topics:

  • Cross-Team Compatibility
  • Stream APIs
  • Schemas
  • Confluent Schema Registry

Code

Customer Record: Before

{
	"name":"John Smith",
	"phone":"(123)456-7890"
}

Customer Record: After

{
	"name":"John Smith",
	"phone":[
		{
			"type":"mobile",
			"number":"(123)456-7890"
		},
		{
			"type":"work",
			"number":"(123)987-6543"
		}
	]
}
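
To make the impact of that change concrete, here is a minimal Python sketch of a consumer written against the "before" record. The function name `phone_digits` and the idea of stripping formatting from the number are assumptions made for illustration; they are not part of the course material.

```python
import json

# The two record shapes shown above.
BEFORE = '{"name": "John Smith", "phone": "(123)456-7890"}'
AFTER = """{"name": "John Smith", "phone": [
    {"type": "mobile", "number": "(123)456-7890"},
    {"type": "work", "number": "(123)987-6543"}
]}"""


def phone_digits(record_json: str) -> str:
    """Hypothetical consumer step: strip formatting from a single phone string."""
    phone = json.loads(record_json)["phone"]
    if not isinstance(phone, str):
        # The consumer was written against the old shape and cannot proceed.
        raise TypeError(f"expected 'phone' to be a string, got {type(phone).__name__}")
    return "".join(ch for ch in phone if ch.isdigit())
```

Calling `phone_digits(BEFORE)` succeeds, while `phone_digits(AFTER)` raises a `TypeError` because the field is now a list. Without an agreed schema, a failure like this would likely only surface at runtime in the downstream consumer.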

Customer Schema

{
	"name":"Customer",
	"type":"record",
	"fields":[
		{ "name":"name", "type":"string" },
		{
			"name":"phone",
			"type":{
				"type":"array",
				"items":{
					"type":"record",
					"name":"PhoneNumber",
					"fields":[
						{ "name":"type", "type":"string" },
						{ "name":"number", "type":"string" }
					]
				}
			}
		}
	]
}
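
As a rough illustration of what checking a record against this schema means, the sketch below hand-writes a checker for only the three Avro constructs used here (`string`, `array`, `record`). The function `conforms` is hypothetical and deliberately incomplete; in practice an Avro library would do this work, and this is not how the Schema Registry itself validates data.

```python
# The Customer schema from above, written as a Python dict.
CUSTOMER_SCHEMA = {
    "name": "Customer",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {
            "name": "phone",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "PhoneNumber",
                    "fields": [
                        {"name": "type", "type": "string"},
                        {"name": "number", "type": "string"},
                    ],
                },
            },
        },
    ],
}


def conforms(value, schema) -> bool:
    """Check a decoded JSON value against a tiny subset of Avro (assumption: no unions, defaults, or other types)."""
    if schema == "string":
        return isinstance(value, str)
    if isinstance(schema, dict) and schema.get("type") == "array":
        return isinstance(value, list) and all(
            conforms(item, schema["items"]) for item in value
        )
    if isinstance(schema, dict) and schema.get("type") == "record":
        if not isinstance(value, dict):
            return False
        fields = schema["fields"]
        # This sketch requires exactly the declared fields, since none has a default.
        if set(value) != {field["name"] for field in fields}:
            return False
        return all(conforms(value[f["name"]], f["type"]) for f in fields)
    raise ValueError(f"schema construct not covered by this sketch: {schema!r}")
```

With this checker, the "after" customer record conforms to the schema, while the "before" record does not, because its `phone` field is a string rather than an array of `PhoneNumber` records.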

Stream Quality

Imagine we have built a large-scale system of data streams. Some of those streams might be managed by our team; however, many of them will be managed by someone else. Those other teams may have very different standards and processes. If we want to make use of the data coming from those teams, we are going to need to establish a standard for communicating with them. Otherwise, it becomes difficult to guarantee the quality of the data that we are receiving.

Let's consider a concrete example. Our team is responsible for producing data about online orders. As part of that process, we sometimes want data about the customer placing the order. For example, we may want to display the data on an invoice. However, the customer data is managed by another team. That team produces data streams that contain the data we need. Unfortunately, that data changes constantly. Just last week, they modified the customer record to include an array of phone number objects rather than a single string. If it wasn't done carefully, that could break our downstream consumer, because we process the phone number as a single string rather than as an array. These are the types of problems we wish to avoid.

What we need is some kind of standard contract both teams can agree on. This contract typically takes the form of a schema. If the producer needs to make a change to the schema, they must ensure the change is compatible with downstream consumers. At the same time, they need to advertise those changes so that consumers know what the new format will look like. We can use message formats such as Avro or Protobuf, both of which include the capability of defining a schema. JSON doesn't have schemas built in, but the separate JSON Schema specification fills that gap. We can then build a schema that will outline the exact details of what a message might look like.

However, by itself, that's not enough. Once we establish the schema and share it with the individual producers and consumers, what is to stop them from making changes without letting everyone know? To ensure the schemas don't drift, we need a central authority to manage them. This central authority can help us build trust in the schemas. Used properly, it can provide a guarantee that the schemas will remain compatible, even as they evolve. We'll no longer need to worry that someone might change the schema without us knowing.

This is where the Confluent Schema Registry comes in. It provides tools to manage and enforce schemas. This allows us not only to advertise the schema used for a stream, but also to enforce compatibility rules as it evolves. It can also be linked to schema validation to guarantee that all published data matches the schema. By leveraging the schema registry, we can build trust in our data streams and feel confident that they meet the quality standards we desire.

If you aren't already on Confluent Developer, head there now, using the link in the video description, to access the rest of this course and its hands-on exercises.
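
As a sketch of how a producer might ask the registry whether a proposed schema is compatible before publishing it, the Python below builds a request for the Schema Registry REST compatibility endpoint. The registry URL `http://localhost:8081`, the subject name `customers-value`, and the helper names are assumptions for illustration; authentication, which a hosted registry would normally require, is left out.

```python
import json
import urllib.request


def build_compatibility_request(registry_url, subject, schema, version="latest"):
    """Build (but do not send) a compatibility check against an existing subject version."""
    # The registry expects the schema itself as a JSON-encoded string inside the body.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{registry_url}/compatibility/subjects/{subject}/versions/{version}",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


def is_compatible(request) -> bool:
    """Send the request; needs a reachable registry, so it is not called in this sketch."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["is_compatible"]


# Hypothetical usage: the proposed schema is abbreviated to a single field here.
proposed = {
    "name": "Customer",
    "type": "record",
    "fields": [{"name": "name", "type": "string"}],
}
request = build_compatibility_request(
    "http://localhost:8081",  # assumed local registry address
    "customers-value",        # assumed subject name
    proposed,
)
```

Passing `request` to `is_compatible` would return the registry's verdict for the subject's configured compatibility level. A producer could run a check like this in its build pipeline so that an incompatible change is caught before any consumer sees it.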
