Dive into streaming data quality with Gilles Philippart! Learn to keep bad data out with Confluent's Data Quality Rules.
Gilles Philippart, Software Practice Lead
Intro
Hi everyone, I'm Gilles from Confluent. I don't know about you, but I get really upset when crappy data makes its way into my system. We're all increasingly sharing data across teams or even partner companies, so it has become super important to enforce the quality of the data that you ingest, process, and generate. But in the world of data in motion and streaming, bad data can spread faster and to more places, potentially causing downstream applications and services to crash. Traditionally, teams have spent time writing code in their consumers to handle data inconsistencies and incompatible changes, or building expensive and complicated transformations just to process the data reliably. In modern data architectures, we want to shift that responsibility to the source of the data to increase its quality. The best way to do that is to create data contracts.
What is a Data Contract?
But what is a data contract? In a nutshell, it's a formal agreement between a data provider and a data consumer. A key tenet of data contracts is the ability to express and enforce both the structure and the semantics of the data. A contract can also describe governance aspects: for example, you can document the owner of the data, as well as the terms of usage, such as the notice period before a change or the use cases the data can safely be used for. In Kafka land, the data provider is a Kafka producer and the data consumer is a Kafka consumer, and you can think of a schema as the Kafka implementation of a data contract. We use schemas to enforce the structure and make sure that fields have the right types, for example a string, a long, or an array of values. But schemas are quite limited if you want to enforce semantics and further constrain the values of those fields, for example verifying that an email address field contains a valid email, or that the age of a customer is greater than zero. Let's take a look at an example. Imagine you have an Avro schema to record customer membership information, something like the sketch below. There's a start_date and an end_date to represent the period during which the membership is valid; note that both have the date logicalType. It also has an email address field and the Social Security number of the user. Using the Membership schema we've just defined, we can successfully produce a record where all the fields are valid.
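Here's a hedged sketch of what that Membership schema could look like in Avro. The start_date and end_date fields with the date logicalType come straight from the description above; the namespace and the exact names of the email and Social Security number fields (email and ssn here) are assumptions made for the example, not necessarily the ones used in the video.

```json
{
  "type": "record",
  "name": "Membership",
  "namespace": "io.confluent.demo",
  "fields": [
    { "name": "start_date", "type": { "type": "int", "logicalType": "date" } },
    { "name": "end_date",   "type": { "type": "int", "logicalType": "date" } },
    { "name": "email",      "type": "string" },
    { "name": "ssn",        "type": "string" }
  ]
}
```

All four fields are required, which matters for the schema-level checks discussed next.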
Semantically Invalid Messages
It has a valid start and end date, a valid email address, and a valid Social Security number. But this next message, quite rightly, could not be serialized and produced: the email isn't of the right type, since the Avro schema expects a string and we're passing in a number. In this other case, the mandatory Social Security number is missing; again, this isn't a valid message according to the schema, as all the fields are required. Well, it seems like those Avro schemas really have your back, right? Now, what do you think would happen if you wanted to produce this message? From a technical standpoint it conforms to the Avro schema, but can you spot the problems? The email isn't well formed; I'm sure you spotted that one. Next, the Social Security number isn't valid either, at least not by our expectations. And wait, what's this? Both dates are valid on their own, but the membership data itself is semantically invalid because the end_date is before the start_date. If a message like this goes through, downstream consumers will have a hard time sending emails to the user or using the Social Security number to uniquely identify them. So what can we do?
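To make this concrete, here's a rough sketch of how such a schema-valid but semantically broken record might be built in Java. It assumes a Membership class with a typical builder API, for example one generated from the schema sketched earlier with the date logicalType mapped to java.time.LocalDate; it isn't the exact code from the video.

```java
import java.time.LocalDate;

// Illustrative only: assumes a Membership class with String and LocalDate accessors,
// e.g. generated from the Avro schema sketched above.
public class InvalidMembershipExample {
    public static void main(String[] args) {
        // Every value satisfies the schema's types, yet the record is semantically broken.
        Membership badMembership = Membership.newBuilder()
                .setStartDate(LocalDate.of(2023, 11, 1))
                .setEndDate(LocalDate.of(2023, 1, 1))   // end_date is before start_date
                .setEmail("john.doe#gmail")             // malformed email, but still a valid string
                .setSsn("123-45")                       // not a valid Social Security number
                .build();
        System.out.println(badMembership);
    }
}
```

The Avro serializer will happily accept this record, because every constraint it knows about, namely the field types, is satisfied.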
Traditional Workarounds
Let's take a look at our diagram again. We could add some validation logic on the producer side if we don't want to send a potential "poison pill" that downstream consumers would choke on. Another common option is for consumers to add validation code for input topics they don't fully trust, to increase reliability. Those consumers then skip messages that fail validation and maybe put them on a dead letter queue for further analysis, as sketched below. The problem with these approaches is that the data contract gets fragmented: it now lives in three different places, in the schema for the structure, and in the producer and the consumer for the semantics. Evolving the data contract becomes difficult, because you need to agree on and coordinate changes to the schema and to the validation code in both the producer and the consumer. What if there was a better way? What if we could keep that validation logic in one place?
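Before getting to that better way, here's a rough sketch of what the consumer-side workaround typically looks like. The Membership class, its accessors, and the memberships_dlq topic name are assumptions made for illustration rather than code from the video.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the traditional workaround: every consumer re-implements the semantic checks
// and parks bad records on a dead letter queue topic for later analysis.
public class DefensiveMembershipConsumer {

    private static final String DLQ_TOPIC = "memberships_dlq"; // hypothetical DLQ topic name

    void pollAndValidate(Consumer<String, Membership> consumer,
                         Producer<String, Membership> dlqProducer) {
        for (ConsumerRecord<String, Membership> record : consumer.poll(Duration.ofMillis(500))) {
            Membership m = record.value();
            boolean valid = m.getEmail().matches(".+@.+\\..+")
                    && m.getSsn().matches("[0-9]{3}-[0-9]{2}-[0-9]{4}")
                    && !m.getEndDate().isBefore(m.getStartDate());
            if (valid) {
                process(m);
            } else {
                // Skip the bad record and park it for later analysis.
                dlqProducer.send(new ProducerRecord<>(DLQ_TOPIC, record.key(), m));
            }
        }
    }

    private void process(Membership m) {
        // Business logic for valid memberships goes here.
    }
}
```

Multiply this by every consumer of the topic and every semantic rule, and it's easy to see how the checks drift apart over time.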
Data Quality Rules Can Keep Bad Data Out
Well, Confluent has built a new set of capabilities right inside the Schema Registry client to catch these kinds of problems: Data Quality Rules. In this video, I'll focus on the first capability, domain validation and event-condition-action rules. For this demo, I've created a Confluent Cloud cluster with the Stream Governance package enabled.
Demo
Here I use the Confluent CLI to create the memberships topic and register the Avro schema. All right, let's write a simple Java producer, which will craft the invalid message and try to produce it to the memberships topic. As you can see, the email isn't well formed and the Social Security number isn't valid. If we can successfully produce the message, a log line will be printed to standard output; otherwise, an exception will be thrown. Let's run the producer. As you can see, the record has been written successfully to the topic, so, as we feared, downstream consumers will have to deal with the bad data.

To prevent this kind of low-quality data from getting through, let's write a domain validation rule in a JSON file. We add a condition using a regular expression to check that the format of the Social Security number is valid. Conditions are written in Google's Common Expression Language (CEL), which is both powerful and fast to evaluate. If a message fails the condition, the specified 'onFailure' action will automatically send it to a dead letter queue topic called 'bad_memberships'. Let's create the 'bad_memberships' topic, then find the Schema Registry URL and register the domain validation rule with the JSON file we've just created.

If we now run the producer again, it throws an exception indicating that the data contract isn't fulfilled. And if we inspect the 'bad_memberships' topic, we can see that the invalid message has been automatically sent there by the domain validation rule. We've just solved our data quality problem.
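For reference, here's roughly the shape such a domain validation rule file could take, based on the ruleSet format used by Confluent's data contracts. The rule name, the exact CEL expression, and the ssn field name are assumptions for illustration; check the Confluent documentation for the precise syntax before relying on it.

```json
{
  "ruleSet": {
    "domainRules": [
      {
        "name": "validateSsnFormat",
        "kind": "CONDITION",
        "type": "CEL",
        "mode": "WRITE",
        "expr": "message.ssn.matches('^[0-9]{3}-[0-9]{2}-[0-9]{4}$')",
        "onFailure": "DLQ",
        "params": {
          "dlq.topic": "bad_memberships"
        }
      }
    ]
  }
}
```

The producer's serializer evaluates the CEL condition on every write; a record that fails the check never reaches the memberships topic and is routed to bad_memberships instead.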
Closing
If you've got any questions about Data Quality Rules, leave a comment and I'll do my best to get back to you. Thank you so much for watching. Don't forget to hit the like and subscribe buttons if you want to see more videos like this one. I'll see you next time.