Unlock the secrets to complex schema evolution with Gilles Philippart! Discover how migration rules can transform what is usually a daunting task into a painless process with no breaking changes.
Software Practice Lead
Learn more about complex schema migration rules
Intro
Hi, I'm Gilles from Confluent. Dealing with complex schema evolutions in Kafka can be a real headache, right? When you want to refactor a schema and make incompatible changes, it's an ordeal: setting up a new topic, moving consumers over, syncing everything up...and there's no easy way back. Well, I've got some good news. I'm going to show you how to use Migration Rules to make these complex schema changes in the same topic, with zero disruption for your consumers. It's simpler, cleaner, and a total game changer. So grab your coffee and let's get into it.
Schema Evolution
A key aspect of Data Contracts is schema evolution. After the initial schema is defined, it may need to evolve over time as applications change. When this happens, it's critical to be able to continue producing and consuming the data without disruption. You can evolve schemas by publishing a new version with a new structure, but depending on your compatibility settings, you are only allowed a limited set of changes.
Schema Registry compatibility modes
Let's have a quick refresher. If you choose BACKWARD, which is the default compatibility mode, you can only delete fields and add fields with default values. The consumers can always use the latest version of the schema to read old data, which makes this mode very useful for backfilling consumers, which must be able to read all the data since the beginning, for example for storing it into a data lake or for machine learning scenarios. If there is a mistake downstream, it's very straightforward to rewind the consumer to any previous position in the stream and process the data again. The drawback is that you need to update the consumers before updating the producer. When your team only controls the producer and must update its schema, that's not practical. For example, if you share your data with a dozen teams, waiting for all of them to adjust their consumers to the new schema is unrealistic. In such situations, using FORWARD is more advisable. In this mode, you can roll out the new schema used by your producer. Consumers will still use their old schema and will be able to read new messages. These consumers can be updated with the new schema at a later time. The downside is that these consumers are not able to read old data, which can be a challenge if you have a long retention period for the data in your stream. There is also the FULL compatibility mode, which only lets you add and delete fields with default values. Those changes are both backward and forward compatible, so you can update producers and consumers in any order, and the consumers can read both old and new data alike. This sounds like the perfect solution, but unfortunately it allows for a very limited set of changes, and it makes all the fields optional, which isn't ideal for data quality. Okay, so whatever mode you choose, are you doomed to live with your decisions from the past forever? Are you ever going to be able to refactor your schemas? Well, now you can, thanks to migration rules.
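As a quick aside, here is a minimal sketch of the kind of change FULL does allow, on a hypothetical 'Customer' Avro schema (the record and field names are illustrative, not from the video): the new field carries a default value, so old readers can ignore it and new readers can fill it in when reading old data.

```json
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "loyalty_tier", "type": ["null", "string"], "default": null}
  ]
}
```

Deleting 'loyalty_tier' again later would also be a FULL-compatible change, because readers that still expect the field can fall back to its default.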
Breaking change example
Imagine you have the following Avro schema to record customer membership information. In this schema, there are three required fields: a ‘user_id,’ a ‘start_date,’ and an ‘end_date,’ the latter two of which represent the period during which the membership is valid. Let's say you want to move the ‘start_date’ and ‘end_date’ fields under a property called ‘validity_period’ and rename them. Well, that change would not be allowed by the Schema Registry because it's a breaking change: whatever compatibility mode you've selected, you can't both delete required fields and add new ones. Fortunately, you can now evolve schemas in an incompatible manner thanks to migration rules, which are part of the new Data Quality Rules capability. It's available in Confluent Cloud and Confluent Platform Enterprise. Let's see how it works. Confluent has introduced the concept of compatibility groups in the Schema Registry.
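Before moving on, here is a rough sketch of the two schema shapes we just discussed. The record names and field types are my assumptions; the video only names the fields, and the renamed fields ‘from’ and ‘to’ are introduced a bit later.

```json
{
  "type": "record",
  "name": "Membership",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "start_date", "type": "string"},
    {"name": "end_date", "type": "string"}
  ]
}
```

And the refactored, incompatible version:

```json
{
  "type": "record",
  "name": "Membership",
  "fields": [
    {"name": "user_id", "type": "string"},
    {
      "name": "validity_period",
      "type": {
        "type": "record",
        "name": "ValidityPeriod",
        "fields": [
          {"name": "from", "type": "string"},
          {"name": "to", "type": "string"}
        ]
      }
    }
  ]
}
```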
Compatibility groups
You can create a compatibility group by assigning a schema to it. You do that by adding a field of your choice in the metadata section of your schema. You can then configure the Schema Registry to use this field to perform compatibility checks only between versions within the same group. It means that you can create a new version of the schema and introduce a breaking change, as long as it's assigned to a new compatibility group. Here we assign schema V4 to a new compatibility group identified by the ‘major_version’ property set to 2. Of course, consumers bound to one group still need to read data written with the other, and that's where migration rules come in. A migration rule holds a small piece of code which allows you to transform the data from one group to another. You can create several such rules and register them in the Schema Registry. When you do so, a new version of the schema is created with the rules attached. Consumers will use this code to automatically transform the data at deserialization time, upgrading the data from a previous compatibility group to the next one or downgrading it the other way around as needed. So what is a migration rule made of?
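Before answering that, here is a hedged sketch of the compatibility-group setup itself; the exact payload shapes are my reading of the Data Contracts API and may differ slightly between Schema Registry versions. First, the subject config names the metadata property that defines the group:

```json
{ "compatibilityGroup": "major_version" }
```

Then each schema version declares its group in its metadata when it is registered (the "schema" value is elided here):

```json
{
  "schemaType": "AVRO",
  "metadata": { "properties": { "major_version": "2" } },
  "schema": "..."
}
```

With this in place, compatibility is only checked against earlier versions whose ‘major_version’ is also 2, so the breaking change is accepted.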
Anatomy of a migration rule
A migration rule is just a JSON document with a few properties. Let's have a look at a really basic example. Imagine we want to create a new field called ‘fullname,’ which concatenates the ‘firstname’ and the ‘lastname’ fields in the message payload. First of all, we give the rule a readable name. Next, we specify the kind of rule as ‘TRANSFORM.’ There's another kind of rule called ‘CONDITION,’ which I talked about in my video on Domain Validation rules. In the type field, we set the value “JSONATA”, which is the programming language we're going to use to transform the data. Then we set the mode to UPGRADE to allow consumers using the new incompatible schema to read old messages. Note that you can write a DOWNGRADE rule too, to convert messages from the new format to the old one so that consumers which use an old schema can read new data. Lastly, we write the JSONata expression. Here, we create an object with a derived property called ‘fullname,’ which is the concatenation of the ‘firstname’ and ‘lastname’ properties. The ‘merge’ function adds this property to the message. Going back to our initial example, it's a bit more involved. We need to remove the ‘end_date’ and ‘start_date’ properties from the message. We also need to add the properties ‘from’ and ‘to,’ nest them under the ‘validity_period’ field, and set their values to ‘start_date’ and ‘end_date’ respectively. We are going to do that by using two JSONata functions. To remove the ‘end_date’ and ‘start_date’ properties, we use the higher-order function called ‘sift,’ passing it a predicate function to filter out the two fields. Then, to add the two new properties, we create an object with them and pass it along with the previous result to the ‘merge’ function. JSONata has dozens of functions and operators to transform your data. Now let me show you migration rules in action.
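Before the demo, here is roughly what the two rules for our membership example might look like end to end. The rule-set wrapper and property names follow the Data Quality Rules format as I understand it, so treat this as a sketch rather than a copy-paste artifact; the JSONata expressions implement the ‘sift’ and ‘merge’ approach described above, and the rule names are my own.

```json
{
  "ruleSet": {
    "migrationRules": [
      {
        "name": "move_dates_under_validity_period",
        "kind": "TRANSFORM",
        "type": "JSONATA",
        "mode": "UPGRADE",
        "expr": "$merge([$sift($, function($v, $k) {$k != 'start_date' and $k != 'end_date'}), {'validity_period': {'from': start_date, 'to': end_date}}])"
      },
      {
        "name": "flatten_validity_period_back_to_dates",
        "kind": "TRANSFORM",
        "type": "JSONATA",
        "mode": "DOWNGRADE",
        "expr": "$merge([$sift($, function($v, $k) {$k != 'validity_period'}), {'start_date': validity_period.from, 'end_date': validity_period.to}])"
      }
    ]
  }
}
```

The DOWNGRADE rule is the mirror image of the UPGRADE rule: it strips ‘validity_period’ and re-creates the flat ‘start_date’ and ‘end_date’ fields so that consumers on the old schema keep working.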
Demo
Let's see how migration rules work. We start by creating the memberships topic with the Confluent CLI. Next, we register the first version of the schema, the one with ‘start_date’ and ‘end_date’ at the top level. We create the first compatibility group by setting the ‘major_version’ property to 1. Let's start the consumer, which has been configured to only use major version 1. We do that by passing the property called ‘use.latest.with.metadata.’ The writer ID is displayed after each message. We are also going to start the producer using the major version 1 format. Let's produce a few messages. Okay, the consumer receives the messages in the same format as the producer. If we try to register the new version of the schema, which has a breaking change, it is denied because we haven't told the Schema Registry to use compatibility groups. Let's do that. If we try to register again, it's now successful and the new schema is assigned to a new group with ‘major_version’ set to 2. I've written two migration rules in a JSON file: an UPGRADE rule to convert the data from major version 1 to major version 2, and a DOWNGRADE rule for the other way around. Let's register them. Now, let's start a consumer which will use major version 2. If we go to the producer V1 window and produce messages in the old format, we can see that consumer V2 receives them in the new format. The transformation is done by the UPGRADE migration rule. Let's start a new producer which uses the V2 schema and produce a few messages. We can see that consumer V1 can receive these V2 messages in the old format. The transformation is done by the DOWNGRADE migration rule. Both consumers automatically receive messages in their preferred format. You can now evolve your schemas as much as you like without the stress of breaking anything. Migration rules only support Java clients for now, but support for .NET and Python is coming soon. If you've got any questions about migration rules, leave a comment and I'll do my best to get back to you. Don't forget to hit the like, share, and subscribe buttons if you want to see more videos like this one. See you next time.
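For reference, here is a hedged sketch of the consumer-side settings the demo relies on. The ‘use.latest.with.metadata’ property comes from the Confluent Java serializers, but the exact value syntax and the registry URL below are illustrative assumptions.

```properties
# Connect the (de)serializer to Schema Registry (placeholder URL)
schema.registry.url=https://<your-schema-registry>
# Pin this client to the latest schema version whose metadata has major_version=1;
# messages written with the other compatibility group are converted by the migration rules
use.latest.with.metadata=major_version=1
```

A consumer pinned to major_version=2 would use the same property with the value major_version=2, and the Schema Registry-aware deserializer applies the UPGRADE or DOWNGRADE rule as needed.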