By popular demand, this week's Streaming Audio is all about schemas and schema evolution, because I've heard a lot of people asking for guidance on how you manage data long term as the shape of it changes, as the schema of your data evolves. So in this podcast, we've brought in one of our experts, someone with a lot of experience in the field, to go through it all: from the simple early stuff, like why you want to define a schema, what format you should choose for it, and whether you should choose the same format for the whole company, all the way up to the really juicy problems, like once you've got a topic with years' worth of data in it, how do you migrate that schema without downtime?
The expert we have in question is Abraham Leal. He's this week's guest and he's one of our customer success technical architects. And he spends most of his working time diagnosing schema problems and solving them, and he's picked up some interesting ideas from the best developers out there too. So we've done our best to take you through the whole process from the problems you'll face on day one to the problems you'll face on day five. This podcast, as ever, is brought to you by Confluent Developer, and I'll tell you more about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. Joining me today on Streaming Audio is Abraham Leal. Abraham, how are you doing?
Doing good, Kris. How are you doing?
I'm very well. Where are you coming in from the world?
I am out of Austin, Texas, so the most recent Current happened to be local to me.
Oh, very convenient, but less glamorous. You don't get to travel to a conference.
That's right. But you know what? I enjoyed it just as well.
Yeah, I probably get the same deal when I go to Kafka Summit London next year, it's basically on my doorstep.
There you go.
But we're not here to talk about conferences. We're here to talk about data contracts. So data contracts, I think they're basically the same idea as schemas and types, and we have a few different words for them in our industry. It's about being explicit about the shape of your data.
That's right. I think data contracts really encapsulate the whole concept of making sure that the data your organization produces from all its operations is reusable across the various business units the organization has. Before, we would think in terms of, "Oh, let me define my schema. Let me define my simple object for my program." Data contracts, to me, encapsulate the whole flow: "Hey, I need to define a schema, and I need to define how I'm going to make that available and how this data will eventually traverse my system." So it's kind of an encompassing name for things we already have, things that have always been possible; we've just given it a formal name.
Right. Yeah. So you are also, as well as the enforcement idea, you're talking about broadcasting the shapes to other places?
Yeah.
Yeah. Okay. So let's start from the very beginning of this story. I can imagine there are people using Kafka who have a JSON Blob and they just stick it on a topic and they're done. And if someone wants to reuse that data, they're just going to have to have a look at the fields in the JSON object.
Right.
I mean, it's a fine first step, but why would you push people to a second step?
Yeah. So that is the normal, most common way to get started with Kafka. In fact, if you go to the Confluent Cloud UI and you try to produce data from there, it will go in as a JSON Blob that is represented as a string in Kafka. It is the easiest way to understand what's going on, but it is definitely neither the most efficient nor the preferred way that an organization would want to do things, for multiple reasons. And I like to do things in threes, so I'm just going to give you three reasons, and if there are more, we'll get to them later. I would say reason number one is that a JSON Blob represented as a string is hugely inefficient for data transfer. It does not get encoded into a more efficient binary format the way other schema [inaudible 00:04:45] languages (not languages, but formats) do. So it is a bigger payload, containing repeated data that isn't essential, being sent over the network. It's more expensive because it's less efficient.
Number two is, well, now you have this data that you have sent in, and there is no real control over what kind of data will follow. So from one message to the other, the JSON Blob could change in a thousand different ways. Where you have five fields defined in one, there could be 10 in the other, et cetera, et cetera. And that's really where data contracts come in, in being able to enforce certain data to be part of a pipeline. With JSON Blobs, there is a JSON Schema format out there that you can use with actual JSON, but it's not what the normal, regular Kafka user will use whenever there's a blob. Normally they just send it as a string that will then be mapped into an object on the other end of the application. So it is inefficient, and it is not very governable, because there is no real contract around it.
And three, it is simply not the preferred data format that modern architectures have been built upon. JSON is, and for the foreseeable future seems to be, the language of the web. Between websites and services that interact on the web, it is the preferred format, and that's totally fine. However, when you're dealing with Kafka, Kafka often deals with backends that are a bit removed from that front end, and the preferred formats in those backends have evolved to ensure high efficiency and data quality. And that's just not something that JSON has done when it comes to regular Blobs.
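To make the contrast concrete, here is a minimal Java producer sketch that sends Avro through Confluent's serializer instead of a raw JSON string. The topic, schema, and addresses are illustrative; the JSON-blob approach Abraham describes would simply use a StringSerializer for the value instead.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class OrderProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // The value goes out as compact Avro binary; the schema lives in Schema Registry
            // rather than being repeated inside every message.
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081");

            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
              + "{\"name\":\"orderId\",\"type\":\"string\"},"
              + "{\"name\":\"amount\",\"type\":\"double\"}]}");

            GenericRecord order = new GenericData.Record(schema);
            order.put("orderId", "o-123");
            order.put("amount", 42.0);

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "o-123", order));
            }
        }
    }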
Yeah, because I mean, efficiency is an important thing, but I think everybody that's dealt with lots of JSON has had this process where they process 10,000 records from something and there's always one field that's always a string, and then suddenly one day it's null or a number and all your processing blows up. And the only way to know what shape the data actually has is to keep collecting more samples. You kind of want someone to officially say, "The definitive shape of this data is X."
And JSON has been around so long as the default format for communicating on the web that people have found ways around this type of problem. There's the OpenAPI Specification for describing what you can expect from an endpoint, with schemas defined for that endpoint so that you can understand it better. And then with the advent of GraphQL, you can be even more specific about what the output of a web service is going to be. But that's all on the front end, and it's kind of auxiliary tooling to help give some normalcy to the chaos. In the backend, we get to do even more fun things around, "Hey, what do we decide we want the data format to be? And what kind of controls do we want here?"
And this is where you get the other two popular options of Protobuf and Avro?
That's right. Yeah.
What are the advantages of those then?
Yeah, so Avro and Protocol Buffers, kindly shortened to Protobuf, are formats that were developed with the efficiency and speed in mind that you'd expect from microservices communication. Both of them are extremely good at serializing data into a compact binary format that doesn't carry the schema with every message, so that only the most essential data is transferred over the wire, whether you're doing microservice-to-microservice communication or asynchronous communication through Kafka. Both of them are really, really good at transferring only the data that is needed and decoupling the schema from the actual transfer, so that the users, the actual developers of the microservice, can handle the schema and interpretation of that data, because the only thing they really care about is the actual contents.
So this is the idea that if I write a million records with a company name field in JSON, then I've written the word company name a million times-
There you go.
... and there's not really any point to sending that over the network.
Exactly. Yeah.
Whereas in Protobuf you can pull that out to one place.
Exactly. Yeah, it would be super wasteful to do that, and to bring it back to JSON Blobs, that's exactly what you get there. Whereas Avro and Protobuf find more creative ways to be efficient. And they come with more than just that efficient binary format. They come with great schema enforcement rules. They come with great evolution rules. They allow you to be extremely explicit; specifically, Protobuf allows you to be extremely explicit around data quality inside certain fields because it's so extensible. So the format is not only there to be efficient, it's also grown a lot of helpful things that are very useful to have in the backend.
You missed out one of my favorites, which is you can take one of the descriptions and use it to generate class files for Java and Go and-
That's right.
... Haskell and all the different languages just from that one description, which is really cool.
Yeah, I did miss that one, and it's really one of the better ones. I guess comparing it back to the JSON Blob... well, the JSON Blob is the JSON Blob, right? So there's not really anything you get in your language. If you want to do anything useful, you end up having to map it yourself. That is something that's really good here, where you can create plain objects in any language through Avro or Protobuf helper tools. And these can be interpreted in .NET, Java, Go, whatnot. With a JSON Blob, you'd have to use something like a [inaudible 00:12:10] library to map it into an object and do anything useful with it.
And then probably you're guessing about the shape of the JSON Blob for that mapping.
That's right. Yeah.
It's one of those advantages that gets larger the more different languages you have in your company, I think.
That's right.
So that raises two questions, and you can pick which order to take them in: the evolution features of those two formats, and which one we should be using.
In my opinion, Protocol Buffers is really pulling ahead in the race when it comes to which format we should be using. And let me hedge my position a little bit by saying that if you are a company that has been around for a while, someone already at some point made a decision about this. And most likely, most of the company has held to this standard. At some point someone said, "Hey, we're just going to use a bunch of JSON objects. We're going to have an object repository. And if you want to implement [inaudible 00:13:28] something, just go and fetch it from there."
Or someone said, "Hey, we have a bunch of Avro definitions. Just go for it." If someone has done this and your company is already 90% adopted on Avro as the spec for your data transfer and encoding, you should go with that. You should not just go wild and utilize whatever you think is the cooler, newer tool. Because even more important than a tech problem, this is really a people problem whenever you're talking about data. Data quality really comes into play around... well, in the end it's all just us coding these microservices, so it ends up being a people problem. So if you have a standard, go with the standard; do not try to push your whole organization to adopt Protobuf.
Because there's huge value to just the agreement, the company-wide agreement. Right? And you wouldn't throw that away.
100%. Exactly.
However, if you're in a greenfield project...
Yeah, if you're in a greenfield project, I would go with Protobuf 100 out of 100.
Convince me, because I'm more of an Avro user myself, so sell me on it.
Yeah, no, for sure. So Protobuf is a more recent project, so they've been able to incorporate a lot of things that the Avro project did not get to. They have very nice-
Very, very nice evolutionary concepts around the order of fields, wherever it matters how you evolve the order of fields, because Protobuf fields are ID'd, right? They also have really good extensibility of the protocol itself. So you can do not only schema references in a very natural way, a more natural way, I would say, than you would do them in Avro. They're also possible in Avro, I'm not saying they're not, it's just simpler to understand. That extensibility also allows for data quality rules to be implemented in the protocol whenever you're defining schemas.
Give me some examples-
So there's a lot of extensibility. Yeah, yeah. For example, let's say that you define a field, and it's a credit card field, right? Well, you could say, "I want my credit card field to be a string." You wouldn't make it an integer because normally credit cards have dashes, though I guess if you don't care about dashes maybe you make it a long int, I don't know. But if you define it as a string, you can also say, "Well, for my string, I expect four sets of four numbers that are separated by dashes." Right? And you can make that enforceable as you encode and decode the data in this serialization format.
Whereas in Avro, these types of enforcement you kind of have to bolt on either upstream or downstream: wherever you're producing the data, or wherever you're consuming and interpreting it, that's where you augment it. Right? Whereas in Protocol Buffers, that extensibility is there for you to use.
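To picture the kind of check Abraham means, here is a minimal Java sketch of the upstream guard you would hand-roll in the Avro world, the sort of rule Protobuf's extensions can instead express in the schema itself. The field format and class name are illustrative.

    import java.util.regex.Pattern;

    public class CreditCardGuard {
        // Four groups of four digits separated by dashes, as in the example above.
        private static final Pattern CARD_FORMAT =
                Pattern.compile("^\\d{4}-\\d{4}-\\d{4}-\\d{4}$");

        public static void validate(String creditCard) {
            if (!CARD_FORMAT.matcher(creditCard).matches()) {
                // Reject the record before it ever reaches the topic.
                throw new IllegalArgumentException("credit card must look like 1234-5678-9012-3456");
            }
        }

        public static void main(String[] args) {
            validate("1234-5678-9012-3456"); // passes
            validate("not-a-card");          // throws
        }
    }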
Yeah. Relates to one of my pet peeves, which is strings are a terrible type in a way, because literally anything can go in a string. Right?
Exactly.
Is it the name of my user? Is it the complete works of Shakespeare? You need some way of narrowing that down.
Exactly.
Okay. And schema references you mentioned, what are they?
Yeah. So schema references are an extremely useful concept. Let's say that you have a schema for who an individual is, right? So an individual will have a first name, a last name, right? You're pretty sure about that. But addresses, they might have multiple. Not only might they have multiple, an address might have different sections that you wouldn't normally want to put in as a single string, because, as you say, strings can be whatever you want. You'd like some structure around it.
So schema references allow you to create complex objects for what you normally consider a field within an object, and then injecting them to create a more complex nested object. So in the example that we talked about, the individual, I could define the first name and the last name as strings, but then in addresses I could define that as an array of an address, right?
Right, so you're defining custom subtypes in a way.
Exactly. Exactly. And the address itself will be a type that is a schema in itself, that is nowhere defined within the schema that I'm defining for the individual, but it is rather defined somewhere else. And I can tell it, "Hey, go fetch it from here." Right? Protobuf does this very nicely natively. Avro is okay at it. The Confluent Schema Registry makes it nicer for Avro. But yeah, they're very helpful not only for making more concise objects that are more easily human-readable, but when it comes to evolution and long-term sustainability of this data, being able to evolve these separate entities individually is extremely powerful. It makes for a nicer story around the governance of your data.
Yeah, I can imagine that. By the time you've got three different objects that include an address then you've got a maintenance problem unless you can change it in one place. Right?
Exactly. Yeah. So now an address is a single object for your whole organization. You got the payment service over here and you got the shipping service over here. They both won't have two different definitions of an address, they'll both have a single address definition that is centralized and then given to them. Yeah.
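As a rough illustration of the idea, here is a sketch using Avro named types parsed in Java (Abraham notes Protobuf handles references even more naturally). The record and field names are made up; with Confluent Schema Registry, the Address schema could also be registered once and pulled in as a schema reference.

    import org.apache.avro.Schema;

    public class SchemaReferenceExample {
        public static void main(String[] args) {
            Schema.Parser parser = new Schema.Parser();

            // The Address type is defined once, on its own.
            Schema address = parser.parse(
                "{\"type\":\"record\",\"name\":\"Address\",\"fields\":["
              + "{\"name\":\"street\",\"type\":\"string\"},"
              + "{\"name\":\"city\",\"type\":\"string\"},"
              + "{\"name\":\"postalCode\",\"type\":\"string\"}]}");

            // Individual refers to Address by name; the parser already knows the type,
            // so the nested structure is shared rather than redefined in every schema.
            Schema individual = parser.parse(
                "{\"type\":\"record\",\"name\":\"Individual\",\"fields\":["
              + "{\"name\":\"firstName\",\"type\":\"string\"},"
              + "{\"name\":\"lastName\",\"type\":\"string\"},"
              + "{\"name\":\"addresses\",\"type\":{\"type\":\"array\",\"items\":\"Address\"}}]}");

            System.out.println(address.getFullName());
            System.out.println(individual.toString(true));
        }
    }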
Yeah, yeah. Okay. So let's say I'm convinced that I should write down my schema for the sake of everyone in the company. And I've chosen my schema format, which I'm choosing protocol buffers because you convinced me. Off the record, I'll keep thinking about that. I'll do some more research, but on the record I'm totally sold for the podcast. No, I'm kidding. I'm teasing you. I'm teasing you.
Right. So that's all good. I start publishing my data. People generate their object types from my definition. Everything's great. Now I need to change my data. Now what support will my schema language of choice give me for changing the data format?
So the difference between the formats, Avro and Protobuf and whatnot, is less noticeable around what evolution allows you to do, and more about how the evolution allows you to do it. So for being backwards compatible or forward compatible, or fully compatible, there are fundamental concepts here that are required in order to be considered backwards or forwards or full.
So for example, some fields, whenever you add them, have to be optional, or in certain cases, if you want to remove a field, that field had better be optional, or if you want to add one, it had better be required. It works differently for the different compatibilities. Right? Where it really differs between the formats is around how these compatibility checks happen locally, and what that allows me to do in terms of being a little bit more liberal around adding or removing fields. For example, I mentioned before that Avro really, really cares where you add a field. It in fact takes order as a very big indicator of what a field is when interpretation comes around.
Oh, so would Avro stop me reordering the fields alphabetically if I just felt that way?
Exactly. That's right. Yeah. So Avro does not like that. Whereas Protobuf, in its encoding format, IDs fields by integers. If you've ever defined a Protobuf schema, you have to give each field a little number, and that's how it encodes it. And so order matters less in Protobuf, where you can add and remove fields and define them differently.
Okay. So you referenced two terms. I think one of them might be familiar to people, the other might not. Forwards compatibility as well as backwards compatibility. Can you take us through that?
Yeah, so backwards compatibility, and I always get this mixed up because at some point they all start blending together. Backwards compatibility means that you can update your consumers first and they will continue working as long as your producers produce with an older version of your schema. And this kind of goes back to what is allowed. In backwards compatibility you can add fields in an optional manner, in a manner where, if the consuming end interprets the data but doesn't find the field, it is okay with that. It says, "Well, this is an optional field anyway, so I don't really care. I can just continue going forward."
The same is true for the reverse. Forwards compatibility says you can upgrade your producers first and your consumers will be fine, as long as certain rules are followed. So when you add a field in a forward compatible manner, the field can be required. But since your consumers don't have any idea of what this new required field is, they don't really care. They'll just continue interpreting the data as they were before.
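A quick way to see these rules in action locally is the Avro library's own compatibility checker. This is just a sketch with a made-up Product schema; Schema Registry applies the same style of checks on the server side.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    public class CompatibilityCheck {
        public static void main(String[] args) {
            Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"}]}");
            // v2 adds countryOfOrigin with a default, so a v2 reader can still fill it in
            // when it reads data that was written with v1.
            Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"countryOfOrigin\",\"type\":\"string\",\"default\":\"unknown\"}]}");

            // Backward: consumers upgrade to v2 first and keep reading v1-written data.
            System.out.println("v2 reader, v1 writer: "
                + SchemaCompatibility.checkReaderWriterCompatibility(v2, v1).getType());
            // Forward: producers upgrade to v2 first and old v1 readers ignore the new field.
            System.out.println("v1 reader, v2 writer: "
                + SchemaCompatibility.checkReaderWriterCompatibility(v1, v2).getType());
        }
    }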
So say I decide it's important that we keep our products' country of origin. I can start writing country of origin into the new data, and since the consumers don't yet know or care about it, it doesn't matter.
Exactly. They'll receive it, they'll receive the encoded message, but then they'll interpret it fully and say, "There's some extra data, screw it. I don't really care. Let's keep going." So those are the two sides of the coin. And then there's full compatibility, which is the merging of both. And nowadays most companies will hear that and say, "Well, does that mean that I can just update in any manner? If I so choose to update my consumers first, or producers, it doesn't matter, right? I'll just do it." Right? And then I'll answer them. "Yeah, yeah, that's exactly what it means." And then they're like, "Well why would I ever go to backwards or forward compatibility? This is the best ever."
And the answer is because full compatibility comes with caveats, right? Yes, with full compatibility you can upgrade your consumers or your producers first, it really doesn't matter. But once you enter the full compatibility realm, basically nothing is guaranteed. You'd have a base structure from your initial schema, and then every evolution after that will have to be optional fields that are added or deleted. And you will have no guarantees on whether a field even exists. At that point you'd have to start doing existence checks. Like, "Hey, I know my schema said that my field is here, but that field is optional. Let me actually check that it is not null." Right?
Yeah. And that feels a little bit like we're going back into the days when we were just guessing about it-
Exactly.
Right. Yeah.
Exactly. You're basically back at doing object mappers with JSON Blobs. Right?
Yeah. So what would you recommend then? Should I get into a habit of upgrading my producers first and then only look at that direction of compatibility? Or is there a habit that will help me here?
When people hear all of these options, the most common choice is forwards compatible. Why? Because forwards compatibility does allow you to add fields that will be required in the future. So you can have some strictness around your data. It's like, "Okay, well now I've decided that everybody must have a favorite color. Every individual from now on, I need to know their favorite color. It's required." And forwards compatibility allows you to do that, and that's very attractive for organizations. But then some organizations that are very careful and care a lot about compatibility, just long-term compatibility, they do full compatibility and eat the drawbacks of it. There's some other... Mm-hmm?
I guess if you're a company that really needs really long-term compatibility you're going to fall into that hole eventually, right?
Right, and-
[inaudible 00:28:02] you need to be able to deal with every possible kind of version you've ever seen.
Exactly. Because think about it, you have this central nervous system in your company that is Kafka, and you have maybe a hundred dev teams, who knows? And they could have integrations with Kafka in everything from the app on your phone to devices like a TV that doesn't really get updated very often. So to them, a change in the data contract is a huge deal. Or even headphones, you don't really update headphones very often either. And that's where it's, "Hey, if I have a device that is 10 years old, I need to be able to interpret its data regardless of where I am at, 10 years from now."
Yeah. Are you saying there's like a switch I can enable that will... I say I want full compatibility and it makes guarantees to me and stops me doing things that will break that?
So full compatibility, depending on the checks... And keep in mind that Schema Registry is the one that's doing the checking. Whenever these checks come up I'm thinking of Schema Registry; there are other solutions out there in the market, but I'm thinking Schema Registry. If the compatibility guarantees are not met whenever a new object comes into play in the data pipeline, Schema Registry will ensure that that producer gets shut down, because it would not follow the guarantees. That's right. Yeah.
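For reference, the switch being discussed is the per-subject compatibility setting in Schema Registry. Here is a minimal sketch of setting it to FULL over the documented REST API, assuming a local registry and an illustrative subject called orders-value.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SetCompatibility {
        public static void main(String[] args) throws Exception {
            // Set the compatibility mode for one subject (the topic's value schema) to FULL.
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/config/orders-value"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"FULL\"}"))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

From then on, any new schema version registered for that subject has to pass the FULL compatibility check, and incompatible producers fail at serialization time.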
Right, okay. So let me follow this through naturally. Oh, I know what I was going to say before I say that. So... I mean, this isn't a problem that's unique to Kafka; any data format spoken between different systems will have this what-happens-when-we-change-the-data-format problem. What we are really trying to avoid here is the situation where I'm going to change the schema for my central database and everyone has to upgrade their software in that same maintenance window or we're all going to have problems. We're trying to have a little bit of slippage between when the writer gets updated and when the readers have to get updated.
You got it. Yeah, you got it 100%. So decoupling, it's all the rage nowadays, you don't want one microservice to be dependent on another, I want full control and full liberty here. Which is great, it's worked out great for a lot of companies, you just got to make sure that these compatibilities are there to make sure that as each microservice evolves, compatibility is maintained between them.
Yeah. No one's saying it comes for free, it just comes much cheaper than trying to get 100 departments to release on the same day.
Exactly.
Okay, so here I am, I'm thinking about full compatibility over time, and I find that half of my data has a country of origin field for my products, and all the old stuff doesn't. So I've got half of my topic where this field exists and the other half where it doesn't. I'd like to get to the point where I just have a single format for all the historical data. That feels like something I'd like to have, but the records are immutable, so I don't have a consistent thing. What do I do about that situation, where I have two different formats and I'd really like to unify them into one?
100%, this is a very, very common scenario. And I don't really talk much about what I do, but I used to be in professional services here at Confluent, and now I'm a success architect, helping guide our customers' architectures over a long period of time. And this is a problem that appears quite often: I want consistency within the data that I have. It is most commonly solved today by what we call topic versioning. There are different ways of enforcing data contracts within the Kafka ecosystem, but the most popular is one schema that evolves over time per topic. This is the most common way. And topic versioning goes on to say, well, yes, one schema that evolves over time, but at some point I might say this schema is not enough anymore. I want a whole new schema that has other guarantees that my business needs today. And not only do I want new data to follow this schema, all data also needs to follow this schema. Even though the old data may not have some fields, I'd like to convert that old data to the new format that I expect, so that the whole thing works nicely.
Topic versioning is around the idea that, where my topic with evolution worked great for me before, I'm now going to create a version two of this topic. In this version two, because of Kafka's compatibility features, I am able to migrate all my data from topic one onto topic two in the same data format by creating a caster between version one and version two. I can do this in a multitude of ways; probably the most common way today is through KSQL. It makes it really easy to take data in an old format and then spit it out in a new one. And then the new topic will have the rule that only this format is accepted from now on, it is a strict format. If a field in the previous data doesn't exist, I'll make something up, I'll put a null or I'll make an empty string, doesn't matter. I'll have my new format, it's way more strict, and now my new topic evolves from there. There's a migration gap there, where your producers would produce to the version one topic for a little bit, and then you would do a switchover from the producers writing to the old topic to the producers writing to version two of it.
So let me check I've got that step in my head. I've got, I don't know, product version one, I create a product version two topic. I write maybe a KSQL job that rewrites all the old historical data into version two, and I leave that job running so that anything new that gets written into version one automatically ends up on version two. Then I switch over all my consumers onto version two and they'd still be getting the stuff being bounced through from the producers. And then finally, I can move the producers over to start writing to topic two.
You got it. That's exactly the steps, yeah.
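Abraham names KSQL as the most common route; here is the same caster idea sketched with Kafka Streams and Avro generic records instead. The topic names, the Product fields, and the "unknown" default are all illustrative.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

    public class ProductV1ToV2Caster {
        // v2 adds a required countryOfOrigin field that old records never carried.
        private static final Schema V2_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"countryOfOrigin\",\"type\":\"string\"}]}");

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "products-v1-to-v2-caster");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
            props.put("schema.registry.url", "http://localhost:8081");

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, GenericRecord>stream("products-v1")
                   .mapValues(oldRecord -> {
                       GenericRecord v2 = new GenericData.Record(V2_SCHEMA);
                       v2.put("id", oldRecord.get("id"));
                       v2.put("name", oldRecord.get("name"));
                       // Old data has no country of origin, so make something up, as Abraham says.
                       v2.put("countryOfOrigin", "unknown");
                       return v2;
                   })
                   .to("products-v2");

            new KafkaStreams(builder.build(), props).start();
        }
    }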
Okay. What's the trade-offs of doing that? What does it cost me?
The number one trade-off here is literally the cost. You're basically taking data that you've already stored once, and you have to store it all over again in the new version. You're probably also going to have to pay for some of the compute that is required in order to translate all this data. And then topic migrations, they're not a fun thing to do. Some companies, and I work with a variety of them, have it down like it's a science; they're like, it's not a big deal, we'll do it within an hour. And some companies are not experienced in this type of migration, mini migrations I guess, and they'll have an issue doing the right cutoffs and whatnot.
So there is some complexity there. Would it be nice to be able to tell Kafka to handle this migration for you? That'd be the best, but the capability's just not there yet.
It's not yet automated.
Yeah.
Is there some fundamental difference between the companies that do it well and the companies that struggle, or is it just practice and habit?
In my opinion, the most fundamental difference for the companies that have it down is that they have made data contracts a very important part of their data flows. I am a big believer in maintaining all of your schemas in a [inaudible 00:37:05] manner and then distributing them onto Schema Registry or any other systems through that system. And that allows for very public visibility into why something evolves, when it evolves, and when it's published and taken into production. And that helps people understand, okay, so now there's this that is in this topic, and then it'll go over into the other topic. So that is one big difference that I see: schemas and data contracts are a central part of data flows within those companies.
And then the other part, and this just comes with experience, is being very used to understanding how consumers and producers in Kafka work. What is my consumer lag? How do I make sure that whenever I migrate from consuming one topic to the other, I am not replaying events, not redoing events that I already did, or that I start at the very end when moving from the previous topic onto the new one? Understanding those concepts goes a long way, so that people can say, well, yeah, I had a consumer on version one and I'm just going to do a cut-over, but I'm going to cut over at this offset. And then on the other side, I'll just start from the beginning, and I understand that I will maybe re-consume messages and reprocess them, but that stream feeds an idempotent system, so it really doesn't matter. Or I will start at this offset on the new topic, and I know exactly what offset that is; I've been able to locate it, for example through offsetsForTimes, and I have been able to configure my consumer group to start from that offset. That's another difference.
So offset mapping from where you finished on topic one to where you're starting on topic two?
Right.
That seems tricky actually, to figure out where that offset is. Do you have any tips?
100%. So the number one way that a lot of people avoid all that trouble is to do the full switchover at once. We can call the migration story that we've talked about a phased migration. In order to avoid all of this, you could just do it in a Big Bang release type of deal, where you stop your producers first, you wait for all that data to be consumed from V1 and for the KSQL transform to produce it onto V2. So now you know that the last message in V2 is the exact same last message as in V1. So you just have your consumers start at the end, and then move the producers onto V2, yeah.
Okay. I see that as a strategy, but I feel like we've slipped back into making different departments coordinate.
Definitely, yeah. So there's the phased approach and the advantages that come with it, and then there's the Big Bang release, where yes, there is some coordination, but you at least don't have to worry about the other stuff.
Could I do something like this: in my process that's writing from version one to version two, my KSQL process perhaps, it occurs to me I could write the offsets from the old topic into the new records, and that would give me something I could read to make sure I was in the same place. Is that a good idea or a terrible one?
Yeah, [inaudible 00:40:50]. I would say it's not a very good idea, because offsets don't map very well between topics, especially whenever you're doing transformations, and even more so whenever you're doing exactly-once processing. Markers start getting into it, transactions start getting into it. And knowing what offset a message had on the old topic might be useful in the sense of, oh, well, if I find that offset in the new one, then I'm golden, I know that I need to start from there. But then how do we even find it? There's nothing that really tells us it'll be near the offset in the old topic. And you'd probably have to inject it in something like headers, which is not really [inaudible 00:41:42].
Yeah, okay. I feel like I want some more advice, give me some more tips. I feel a bit lost on this particular point.
Yeah, no, for sure. There's also another phased approach. If you don't want to really couple both teams, you could just tell your consumer to consume from both topics at the same time and allow it to do interpretation from both topics. And then you can stop and start your producer on the new topic with no issues, because you know that there won't really be double posting: as soon as you stop producing here, you start producing over there, so new data will go on the new topic. Your consumer is consuming from both anyway. And then once the producer switches over, if you like, you can tell your consumers to stop consuming from V1, and then you can just retire that topic.
Okay. Yeah, I like that. I like that.
That's commonly the method that we suggest for DR, by the way. You would do a [inaudible 00:42:58] subscription for two topics, one for your local and one for your remote. And then whenever a failover comes to happen, you're subscribed to both anyway, so you'll be fine.
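A minimal sketch of that dual subscription: the consumer reads the v1 and v2 topics at once, so the producer cutover needs no coordinated consumer change. The topic names and the plain String deserializer are just for illustration.

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DualTopicConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "product-consumer");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribe to the old and new topics at once; when producers switch over,
                // records simply start arriving on products-v2 with no consumer-side cutover.
                consumer.subscribe(Arrays.asList("products-v1", "products-v2"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s (%d): %s%n", record.topic(), record.offset(), record.value());
                    }
                }
            }
        }
    }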
Oh, okay. That's an interesting one. So there was one other question I had for you buried in that, you mentioned upcasters and downcasters. Can you define those for me?
Yeah, sure. So a lot of the time, there is no real clear-cut way to translate between one schema version and another when they weren't evolved with the strict rules that forward, backwards, and full compatibility set. Sometimes you have a schema one that could technically be mapped onto schema two, but it requires a little bit of logic, maybe some human element, some human knowledge, in order to put this schema onto the next one. And that is the concept of what I call schema casters, where you can either upcast or downcast a schema to a version that is not really compatible under the rules that regular forward, backwards, or full compatibility gives you, but that can be made compatible given certain human knowledge that is injected into a service.
This is a more complex way of doing things. This is not where you should go when you are starting with your data evolution story. You should always try to see if the regular rules can be applied to your data story, so that you don't have to do anything custom. But there's a bunch of companies doing really interesting stuff out there. Yeah, this really comes from a company that I've worked with in the past called KOR Financial.
KOR Financial.
They had a great talk at Current about the work that they did in this area, and I believe all those talks are online nowadays.
Oh, yeah. We'll put a link to that talk in the show notes.
Thank you, yeah. They're a great, very smart group of people, but they had a very particular problem around, hey, they need to store data for a very long period of time. Then not only do they need to store it, they need to make sure that over that whole period of time, this data can evolve in ways that we couldn't even imagine.
Yeah.
They work with regulatory bodies that don't really care about the rules of forward or backward or full compatibility. They just say ...
They slap some rules on you.
Yeah, exactly. Right. And that's the work of an engineer sometimes: some business requirements come in, and you have to figure it out. With the specific constraints that they had, they wrote what is called a caster registry, and they open sourced it. They only solved this problem for Java; they're a big Java shop.
They created a registry where these casting rules between one schema and another can be stored, and then at run time your program can extract them and leverage them to cast a schema from a version one to a version two.
Oh, okay. You are making a query on their server saying, "I've got a version one, and I want to read a version four. Can you give me a bunch of rules?"
Exactly.
That's pretty cool.
Exactly. Yeah.
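To give a flavor of the idea, and only a flavor, since the real KOR Financial project has its own API, a hypothetical Java caster interface might look something like this.

    import java.util.List;

    /**
     * Hypothetical sketch of a schema "caster": hand-written logic that upgrades a record
     * from one schema version to the next when the normal compatibility rules can't.
     */
    public interface SchemaCaster<FROM, TO> {
        int fromVersion();
        int toVersion();
        TO upcast(FROM oldRecord);

        /** Apply a chain of casters (for example v1 -> v2 -> v3) fetched from a caster registry. */
        static Object castToLatest(Object record, List<SchemaCaster<Object, Object>> chain) {
            Object current = record;
            for (SchemaCaster<Object, Object> caster : chain) {
                current = caster.upcast(current);
            }
            return current;
        }
    }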
Okay.
It's extremely cool. I think they solved the problem in a very clever way. It does come with its own drawbacks. One, it is a more complex manner of maintaining your data, because you have to write casting rules for every version of the data that you put out there, right?
Yeah.
You have to write them for every version you want to translate. Not only that, these casters are not multi-language compatible. Unless you define a way to transcribe the rules into code for many different languages, you'll have an issue with that. For them, they wrote a registry that stores rules for Java objects, but if you're thinking multi-language support, it's going to take you more time and it's going to take more effort.
Yeah. Yeah.
It's a clever solution to a complex problem, but not every company will find this problem.
Yeah. You have to be in that particular world, and you're obviously going to get to that world in the crazy life of financial products, where you keep the data around for 40 years and the rules for it are arbitrarily changed by governments many times.
Exactly.
Yeah, yeah.
Exactly.
Okay. Don't start solving that problem until you have it, which is always good advice in software.
Exactly. Yeah. You got it.
There was one more thing I wanted to pick you up on that you mentioned, which surprised me.
Yeah.
You wouldn't recommend storing your master copy of the schemas in Schema Registry.
Yeah. Yeah. That's a big one that I've learned over time from working with many different customers. Listen, I work for Confluent, so I like Schema Registry quite a bit, and clearly I work with it a lot.
Yeah.
I think it's a great product, but after seeing all these companies, I think there is a lot of value in maintaining your source of truth, your golden record of schema definitions, away from a service that might require a custom exporting mechanism, and maintaining them centrally instead. The recent trend is GitOps: keeping them, in a GitOps manner, in a Git repository where you can do with that data as you'd like. You can put certain build rules in place, so that whenever an evolution happens, there are checks against Schema Registry. I still think Schema Registry plays a big part here.
If I want to evolve a schema in my GitHub repo, I wouldn't want to just allow you to put up a PR and then merge it with no issues. You have build constraints that say, "Hey, this new version needs to be compatible with the old version. Make a call to Schema Registry to make sure that these compatibility rules are being followed." And whenever it actually gets committed onto the main branch, you'd want to make sure that as that merge is happening, there is a check that says, "Hey, I know that we said that it was good on the PR, but let's make sure that on the whole, with the new build, the main branch is still buildable and still compatible."
Against that version. Would you then, as part of your CI hook, push that schema up to Schema Registry?
Yeah, I would. Yeah, on every main branch merge for sure.
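As a sketch of that build-time gate, here is a small Java check that posts a candidate schema to Schema Registry's compatibility endpoint and fails the build if it is rejected. The file path and subject name are made up, and in practice the Schema Registry Maven plugin Abraham mentions a little later would do this for you.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class CompatibilityGate {
        public static void main(String[] args) throws Exception {
            // Crudely escape the schema file into the JSON payload Schema Registry expects;
            // a real build script would use a JSON library instead.
            String schema = Files.readString(Path.of("schemas/orders-value.avsc"))
                                 .replace("\\", "\\\\").replace("\"", "\\\"")
                                 .replace("\n", "\\n");
            String body = "{\"schema\": \"" + schema + "\"}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/compatibility/subjects/orders-value/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            // Fail the build if the proposed schema breaks the subject's compatibility setting.
            if (response.statusCode() != 200 || !response.body().contains("true")) {
                System.err.println("Schema is not compatible: " + response.body());
                System.exit(1);
            }
        }
    }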
Do you know what it makes me think? It's almost like you're keeping your schemas in Git for compile time and in Schema Registry for runtime.
Yeah, that's a very good way to put it. And it's not only that. Sometimes, and you could get this from Schema Registry too, it's another tool that developers can leverage to say, "Hey, I have a bunch of objects that I need to resolve for my data needs. I can just pull down this Git repo and then utilize something like the Avro CLI or the Protobuf CLI to render all of these objects in the language of my choice, in order for me to use them."
It allows you to have a nicer way to pull them, a nicer way to make them more publicly accessible, evolve them. Really, it allows you to be even more compatible with other discoverability tooling. Today, data discoverability is a big point of emphasis because everybody's trying to get a data mesh. Everybody's trying to get different business units talking to each other and making sure they can leverage data. Discoverability is the number one problem in this equation. All the other stuff is hard, don't get me wrong.
Even if you have all of this stuff set up to actually stand up the concept of a data mesh within your company, discovering the data, and making it actually consumable easily for all of these other internal customers, call it, is the hardest part, in my opinion. There's some tooling around there to make it easier. With a centralized system for all of your Schemas, not only can you export these things to Schema registry, but if you have a third party rule for data discovery, you could export them to that tool and then discover data in that manner.
Yeah, yeah. I could very easily see someone having a Git repository of all their Avro or Protocol Buffers schemas, and that repository including easy steps to generate code in your preferred language, and maybe even a web front end that lets you browse that history, as part of that discovery story. Do you know of any existing tools to make this easy, or is it still undiscovered country?
Yeah, it's halfway there. Schema Registry has a nice Maven plugin to help you with schema publishing and schema compatibility checks against a Schema Registry instance. The generation of objects is mostly done by the projects themselves: Avro provides its own way of interpretation and generation, and Protobuf does its own. That's provided by the projects.
I would say, for an interlanguage way of ensuring that all these checks are being done and the objects are being generated, there is not one good single tool out there yet. Most customers that I work with end up writing regular Bash scripts that, at runtime, do a lot of these things and leverage the different tooling to bring it all together in one place.
Okay.
Yeah. It's a young space, it's still evolving. A lot of tooling is yet to come, and that's another aspect of the space that can improve.
It's a new open source project waiting to happen, right?
There you go. Yeah.
Cool.
Also, fun fact: you mentioned having your Avro schemas or your Protobuf schemas; I work with customers that have both in the same Git repository. They have ways to translate Avro schemas into Protobuf, and Protobuf schemas into Avro.
Oh, God.
They give their developers choice. Like, "Hey, you want Avro? You can have it. You want Protobuf? You can have it. All you need to do is follow these rules to make sure that it's translatable into the other format later on for others to use." Yeah.
Right.
Yeah.
Yep. Agreement is the hardest thing in computer science, I'm convinced.
Agreed.
Okay. Well, this has been a fascinating tour. We don't normally do this, but I think I'm going to try and open this up to the audience, because I bet people have more questions. If you have a question for Abraham, send us a message. You'll find my contact details in the show notes, or there's a link on the YouTube version of this podcast. We might do a follow-up episode where we grill you with readers' questions. How does that sound?
I love that. Hey, that's what I do for a job nowadays, right? It's mostly, I work with a lot of customers. They grill me with questions, and sometimes I'll be, "Hey, that's a really good question. I've never heard of that specific problem before, and I'll have to go back and research it." Sometimes I'll be able to tell you the answer off the cuff. I think that'll be a really fun episode.
Okay. We might do a challenge episode. Thanks very much, Abraham.
Yeah, thank you, Kris. I hope you have a great day.
You too. Catch you again. Thank you, Abraham. Now, hopefully that's answered all of your burning data evolution questions, but if not, then hit us up. Send us some messages, drop us a line, leave us a comment, and I will gladly pull Abraham back in for a follow-up episode. It'd be quite fun to do a big Q&A episode. It's a hot topic, right? There's a lot to learn, there's a lot to share, so don't be shy. Let us know.
In the meantime, as well as leaving us your further questions, we have further answers for you. If you look in the show notes, you'll find a link to that KOR Financial talk that Abraham mentioned. And if you want an in-depth course about how to use Schema Registry, there's one from our very own Danica Fine. You'll find that on Confluent Developer, along with lots of other useful Kafka courses. Go and check it out at developer.confluent.io.
With that, it remains for me to thank Abraham Leal for joining us, and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.
Have you ever struggled with managing data long term, especially as the schema changes over time? In order to manage and leverage data across an organization, it’s essential to have well-defined guidelines and standards in place around data quality, enforcement, and data transfer. To get started, Abraham Leal (Customer Success Technical Architect, Confluent) suggests that organizations associate their Apache Kafka® data with a data contract (schema). A data contract is an agreement between a service provider and data consumers. It defines the management and intended usage of data within an organization. In this episode, Abraham talks to Kris about how to use data contracts and schema enforcement to ensure long-term data management.
When an organization sends and stores critical and valuable data in Kafka, more often than not it would like to leverage that data in various valuable ways for multiple business units. Kafka is particularly suited for this use case, but it can be problematic later on if the governance rules aren’t established up front.
With Schema Registry, evolution is easier thanks to its compatibility guarantees. When managing data pipelines, you can also use GitOps automation features for an extra control layer. It allows you to be creative with topic versioning, upcasting/downcasting the data collected, and adding quality assurance steps at the end of each run to ensure your project remains reliable.
Abraham explains that Protobuf and Avro are the best formats to use rather than XML or JSON because they are built to handle schema evolution. In addition, they have a much lower overhead per-record, so you can save bandwidth and data storage costs by adopting them.
There’s so much more to consider, but if you are thinking about implementing or integrating with your data quality team, Abraham suggests that you use schema registry heavily from the beginning.
If you have more questions, Kris invites you to join the conversation. You can also watch the KOR Financial Current talk Abraham mentions or take Danica Fine’s free course on how to use schema registry on Confluent Developer.