Wade Waldron

Staff Software Practice Lead

Schema Evolution

Overview

Schema evolution is the act of modifying the structure of the data in our application without impacting clients. This can be a challenging problem. However, it gets easier if we start with a flexible data format and take steps to avoid unnecessary data coupling. When we find ourselves having to make breaking changes, we can always fall back on creating new versions of our APIs and events to accommodate those changes.

Topics:

  • What is a schema?
  • Do schemaless formats need to evolve?
  • What is schema evolution?
  • How does the data format impact schema evolution?
  • What is a forward-compatible schema?
  • What is a backward-compatible schema?
  • How do we version our APIs and Events?
  • Should my APIs use shared libraries?
  • How can I advertise my schema changes?


Schema Evolution

Hi, I'm Wade from Confluent.

During my consulting days, one of the most common questions I was asked was how to handle schema evolution.

It's a complicated problem, and there's no single answer.

But if you stick around for a few minutes, I'll introduce you to schemas and how they can be evolved.

When two microservices communicate, they need to agree on the format of communication.

Essentially, we need to define an API.

This API acts as a contract between a microservice and its clients, ensuring that the service meets its promised obligations.

Many data formats encode these details in an explicit schema.

These schemas include details such as what fields are present, what structure they take, and what data types they contain.
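
For example, a Protobuf schema spells these details out explicitly. Here's a minimal sketch (the messages and fields are hypothetical):

```protobuf
syntax = "proto3";

// Each field declares a name, a data type, and a position in the structure.
message Order {
  string order_id = 1;
  int32 quantity = 2;
  Address shipping_address = 3;  // nested structure
}

message Address {
  string street = 1;
  string city = 2;
}
```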

But what if our data format doesn't include a schema?

By default, JSON is considered schemaless, and many people use it that way.

If our API leverages a schemaless format, then do we still need to worry about schemas and schema evolution?

What do you think?

Feel free to pause the video and drop a comment before I give you my answer.

Even if we don't define a schema, an implicit one still exists.

By defining an API, the microservice commits to providing data in a specific format.

Clients will expect the API to follow these rules even without an explicit schema.

Over time, internal and external pressures such as competition and regulations can force the schema to change.

We've all witnessed how modern privacy legislation has forced companies to adapt their systems.

These can be simple changes such as adding or removing a data field, but they can also be complex.

We might decide to switch from a JSON format to Protobuf to improve performance.

Or we might be forced to encrypt data to meet regulatory requirements.

Remember, the schema is a contract and if we break the contract, clients may lose faith in the system.

Therefore, it's critical that we evolve the schema in a non-breaking fashion.

The first thing we should do is choose a data format that is flexible and easy to evolve.

Some formats are more rigid than others.

For example, with JSON, we might be able to add or remove a data field, but renaming one is likely to be a breaking change.

Meanwhile, Protobuf allows fields to be added, removed, and renamed in a non-breaking fashion, but it isn't human-readable like JSON.
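
To see why renaming works in Protobuf, remember that fields are identified on the wire by their numbers, not their names. Here's a sketch (the message itself is hypothetical) of a rename that keeps the field number, and therefore the serialized data, intact:

```protobuf
syntax = "proto3";

// Version 1 of the schema.
message Customer {
  int64 id = 1;
  string name = 2;
}
```

```protobuf
syntax = "proto3";

// Version 2: "name" was renamed to "full_name", but the field number (2)
// is unchanged, so data written with either version can still be read.
message Customer {
  int64 id = 1;
  string full_name = 2;
}
```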

We also need to consider whether those changes are going to be forward or backward compatible.

Forward-compatible changes allow old clients to read data that was written with the new version of the schema.

In a JSON document, adding a new field is considered a forward-compatible change.

Clients written against the new version of the schema can read the new field.

Meanwhile, clients written for the old schema can simply ignore the field.
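
To make that concrete, here's a minimal sketch of an old client tolerating a new field, assuming JSON deserialization with Jackson (the Customer model and payload are hypothetical):

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ForwardCompatibleClient {

    // Model written against the old schema; it knows nothing about "email".
    public static class Customer {
        public String id;
        public String name;
    }

    public static void main(String[] args) throws Exception {
        // A payload written with the new schema, which added "email".
        String json = "{\"id\":\"42\",\"name\":\"Jane\",\"email\":\"jane@example.com\"}";

        ObjectMapper mapper = new ObjectMapper()
                // Skip unknown fields instead of throwing an exception.
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        Customer customer = mapper.readValue(json, Customer.class);
        System.out.println(customer.name); // prints "Jane"; "email" is ignored
    }
}
```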

Backward-compatible changes allow new clients to read data that was written with an old version of the schema.

Event streams often favor backward-compatible schemas because they contain events written with different schema versions.

When a client attempts to read the event stream, it needs to be prepared to handle messages written with an old version of the schema.

In a JSON document, deleting a field is a backward-compatible change.

The current version of the document won't contain the field.

If the client receives an older version that does contain the field, it can ignore it.
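
The same Jackson setting covers this direction too. In this sketch (again with hypothetical names), the new client's model simply omits the deleted field, so older events that still carry it deserialize cleanly:

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BackwardCompatibleClient {

    // Model for the new schema, where "legacyCode" has been deleted.
    public static class Order {
        public String id;
        public double total;
    }

    public static void main(String[] args) throws Exception {
        // An older event that still contains the deleted field.
        String oldJson = "{\"id\":\"7\",\"total\":19.99,\"legacyCode\":\"ABC\"}";

        ObjectMapper mapper = new ObjectMapper()
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        // "legacyCode" is skipped during deserialization.
        Order order = mapper.readValue(oldJson, Order.class);
        System.out.println(order.total); // prints 19.99
    }
}
```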

Unfortunately, despite our best efforts to maintain compatibility, sometimes we're forced to make breaking changes.

When that happens, we want to minimize the impact on clients.

A common way to handle this is by creating new versions of the API.

In the case of a REST API, this could mean creating a new V2 endpoint.

For a Kafka data stream, we might create a V2 topic.

When we create the new API, we might need to support the old version for a while to allow clients time to migrate.
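
One way to support both versions is to publish each event to the old and new topics during the migration window. Here's a sketch using the Kafka producer API (the topic names and payload formats are hypothetical):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DualWriteProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "order-42";
            // Publish each event in both formats so existing consumers keep
            // working while new consumers migrate to the v2 topic.
            producer.send(new ProducerRecord<>("orders-v1", key, "{\"id\":\"42\",\"amount\":19.99}"));
            producer.send(new ProducerRecord<>("orders-v2", key, "{\"orderId\":\"42\",\"total\":{\"amount\":19.99,\"currency\":\"USD\"}}"));
        }
    }
}
```

Once every consumer has moved to the v2 topic, the dual write, and eventually the v1 topic itself, can be retired.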

On the client side, we should be careful to avoid data coupling.

Every time the client reads a piece of data from the API, it's coupled to that data.

If the data changes, then the client has to change with it.

Therefore, the client should be careful to only read the data that it cares about to minimize the impact of changes.

For example, it's common to write shared client libraries for a microservice's API.

Clients of the API can leverage the shared library rather than re-implementing the logic.

However, for this to work, the library has to read all of the data exposed by the API.

If any part of the data changes, then the client library will need to be updated.

That, in turn, can force a change to any clients that use the library.

An alternative is to use a share-nothing approach.

Here, each client would implement its own logic for reading the API.

In that case, they can ensure that they only read the data that matters to them.

That way, if the API evolves in ways that the client doesn't care about, it may be able to avoid unnecessary updates.
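
For example, rather than binding to the whole document, a client that only cares about one field can read just that field. Here's a sketch using Jackson's tree model (the payload and field names are hypothetical):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ShareNothingClient {
    public static void main(String[] args) throws Exception {
        // A large API response; this client only cares about "status".
        String json = "{\"id\":\"42\",\"status\":\"SHIPPED\",\"items\":[],\"audit\":{}}";

        JsonNode root = new ObjectMapper().readTree(json);

        // Read only the field that matters to this client. Changes to
        // "items" or "audit" can't force a change here.
        String status = root.path("status").asText();
        System.out.println(status); // prints "SHIPPED"
    }
}
```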

One final consideration when we evolve our schemas is how to advertise the changes.

A schema registry such as the Confluent Schema Registry can be a useful tool for broadcasting schema changes and enforcing compatibility.

Validating messages against the schemas stored in the registry can ensure those messages are compatible with clients.

Meanwhile, clients can use the registry to see what versions of a schema are available and what data is contained within them.
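
As a sketch, here's how a Java client might look up that information with the Confluent Schema Registry client library (the registry URL and subject name are hypothetical):

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

public class RegistryLookup {
    public static void main(String[] args) throws Exception {
        CachedSchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // List every registered version of the subject's schema.
        for (Integer version : client.getAllVersions("orders-value")) {
            System.out.println("version " + version);
        }

        // Inspect the latest version to see what data it contains.
        SchemaMetadata latest = client.getLatestSchemaMetadata("orders-value");
        System.out.println(latest.getSchema());
    }
}
```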

As you can see, there's no single strategy for handling schema evolution.

It ends up being a combination of different approaches.

Just remember, your unique requirements may impact any of the suggestions I've made.

Thanks for joining me today.

Don't forget to follow our courses on Confluent Developer and our YouTube channel.

I've got plenty more videos like this coming, so make sure you like, share and subscribe so you can stay informed.

See you next time.