Data has a long lifecycle, often outliving the programs that originally gathered and stored it. And data originates from a wide variety of systems and programming languages. The more easily we can access that ocean of data, the richer the analysis we can perform.
In an online shopping business, data recorded by the order-processing system and data gathered about user behavior may prove invaluable to the website design department, provided they can actually access it. It's vital to be able to read data from an Event Store regardless of which process and which department put it there originally.
To a large degree, the accessibility of data is determined at write time, by our choice of Event Serializer. Still, the story is certainly not complete until we've read the data back out.
How can I reconstruct the original event from its representation in the event streaming platform?
Use an Event Streaming Platform that integrates well with a schema registry. This makes it easy to encourage (or require) writers to record a description of each event's structure for later use. With both the event data and its schema readily available, deserialization becomes straightforward.
While some data formats are reasonably discoverable, in practice it is invaluable to have a precise, permanent record of how the data was encoded at the time it was written. This is particularly true if the data format has evolved over time and the Event Stream may contain more than one encoding of semantically equivalent data.
Confluent’s Schema Registry stores a versioned history of every schema in Apache Kafka® itself. Client libraries can then use this metadata to seamlessly reconstruct the original event data, while we can use the registry's API to inspect schemas manually or to build libraries for other languages.
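For instance, a Kafka consumer becomes Schema Registry-aware simply by being pointed at the registry: the Avro deserializer reads the schema ID embedded in each message, fetches that schema from the registry, and rebuilds the event without any hand-written decoding. A minimal configuration sketch (the class name, group ID, and addresses are placeholders, and a string key is assumed):

import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class RegistryAwareConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder addresses; substitute your own broker and registry.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "event-reader");
        // Key handling depends on how the writer keyed the events; a string key is assumed here.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // The Avro deserializer reads the schema ID embedded in each message,
        // fetches (and caches) that schema from the registry, and rebuilds the event.
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);

        // Values arrive as GenericRecord: the event, reconstructed as it was written.
        KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props);
    }
}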
For example, in the Event Serializer pattern we wrote a table of fx_trades events. If we want to recall the structure of those events, we can ask Flink SQL to describe the table:
DESCRIBE fx_trades;
+---------------+----------------+-------+-----+--------+-----------+
| name          | type           | null  | key | extras | watermark |
+---------------+----------------+-------+-----+--------+-----------+
| trade_id      | INT            | FALSE |     |        |           |
| from_currency | VARCHAR(3)     | TRUE  |     |        |           |
| to_currency   | VARCHAR(3)     | TRUE  |     |        |           |
| price         | DECIMAL(10, 5) | TRUE  |     |        |           |
+---------------+----------------+-------+-----+--------+-----------+
Or we can query the Schema Registry directly to see the structure in a machine-readable format:
curl http://localhost:8081/subjects/fx_trades-value/versions/latest | jq .
{
  "subject": "fx_trades-value",
  "version": 1,
  "id": 1,
  "schema": "{\"type\":\"record\",\"name\":\"record\",\"namespace\":\"org.apache.flink.avro.generated\",\"fields\":[{\"name\":\"from_currency\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"to_currency\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"price\",\"type\":[\"null\",{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":10,\"scale\":5}],\"default\":null}]}"
}
Unpacking that schema field reveals the full Avro schema:
curl http://localhost:8081/subjects/fx_trades-value/versions/latest | jq -rc .schema | jq .
{
  "type": "record",
  "name": "record",
  "namespace": "org.apache.flink.avro.generated",
  "fields": [
    {
      "name": "from_currency",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "to_currency",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "price",
      "type": [
        "null",
        {
          "type": "bytes",
          "logicalType": "decimal",
          "precision": 10,
          "scale": 5
        }
      ],
      "default": null
    }
  ]
}
An Avro library can use this schema to deserialize the events seamlessly. And any client libraries that are Schema Registry-aware can automate this lookup, allowing us to forget about encodings entirely and focus on the data.
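As a concrete, deliberately low-level sketch, the same work can be done by hand: fetch the value schema from the registry, parse it with the Avro library, and decode the raw bytes of a message. The registry also serves the bare schema at .../versions/latest/schema, which saves unwrapping the JSON envelope shown above, and the five bytes skipped below are Confluent's wire-format header (one magic byte plus a four-byte schema ID). Production code would resolve the schema by that embedded ID rather than assuming the latest version, which is exactly what the Schema Registry-aware deserializers do for us; the class and method names here are illustrative only.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ManualFxTradeDecoder {

    // Fetch the latest value schema for fx_trades; the /schema endpoint returns it unescaped.
    static Schema fetchLatestSchema() throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                "http://localhost:8081/subjects/fx_trades-value/versions/latest/schema")).build();
        String schemaJson = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body();
        return new Schema.Parser().parse(schemaJson);
    }

    // Decode the raw value bytes of one Kafka message (e.g. read with a ByteArrayDeserializer).
    static GenericRecord decode(byte[] payload, Schema schema) throws IOException {
        // Confluent wire format: 1 magic byte + 4-byte schema ID, then the Avro-encoded data.
        int header = 5;
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(payload, header, payload.length - header, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}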
In addition to Avro, Schema Registry supports Protobuf and JSON Schema. See Event Serializer for a discussion of these formats.
While the choice of serialization format is important, it doesn't have to be set in stone. For example, it's straightforward to translate between supported formats with Kafka Streams. For more complex scenarios, we have several strategies for managing schema migration: