Data has a long lifecycle, often outliving the programs that originally gathered and stored it. And data demands a broad audience: the more accessible our data is, the more departments in our organization can find a use for it.
In a successful system, data gathered by the sales department in one year may prove invaluable to the marketing department a few years later, provided they can actually access it.
For maximum utility and longevity, data should be written in a way that doesn't obscure it from future readers and writers. The data is more important than today's technology choices.
How does this affect an event-based system? Are there any special concerns for this kind of architecture, or will the programming language's serialization tools suffice?
How can I convert an event into a format understood by the event streaming platform and applications that use it?
Use a language-agnostic serialization format. The ideal format would be self-documenting, space-efficient, and designed to support some degree of backward and forward compatibility. We recommend Avro. (See "Considerations".)
An optional but recommended step is to register the serialization details with a schema registry. A registry provides a reliable, machine-readable reference point for Event Deserializers and Schema Validators, making event consumption vastly simpler.
For example, we can use Avro to define a structure for Foreign Exchange trade deals as:
{"namespace": "io.confluent.developer",
"type": "record",
"name": "FxTrade",
"fields": [
{"name": "trade_id", "type": "long"},
{"name": "from_currency", "type": "string"},
{"name": "to_currency", "type": "string"},
{"name": "price", "type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 5}
]
}
...and then use our language's Avro libraries to take care of serialization for us:
FxTrade fxTrade = new FxTrade( ... );
ProducerRecord<Long, FxTrade> producerRecord =
    new ProducerRecord<>("fx_trade", fxTrade.getTradeId(), fxTrade);
producer.send(producerRecord);
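This snippet assumes a producer already wired up with a registry-aware Avro serializer. As a minimal sketch (exact settings will vary by project), it could be constructed as follows, using Confluent's KafkaAvroSerializer for the value, a LongSerializer for the key, and the broker and Schema Registry addresses that appear in the Flink example below:
// Sketch only: constructing the producer used above. KafkaAvroSerializer turns
// FxTrade objects into Avro bytes and, by default, registers their schema with
// the Schema Registry the first time it is used.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://schema-registry:8081");

Producer<Long, FxTrade> producer = new KafkaProducer<>(props);
With auto-registration enabled (the serializer's default), the first producer.send() also publishes the FxTrade schema to the registry, covering the registration step described above.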
Alternatively, with Apache Flink® SQL, we can define an Event Stream in a way that enforces that format and records the Avro definition using Confluent’s Schema Registry:
CREATE TABLE fx_trades (
  trade_id BIGINT NOT NULL,
  from_currency VARCHAR(3),
  to_currency VARCHAR(3),
  price DECIMAL(10,5)
) WITH (
  'connector' = 'kafka',
  'topic' = 'fx_trades',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'earliest-offset',
  'key.format' = 'raw',
  'key.fields' = 'trade_id',
  'value.format' = 'avro-confluent',
  'value.avro-confluent.url' = 'http://schema-registry:8081',
  'value.fields-include' = 'EXCEPT_KEY'
);
With this setup, both serialization and deserialization of data are performed automatically by Flink behind the scenes.
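Outside of Flink, a plain Kafka consumer gets the same convenience by configuring a registry-aware deserializer. The following is a sketch only, assuming the "fx_trade" topic written by the Java producer above; the group id is illustrative, and Confluent's KafkaAvroDeserializer handles the value:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fx-trade-reader");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class);
// KafkaAvroDeserializer fetches the writer's schema from the Schema Registry,
// using the schema ID embedded in each message, and rebuilds the event.
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put("schema.registry.url", "http://schema-registry:8081");

try (Consumer<Long, GenericRecord> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("fx_trade"));
    for (ConsumerRecord<Long, GenericRecord> record : consumer.poll(Duration.ofSeconds(1))) {
        GenericRecord trade = record.value();
        System.out.printf("Trade %d: %s -> %s%n",
                record.key(), trade.get("from_currency"), trade.get("to_currency"));
    }
}
Because the deserializer resolves schemas by ID, the consumer needs no compile-time knowledge of FxTrade; each event arrives as a GenericRecord.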
Event Streaming Platforms are typically serialization-agnostic, accepting any serialized data from human-readable text to raw bytes. However, by constraining ourselves to widely accepted, structured data formats, we open the door to easier collaboration with other projects and programming languages.
Finding a "universal" serialization format isn't a new problem, or one unique to event streaming. As such we have a number of technology-agnostic solutions readily available. To briefly cover a few:
While the choice of serialization format is important, it doesn't have to be set in stone. It's straightforward to translate between supported formats with Kafka Streams, as sketched below. For more complex scenarios, we have several strategies for managing schema migration, including schema evolution and schema compatibility checks.
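As an illustration of such a translation, the sketch below re-serializes a topic of JSON strings into registry-backed Avro with Kafka Streams. The source and target topic names, the streamsConfig properties, and the jsonToFxTradeRecord helper (which would parse each JSON string into an Avro GenericRecord) are all hypothetical:
// Sketch only: translating events from JSON text to registry-backed Avro.
Serde<GenericRecord> avroSerde = new GenericAvroSerde();
avroSerde.configure(Map.of("schema.registry.url", "http://schema-registry:8081"), false);

StreamsBuilder builder = new StreamsBuilder();
builder.stream("fx_trades_json", Consumed.with(Serdes.Long(), Serdes.String()))
       .mapValues(json -> jsonToFxTradeRecord(json)) // hypothetical JSON-to-GenericRecord helper
       .to("fx_trades_avro", Produced.with(Serdes.Long(), avroSerde));

new KafkaStreams(builder.build(), streamsConfig).start();
The source topic is only read, never modified, so existing consumers can keep using the old format while newer ones adopt the translated stream.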
1 Older programmers will tell tales of the less-discoverable serialization formats used by banks in the 80s, in which deciphering the meaning of a message meant wading through a thick, ring-bound printout of the data specification which explained the meaning of "Field 78" by cross-referencing "Encoding Subformat 22".