This module covers a few best practices related to events and event streams. We will start by taking a look at the numerous benefits provided by event schemas and why they are absolutely necessary for using event streams. Next, we look at metadata and headers, and the role that each plays in an event-driven architecture. We will then focus on naming events and event streams. We take a look at a couple options and provide you with an example of where each works best. Finally, this module explores unique event IDs, as well as some strategies you can use to ensure each event is uniquely identifiable.
Schemas, as provided by Interface Definition Languages, are essential for making event streams consistent and reliable, and for aligning producer and consumer expectations on the format of the event data. Schemas provide structure and definition for the data communicated by an event. Just as explicit schemas are essential for constructing relational database tables, or a querying API on top of a set of data, schemas are also essential for defining the structure of data in an event. Apache Avro, Google’s Protobuf, and JSON Schema are three common IDLs that provide structure, format, and documentation for events.
Schemas enable code generation for both producers and consumers. While your compilation options depend on both the schema IDL and the programming language you are using, the idea is that you can take the schema and compile it into a class or object suitable for your language. Compiled languages get the benefit of compile-time type checks, significantly reducing mistakes and errors in data creation and usage.
Please note that these are only some examples of code generation - there are many different compilers that can convert a schema into a class or object in the language of your choice.
Schemas also enable evolution to help you handle your changing business requirements. In this example, the Cart fact is evolved to add a total_price field to new events.
Explicit schema evolution rules provide you with a framework for negotiating changes to your events, making it much easier to express how your data may change over time.
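As a rough sketch of the evolution described above, the following Python models two versions of a Cart schema as plain dictionaries loosely shaped like Avro record schemas. Everything here other than the total_price field name is an assumption made for illustration: the other field names, the null default, and the upgrade_record helper are hypothetical and are not a real Avro or Schema Registry API.

```python
from typing import Any

# Hypothetical v1 schema for the Cart fact (field names are assumed).
CART_V1 = {
    "name": "Cart",
    "fields": [
        {"name": "cart_id", "type": "string"},
        {"name": "item_map", "type": "map"},
    ],
}

# Hypothetical v2 schema: adds total_price with a default, so a reader
# using v2 can still interpret events that were written with v1.
CART_V2 = {
    "name": "Cart",
    "fields": CART_V1["fields"] + [
        {"name": "total_price", "type": ["null", "double"], "default": None},
    ],
}


def upgrade_record(record: dict[str, Any], reader_schema: dict) -> dict[str, Any]:
    """Project a record onto the reader schema, filling defaults for absent fields."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return result


# An event written before total_price existed, read with the v2 schema.
old_event = {"cart_id": "1234", "item_map": {"apple": 2}}
new_view = upgrade_record(old_event, CART_V2)
```

Giving the new field a default is what lets a consumer on the new schema keep reading older events. Without a default, the helper above fails, which mirrors how a compatibility rule would reject the change.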
And to tie this all together, we rely on the schema registry to streamline event stream usage. The Confluent Schema Registry provides a common point of reference for both the producer and consumer of an event.
The schema registry plays a very important role in event-driven architectures and is essential for ensuring that both the producer and consumer of an event have a common understanding.
If you would like to learn more about schemas and the schema registry, check out our Schema Registry 101 course. It covers the different kinds of schema formats, schema evolution rules, and how producers, consumers, and the schema registry all work together. It also includes a number of hands-on exercises that use Confluent Cloud to illustrate how it all works in practice.
Next, we’ll take a look at event Metadata and Headers.
An event is composed of a value payload with an explicit schema. Though the key field remains optional, it’s typically populated by a primitive value for partitioning purposes.
Consumers also have access to the event broker's metadata, including the event’s offset, partition, topic, and timestamp. The timestamp value may be either the event’s local creation time as supplied by the producer, or it may be the received time provided by the event broker. Timestamps are configurable at a per-topic level, so you can decide which one may work best for your use cases.
Kafka also provides you with record headers. These offer a space for additional context and custom metadata about the event, without affecting the structure of the key or value. Why would you want this? Consider including information about the origin of the event, such as the system that created it, or tracking-related information to help audit workflows and lineages. Or, for another example, your key and value payloads could be fully encrypted for business reasons, but you could use the header to insert unclassified information about the event to allow it to be routed correctly. Headers are not a replacement for the value payload, but rather provide supplemental information. You must explicitly share the format of the header key-value pairs with your consumers so that they can appropriately make use of the contents.
Depending on your data governance policies and rules, you may find it useful to implement standardized event headers in your organization. A standardized header can include information pertaining to event tracking, auditing, compliance requirements, and whatever other context that every producer in your organization needs to adhere to. You’ll also need to work with all of your event producers to ensure they can all publish their events as required.
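To make the idea of a standardized header concrete, here is a minimal sketch in Python. It represents headers as a list of (key, bytes value) pairs, since Kafka header values are byte arrays. The header keys (origin-service, trace-id, created-at) and both helper functions are hypothetical choices for illustration, not an established standard.

```python
from datetime import datetime, timezone

# Hypothetical header keys that an organization might standardize on.
REQUIRED_HEADER_KEYS = ("origin-service", "trace-id", "created-at")


def build_headers(origin_service: str, trace_id: str,
                  created_at: datetime) -> list[tuple[str, bytes]]:
    """Build the standard headers as (key, bytes value) pairs."""
    return [
        ("origin-service", origin_service.encode("utf-8")),
        ("trace-id", trace_id.encode("utf-8")),
        ("created-at", created_at.isoformat().encode("utf-8")),
    ]


def missing_header_keys(headers: list[tuple[str, bytes]]) -> list[str]:
    """Return the required keys that are absent from the given headers."""
    present = {key for key, _ in headers}
    return [key for key in REQUIRED_HEADER_KEYS if key not in present]


headers = build_headers(
    "cart-service", "trace-0001", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```

A check like missing_header_keys could run in a shared producer library or in a test suite, so that every team publishes the agreed-upon headers.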
The naming used so far in the modules has been fairly simple and sparse—a bit like using someone’s first name instead of their full name. The names we have used for the delta and fact events have been quite terse and don’t really provide much contextual information.
But events will come from multiple areas of your organization, from different teams and systems, and embedding that contextual data into the name of the event can help with both stream management and discovery.
Here’s one option for naming event streams. In this format, you combine the domain, the event type, and the version of the stream together into a single name.
For example, the orders domain from the previous sample domain mapping could contain a fact stream that contains all of the order facts. You could choose to add .v1 to the end of the name, or you could also go with a standard that anything ending without a version number is simply the initial version.
With this naming option, if we have to make a breaking schema change to the Customers.Advertisement stream, the new stream would be named Customers.Advertisement.v2. This indicates that there was a breaking change and that the first version will eventually be removed.
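The naming convention above can be sketched as a pair of small helpers. This assumes a dot as the delimiter and a lowercase v prefix for versions, matching the examples in the text. The function names stream_name and next_version_name are hypothetical.

```python
from typing import Optional


def stream_name(domain: str, event_type: str, version: Optional[int] = None) -> str:
    """Join domain, event type, and an optional version into a stream name."""
    for part in (domain, event_type):
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    name = f"{domain}.{event_type}"
    if version is not None:
        if version < 1:
            raise ValueError("version must be 1 or greater")
        name += f".v{version}"
    return name


def next_version_name(current: str) -> str:
    """Return the stream name to use after a breaking schema change."""
    base, _, suffix = current.rpartition(".")
    if base and suffix.startswith("v") and suffix[1:].isdigit():
        return f"{base}.v{int(suffix[1:]) + 1}"
    # A name without a version suffix is treated as the implicit first version.
    return f"{current}.v2"
```

For example, next_version_name("Customers.Advertisement") yields Customers.Advertisement.v2, whether or not your standard writes the initial version with an explicit .v1.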
You could also choose to include the service name in the event stream naming convention. Including the service name can reduce the ambiguity about where an event originates, especially when coupled with the domain. One reason to be careful about including the service name in the stream name is that the service that produces the event may itself change over time.
Consider an event stream produced by code inside a monolithic service. One day that code is refactored into its own microservice with a new name, and ownership of the stream is transferred. The service name in the event stream title may no longer reflect the actual ownership. Names are important with Kafka event streams since they are currently immutable and cannot be renamed.
Using service names tends to make the most sense when a stream is used internally to a domain - data on the inside - such as in the case of delta events for event sourcing. The producer, stream, and consumer are already tightly coupled, and any changes won’t impact anyone outside of the domain.
A third option, should you choose to leverage event headers, is to put the origin inside of the record header. This option gives you the origin information of the service that created the event while also decoupling the service identity from the topic name. If you end up changing the service that produces the order events, you can simply update the origin header information.
Standardizing event stream names helps prospective consumers find the data that is relevant to them. It also provides clues as to which version of a stream you’re planning to read from—it’s always a good practice to use the most recent versions of a stream, as earlier versions are usually deprecated and will likely be removed.
Event IDs provide the ability to uniquely identify one specific event. One of the important use cases is for auditing events. An exhaustive audit can reveal how many events were written to an event stream, which can be compared to how many events have been read on the other side. Both missing and duplicate events can be detected, so that you can take appropriate compensating actions. Another use case is to use an automatically incrementing ID to ensure that events are processed in the correct sequence—especially if you expect to see duplicates or late-arriving events.
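The audit described above can be sketched as a comparison between the IDs written and the IDs read. This is a minimal in-memory illustration. The audit function is hypothetical, and a real audit would collect the two ID lists from the producer and consumer sides rather than from Python lists.

```python
from collections import Counter
from typing import Iterable


def audit(produced_ids: Iterable[str],
          consumed_ids: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (missing, duplicates): IDs never consumed, and IDs consumed more than once."""
    produced = set(produced_ids)
    counts = Counter(consumed_ids)
    missing = sorted(produced - set(counts))
    duplicates = sorted(eid for eid, n in counts.items() if n > 1)
    return missing, duplicates


# Event "b" was never read, and event "a" was read twice.
missing, duplicates = audit(["a", "b", "c"], ["a", "a", "c"])
```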
Example ID formats: Hash(event_as_bytes), a20022c4-1fef-11ed-861d-0242ac120002, f49fde47-48a5-409b-8472-6f5ac7979ae8
The first format option to look at is a simple hash of the bytes in the event. Many modern hash functions make collisions extremely unlikely, so the resulting hash is unique for all practical purposes. While a hash event_id can provide you with uniqueness for deduplication purposes, it won’t be able to help you out if you’re looking for sequencing information.
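A hash-based event_id might look like the following sketch. SHA-256 and the JSON serialization are assumptions chosen for illustration. The text does not prescribe a hash function, and in practice you would hash whatever bytes your serializer actually produces.

```python
import hashlib
import json


def serialize(event: dict) -> bytes:
    """Serialize deterministically so identical events yield identical bytes."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def hash_event_id(event_bytes: bytes) -> str:
    """Derive an event ID from the event's bytes using SHA-256 (an assumed choice)."""
    return hashlib.sha256(event_bytes).hexdigest()


# The same logical event produces the same ID regardless of key order.
id_a = hash_event_id(serialize({"cart_id": "1234", "items": 2}))
id_b = hash_event_id(serialize({"items": 2, "cart_id": "1234"}))
id_c = hash_event_id(serialize({"cart_id": "1234", "items": 3}))
```

Two byte-identical events produce the same ID, which is what makes deduplication possible. The ID carries no ordering information, so it cannot tell a consumer which event came first.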
The second format tends to be a more useful option, and is similar to the event stream naming strategy discussed earlier in this module. The service name, event type, entity identifier, and a version ID compose the event-identifying string. What does this look like in practice?
Let’s take a look at creating a structured event_id for our trusty ol’ cart event. The event_id includes the shop domain, the cart type, the unique cart_id, and the event version, which here is the sequence_id. Although the cart_id uniquely identifies the cart, we rely on the sequence_id to indicate where this event falls in the sequence of events for that cart. In this case the 1 identifies that this is the first, and so far only, event to be published for this cart_id. The event_id is published alongside the rest of the event payload, including the item_map and shipping information.
Say we update the event - in this case, we remove some items from the cart, update the sequence_id from 1 to 2, and create the new event_id. Note that we explicitly used the sequence_id as part of the event_id to ensure uniqueness.
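A structured event_id of this kind can be sketched as a small value type. The dot delimiter, the field order, and the EventId class itself are assumptions for illustration. The text names the components but not the exact string format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EventId:
    domain: str
    event_type: str
    entity_id: str
    sequence_id: int

    def __str__(self) -> str:
        return f"{self.domain}.{self.event_type}.{self.entity_id}.{self.sequence_id}"

    @classmethod
    def parse(cls, text: str) -> "EventId":
        # Assumes exactly four dot-separated parts; an entity_id containing
        # a dot would raise ValueError here.
        domain, event_type, entity_id, sequence = text.split(".")
        return cls(domain, event_type, entity_id, int(sequence))

    def next(self) -> "EventId":
        """Return the ID for the next event published for the same entity."""
        return replace(self, sequence_id=self.sequence_id + 1)


first = EventId("shop", "cart", "1234", 1)   # first event for cart 1234
second = first.next()                        # after items are removed
```

Because the sequence_id is part of the string, the two events for the same cart get distinct IDs (shop.cart.1234.1 and shop.cart.1234.2 under these assumptions).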
So a consumer that reads this event has the option to store the event_id locally in its own data store. It can use this event_id both as part of a deduplication strategy, and also to ensure that it’s receiving all of the events, in the correct order.
When reading the next event for that specific cart_id, the consumer can look at two things: whether it has already seen this event_id, and whether the sequence_id is the next one it expects.
If it’s not the next in the sequence, the consumer will need to make a decision—does it wait for the proper sequence event? Store it for later? Or just apply it and move on? Exactly what the consumers choose to do is up to them.
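The consumer-side check described above could be sketched as follows. It assumes that sequence IDs start at 1 and increase by exactly 1 per cart_id, and that any sequence_id at or below the last applied one has already been seen. The CartEventTracker class and Decision enum are hypothetical, and the in-memory dictionary stands in for the consumer's own data store.

```python
from enum import Enum


class Decision(Enum):
    APPLY = "apply"                # the next expected event
    DUPLICATE = "duplicate"        # already applied, safe to skip
    OUT_OF_ORDER = "out_of_order"  # a gap exists; the consumer must decide


class CartEventTracker:
    def __init__(self) -> None:
        # Last applied sequence_id per cart_id.
        self._last_seq: dict[str, int] = {}

    def check(self, cart_id: str, sequence_id: int) -> Decision:
        last = self._last_seq.get(cart_id, 0)
        if sequence_id <= last:
            return Decision.DUPLICATE
        if sequence_id != last + 1:
            return Decision.OUT_OF_ORDER
        self._last_seq[cart_id] = sequence_id
        return Decision.APPLY


tracker = CartEventTracker()
results = [
    tracker.check("1234", 1),  # first event for this cart
    tracker.check("1234", 1),  # same event delivered again
    tracker.check("1234", 3),  # event 2 has not arrived yet
    tracker.check("1234", 2),  # the gap is filled
]
```

The tracker only reports what it observed. Whether an out-of-order event is buffered, applied anyway, or dropped remains the consumer's decision.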
While event IDs are not mandatory, they are a best practice. They do not require much effort to implement, and can really save consumers a lot of time and effort when it turns out they do need to deduplicate or reorder events.
We recommend that you examine your need for unique event IDs as early as possible, and come up with a scalable strategy for implementing them. It can be difficult and expensive to change your event ID format at a later time, so it’s generally worth spending the time and effort up front to find something that works for your organization’s use cases.
Event streams benefit from having a standardized name. It makes it easier for users to find and discover the data they need, while also providing a way to differentiate between similar events from different parts of your business.
Event IDs provide a way to uniquely identify each event. They can be useful for ensuring correct processing order, deduplication, auditing, and debugging.
Kafka’s built-in metadata provides you with information about the event in relation to the topic.
Meanwhile, headers provide you with the ability to add key-value pairs for auditing, tracking, and compliance that live outside of the event payload.
There are a number of factors to consider when designing events and event streams. This course covered the four main dimensions you should consider when designing and building your events and event streams. It also offered best practices related to schemas, schema evolution, identity, naming, and metadata.
If you’d like to keep building on what you’ve learned in this course, we recommend following up with two of our other courses. First is the previously mentioned Schema Registry 101 course—it covers schemas, evolution, and best practices for integrating with the Confluent Schema Registry.
Second is our Event Sourcing and Event Storage with Apache Kafka course, which covers a few other event-related subjects, including event sourcing and command query responsibility segregation (CQRS). These subjects tie into the four main dimensions presented in this course, and will help extend your knowledge on available event-driven patterns.