When designing events, we strive to ensure that all of the necessary data for the consumer is included within an event. However, data does not exist in isolation, and it may not be possible to reasonably contain the necessary context of a complex system inside of a single event.
Relational databases often use some degree of normalization in the creation of their data models, whereas document databases often lean toward a more denormalized format. We can draw some inspiration from these domains in our evaluation of the degree of denormalization suitable for event streams. This module looks at the trade-offs between normalization and denormalization of the event data model.
Normalized tables often lead to normalized event streams. One common reason for this is that many people often start their event-driven journey with Connectors that pull data directly in from the database and into a mirrored set of event streams. This approach is not ideal as it creates strong coupling between the internal database tables and the external event streams.
Let’s take a look at something more concrete. Here’s a simple data model that would translate well to a relational database. This model is revisited a few times over the remainder of the course.
An Item has a unique primary key, as well as some details about the name, price, and description. It also has a related Brand, as well as a related Tax Status.
Both the Brand and Tax Status tables are related to the item table by a foreign-key relationship. Though we only have one item in the Item table__, you can imagine that there may be thousands or millions of items, and that many of those items will have the same brand and tax status properties.
One common use case is to set up a connector for each table, pulling the data out of the table, composing it into events, and writing it to a dedicated event stream.
Highly normalized tables are often exposed as-is via a connector, leading to highly normalized event streams. While this is an easy way to quickly get started, it has significant impacts on how your consumers can use the data. They may find it difficult to use the data—for instance, if they want to treat Items differently based on brand name, or if they need to address tax statuses in their processing. While performing joins to resolve the relationships is possible, it’s more challenging in streaming than it is within a relational database.
In contrast, you may choose to denormalize the data before making it available to consumers. Denormalization can also act as an opportunity to create an abstraction layer between the data on the inside and the data on the outside.
Generally speaking, there are two main options for denormalizing data. First, data can be denormalized during event creation time, before it leaves the source system boundary. And second, after the normalized events have already been created, typically by using a purpose built joiner service. We’ll examine each of these options in the second half of the course, but for now, it’s worth examining why relational data is problematic in event streams.
It tends to be a lot easier to use and process events when they’re modeled similar to a document—flat, with few relationships, and with all the data present in a single payload.
In contrast, the relational model works extremely well when coupled with a database engine purpose built to handle relationships - like Postgres or MySQL. Unfortunately, a relational model using some degree of normalization doesn’t lend itself well to the needs of event-stream consumers.
Let’s take a look at why this is the case.
Looking back at the item to brand and tax status relationship, you can see that there are three tables and two foreign key relationships.
We then hooked up a Kafka Connector to each of these tables and created three topics that mirror the tables one to one, including the foreign key relationships.
If an item’s brand and tax status prove to be highly utilized by consumers, then we’re going to see each consumer service having to resolve the foreign key relationships over and over again. On their own, they don’t make for very useful events—and the more normalized your model, the more fine-grained, highly relational event streams you can end up with.
This strategy can incur high costs on consumers—both in engineering time to recreate the join code between applications, as well as in the processing power and storage requirements. Resolving streaming joins at scale can result in a lot of shuffling of data, which incurs processing power, networking, and storage costs. And this is without having applied any business logic!
Furthermore, not all consumer stream processing frameworks support joins, especially on foreign keys, which tends to be more challenging than joins on primary keys.
It’s important to think about making event streams easy to use for your consumers. Highly relational event streams are typically difficult to use—you either need to rebuild the relational model, or you need to denormalize it. And while you can denormalize relational streams with joins using Kafka Streams and ksqlDB, you may want to leverage other frameworks and tools that do not support streaming joins.
Aside from the difficulty in joining relational streams together, consumers are also now coupled on the internal model of the source system. Changes to the source system’s data model may end up impacting downstream consumers.
Let’s look at an example. Say we were to refactor an existing model - we take the item table and refactor it to extract the pricing into its own table.
Unfortunately, this has a significant downstream impact—we’d need to create a new connector and a new stream. Consumers coupled on the item stream model must refactor their code to deal with the changes, including deciding if they need to incorporate the new price stream. A simple upstream refactoring can result in a lot of work for downstream consumers, and it all stems from a tight coupling on the internal upstream model.
The solution to this problem is to isolate the consumers from the internal model using an abstraction layer. This is a specific and explicit external model for consumers to couple on - “data on the outside”.
Changes to the internal model remain isolated in the source systems, while consumers are provided with an officially decoupled data model purpose built for their usage. We can also leverage this abstraction layer to denormalize data on the inside to make it easier to use by downstream event consumers. All we need to do is make sure that we can effectively map the internal model to the external model.
There are a few ways you can implement the abstraction layer.
Let’s take a look at each of these now.
We know it can be painful for consumers to have to join all of the data together. What if you were to push it upstream instead?
In this example, the streams on the left mirror the tables they came from in the database. We join the events using a ksqlDB application based on the foreign-key relationships, and emit a single enriched item stream.
The hands-on exercise in the next module explores this option in depth.
Logically, we’re simply taking the existing foreign-key relationships and squashing them down into a single row.
BrandName has now been made part of the Enriched Item event - we decided to drop the
BrandId as it wasn’t necessary beyond resolving that join.
Next, we’ll also join in tax statuses in the same way.
Here is our final enriched item event.
This denormalized format is much easier to use by end consumers:
It’s also important to note that not all events will be sourced from the same system, and so denormalization at the producer won’t always be possible. In this example, we have a stream of item click events exhibiting user behavior. Note that this event doesn’t contain the richer item information such as the name, the price, and the description.
The item information is in another event stream that is published by another system. Talking to the item click stream consumers, you may discover that the very first thing they do, every single time they use item click stream events, is to join it with the item stream. And since you’re dealing with many click events, you discover that it ends up using a large amount of compute resources.
So just like we did with events sourced by connectors, we can build a purpose-built ksqlDB application to join the item clicks with the detailed item data, and emit it to an enriched item click stream.
The contents of the enriched event contain both the item click data, as well as select fields from the detailed item stream.
Denormalizing already-existing streams is just one way to make events easier to use for end users. Another way is to rely on the source database to denormalize and produce the event to the stream directly. For some databases, like a document database, this is a pretty natural step—your data is already denormalized and it’s easy to produce.
Relational databases are also well suited to denormalizing data before a write—after all, fast queries between relationships are pretty much their bread and butter.
This is where the transactional outbox comes into play.
In the transactional outbox pattern, we create a dedicated outbox table that we will use as an outbox for writing events to the stream.
We wrap all the necessary updates inside of a transaction. We write the internal update to the internal table, and then write the event to the outbox, and end the transaction. Transactions give us a guarantee that an event will only be written to the outbox if the internal state was also updated. The Transactional outbox pattern allows you to both isolate the internal data model as well as remodel the data for export to the outbox.
Finally, to get the data out of the outbox and into Kafka, we can use something like a change-data capture connector. This gets the data from the outbox tables, converts it into events, and writes it out to the event stream. We will also need to ensure that the outbox doesn’t grow indefinitely - we can either delete the data after it gets captured by CDC, or we can periodically delete it out of the table using a scheduled job.
Let’s go back to our item example again to see how this works in practice. Say you have an operation that updates the Item Table. We’re going to rename the item from Anvil to Iron Anvil.
Upon updating the data in the database, you now need to decide if you’re going to update and emit the updated item, complete with the denormalized tax status and brand table.
And so, our denormalization logic would need to execute the Brand and Tax Status Joins to create the new event. Note that the name has been updated to Iron Anvil.
The new enriched item event we created is due to a change of a value in the original item schema itself. But we also need to consider what happens when values change in either rand or tax status. If consumers are going to rely on the data to be eventually consistent, then we need to make sure that other changes to the model are also reflected downstream.
Say the ACME company undergoes a rebranding, and comes up with a new brand name—changing from ACME to Rotunda.
We would create another new event, triggering join logic to capture the change of the new brand name and denormalizing it into a denormalized item event.
Note that Rotunda may have many hundreds or thousands of items that are also updated by this change - it can cause quite a lot of events to be emitted.
There are a number of factors you must consider when choosing to denormalize at the source database.
We’ve already taken a look at selecting the tables to join on, and how the relationships may affect the join logic you need to trigger on.
Updates may occur in multiple areas of your internal model, and you must be sure you have accounted for the necessary changes. Querying a production database for the purposes of creating a denormalized event is also not free—you will need to consider the load on the service to determine feasibility, and either reduce requirements or increase the compute resources if they are insufficient.
And finally, databases are a wonder of multi-threaded concurrency - but you may be unable to capture all intermediate states in a series of rapid-fire concurrent updates. Race conditions may occur, causing you to miss certain transitions in the database records - you will need to evaluate your consumer’s use-cases to make sure it meets their needs.
One more important point before we close out this module—only denormalize where it is appropriate. As a set of principles, first only include data that is commonly used. Exclude any fields that aren’t relevant to downstream consumers, and feel free to leave some streams of less commonly used data in a normalized form, or simply don’t write it at all. Second, beware of denormalizing relationships that lead to data that is either very large, such as a blob of text detailing hundreds of user reviews for an item, or very frequently updated.
Inventory is a very good example of a field that is updated frequently, and that can lead to problems if you aren’t careful in evaluating the frequency of updates.
But let’s say we were to include that in our existing Item Model.
So we add an inventory table in our upstream monolithic database—it keeps track of inventory in real time. Every sale results in a deduction from what we have in inventory—so you can imagine that quite a lot of events will be coming out.
If we join the inventory data in as part of denormalization, we’re going to end up with a very verbose set of events—one every time the inventory updates.
One major downside is that this puts a heavy load on the event broker serving the data, and on each consumer using the data. There will be lots of updates, even if the consumer doesn’t care about inventory levels at all! And while in this example we’re using an external stream processor to join the data together, if you were to try the join each inventory update in a relational database, you’d also be joining each of the brand, tax status, and item table data as well—resulting in an increased load upstream as well as down.
Denormalization makes it easier for consumers to use data, but comes at the expense of more upstream processing and a careful selection of data to include. Consumers may have an easier time building applications and can choose from a wider range of technologies, including those that don’t natively support streaming joins.
Normalized event streams are common when producing data that tightly mirrors a relational model, and are fairly common when using connectors and change-data-capture. The onus of joining data is pushed downstream to the consumer, who typically must use a stream processor like ksqlDB to join it together prior to applying business logic.
Normalization versus denormalization is all about balancing the needs of consumers, the ability of the producer services to join data together, and the unique data model relationships. Listen to your consumer’s needs and work to build out an external abstraction layer to isolate upstream internal data models.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.