course: Designing Events and Event Streams

Dimension 2: Modeling as Normalized vs. Denormalized

16 min

Adam Bellemare

Staff Technologist, Office of the CTO (Presenter)

When designing events, we strive to ensure that all of the necessary data for the consumer is included within an event. However, data does not exist in isolation, and it may not be possible to reasonably contain the necessary context of a complex system inside of a single event.

Relational databases often use some degree of normalization in the creation of their data models, whereas document databases often lean toward a more denormalized format. We can draw some inspiration from these domains in our evaluation of the degree of denormalization suitable for event streams. This module looks at the trade-offs between normalization and denormalization of the event data model.

Normalized Tables Make Normalized Streams

normalized-streams

Normalized tables often lead to normalized event streams. One common reason is that many people start their event-driven journey with connectors that pull data directly from the database into a mirrored set of event streams. This approach is not ideal, as it creates strong coupling between the internal database tables and the external event streams.

Consider the Item Example

item-example

Let’s take a look at something more concrete. Here’s a simple data model that would translate well to a relational database. This model is revisited a few times over the remainder of the course.

An Item has a unique primary key, as well as some details about the name, price, and description. It also has a related Brand, as well as a related Tax Status.

foreign-key-relationship

Both the Brand and Tax Status tables are related to the Item table by a foreign-key relationship. Though we only have one item in the Item table, you can imagine that there may be thousands or millions of items, and that many of those items will have the same brand and tax status properties.

Highly Normalized Streams

highly-normalized-streams

One common use case is to set up a connector for each table, pulling the data out of the table, composing it into events, and writing it to a dedicated event stream.

Highly normalized tables are often exposed as-is via a connector, leading to highly normalized event streams. While this is an easy way to get started, it has a significant impact on how your consumers can use the data. For instance, consumers that want to treat items differently based on brand name, or that need tax status in their processing, must first resolve the foreign-key relationships themselves. While performing joins to resolve these relationships is possible, it's more challenging in streaming than it is within a relational database.
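To make that burden concrete, here is a minimal Python sketch (not real Kafka client code; record shapes are illustrative assumptions) of what each consumer of the normalized streams ends up writing: it materializes the Brand and Tax Status streams into lookup tables and resolves the foreign keys itself.

```python
# Each consumer that needs brand or tax details must build its own
# lookup tables from the Brand and Tax Status streams, then resolve
# the foreign keys on every item event it processes.

brands = {}        # brand_id -> latest brand record, built from the Brand stream
tax_statuses = {}  # tax_status_id -> latest record, built from the Tax Status stream

def on_brand_event(event):
    brands[event["brand_id"]] = event

def on_tax_status_event(event):
    tax_statuses[event["tax_status_id"]] = event

def on_item_event(item):
    # Resolve both foreign keys. The referenced records may not have
    # arrived yet, which is one of the hazards of consumer-side joins.
    brand = brands.get(item["brand_id"])
    tax = tax_statuses.get(item["tax_status_id"])
    if brand is None or tax is None:
        return None  # a real consumer would buffer or retry here
    return {**item, "brand_name": brand["name"], "tax_rate": tax["rate"]}
```

Every consumer of the item stream repeats some variant of this code, which is exactly the duplicated effort the rest of this module tries to eliminate.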

Normalization vs. Denormalization

denormalized-streams

In contrast, you may choose to denormalize the data before making it available to consumers. Denormalization can also act as an opportunity to create an abstraction layer between the data on the inside and the data on the outside.

Generally speaking, there are two main options for denormalizing data. First, data can be denormalized at event creation time, before it leaves the source system boundary. Second, data can be denormalized after the normalized events have already been created, typically by a purpose-built joiner service. We'll examine each of these options in the second half of the course, but for now, it's worth examining why relational data is problematic in event streams.

It tends to be a lot easier to use and process events when they’re modeled similar to a document—flat, with few relationships, and with all the data present in a single payload.

In contrast, the relational model works extremely well when coupled with a database engine purpose-built to handle relationships, like Postgres or MySQL. Unfortunately, a relational model using some degree of normalization doesn't lend itself well to the needs of event-stream consumers.

Let’s take a look at why this is the case.

Option 1: Highly Normalized Streams

highly-normalized-streams

Looking back at the item to brand and tax status relationship, you can see that there are three tables and two foreign key relationships.

Streaming Joins Can Be Costly

streaming-joins

We then hooked up a Kafka Connector to each of these tables and created three topics that mirror the tables one to one, including the foreign key relationships.

3-mirror-topics

If an item’s brand and tax status prove to be highly utilized by consumers, then we’re going to see each consumer service having to resolve the foreign key relationships over and over again. On their own, they don’t make for very useful events—and the more normalized your model, the more fine-grained, highly relational event streams you can end up with.

This strategy can incur high costs on consumers—both in engineering time to recreate the join code between applications, as well as in the processing power and storage requirements. Resolving streaming joins at scale can result in a lot of shuffling of data, which incurs processing power, networking, and storage costs. And this is without having applied any business logic!

same-joins

Furthermore, not all consumer stream processing frameworks support joins, especially joins on foreign keys, which tend to be more challenging than joins on primary keys.

It’s important to think about making event streams easy to use for your consumers. Highly relational event streams are typically difficult to use—you either need to rebuild the relational model, or you need to denormalize it. And while you can denormalize relational streams with joins using Kafka Streams and ksqlDB, you may want to leverage other frameworks and tools that do not support streaming joins.

Consumers Couple on the Internal Model

internal-model-coupling

Aside from the difficulty in joining relational streams together, consumers are also now coupled on the internal model of the source system. Changes to the source system’s data model may end up impacting downstream consumers.

Let’s look at an example. Say we were to refactor an existing model - we take the item table and refactor it to extract the pricing into its own table.

relational-database-refactoring

Unfortunately, this has a significant downstream impact—we’d need to create a new connector and a new stream. Consumers coupled on the item stream model must refactor their code to deal with the changes, including deciding if they need to incorporate the new price stream. A simple upstream refactoring can result in a lot of work for downstream consumers, and it all stems from a tight coupling on the internal upstream model.

Create an Abstraction Layer and External Model

abstraction-layer

The solution to this problem is to isolate the consumers from the internal model using an abstraction layer. This is a specific and explicit external model for consumers to couple on - “data on the outside”.

Changes to the internal model remain isolated in the source system, while consumers are provided with an officially supported, decoupled data model purpose-built for their usage. We can also leverage this abstraction layer to denormalize the data on the inside, making it easier for downstream event consumers to use. All we need to do is make sure that we can effectively map the internal model to the external model.
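As a sketch of what such a mapping might look like, here is a hypothetical Python function (field names are assumptions for illustration, not from the course) that converts internal table rows into the external event model:

```python
def to_external_item_event(item_row, brand_row, tax_row):
    """Build the public, denormalized event from internal rows.

    This function IS the abstraction layer: internal refactorings only
    require updating this mapping, not every downstream consumer.
    """
    return {
        "item_id": item_row["id"],
        "name": item_row["name"],
        "price": item_row["price"],
        "brand_name": brand_row["name"],  # foreign key resolved internally
        "tax_status": tax_row["status"],  # foreign key resolved internally
        # Internal identifiers (brand_id, tax_status_id) are deliberately
        # not exposed; they belong to the internal model.
    }
```

If the source system later splits pricing into its own table, only this mapping changes; the external event keeps its shape and consumers are unaffected.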

There are a few ways you can implement the abstraction layer.

  • One option is to let the streams mirror the internal model, but reconstruct them into something more useful before the consumers get to them.
  • A second option is to leverage what’s known as the Transactional Outbox pattern to reconstruct your data at the source - in the original database.

Let’s take a look at each of these now.

Option 1: Reconstruct the Streams

We know it can be painful for consumers to have to join all of the data together. What if you were to push it upstream instead?

join-events-together

In this example, the streams on the left mirror the tables they came from in the database. We join the events using a ksqlDB application based on the foreign-key relationships, and emit a single enriched item stream.

The hands-on exercise in the next module explores this option in depth.

Denormalization of Streams

denormalize-streams

Logically, we’re simply taking the existing foreign-key relationships and squashing them down into a single row.

single-row

The BrandName has now been made part of the Enriched Item event - we decided to drop the BrandId as it wasn’t necessary beyond resolving that join. Next, we’ll also join in tax statuses in the same way.
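Logically, the brand join is just a merge of the two records in which the now-redundant foreign key is dropped. A minimal Python sketch (field names follow the figures):

```python
def join_item_brand(item, brand):
    # Flatten the foreign-key relationship into a single record,
    # dropping BrandId since it's no longer needed after the join.
    enriched = {**item, "BrandName": brand["BrandName"]}
    enriched.pop("BrandId", None)
    return enriched
```

The tax status join follows the same shape, yielding the fully flattened enriched item event.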

denormalized-items

Here is our final enriched item event.

This denormalized format is much easier to use by end consumers:

  • They don’t have to join the data together
  • They are also no longer directly exposed to the internal data model of the source

Alternative Use Case – User Behavior Events

user-behavior-events

It’s also important to note that not all events will be sourced from the same system, and so denormalization at the producer won’t always be possible. In this example, we have a stream of item click events exhibiting user behavior. Note that this event doesn’t contain the richer item information such as the name, the price, and the description.

Need to Enrich Item Clicks with Item Details

enrich-items

The item information is in another event stream that is published by another system. Talking to the item click stream consumers, you may discover that the very first thing they do, every single time they use item click stream events, is to join it with the item stream. And since you’re dealing with many click events, you discover that it ends up using a large amount of compute resources.

Build a Purpose-Built Joiner

purpose-built-joiner

So just like we did with events sourced by connectors, we can build a purpose-built ksqlDB application to join the item clicks with the detailed item data, and emit it to an enriched item click stream.
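In Python terms (a hedged sketch of the joiner's logic, not ksqlDB; field names are illustrative), the joiner materializes the item stream into a table and enriches each click as it arrives:

```python
items = {}  # item_id -> latest item record, a table view of the item stream

def on_item(item):
    # Keep only the latest version of each item (table semantics).
    items[item["item_id"]] = item

def enrich_click(click):
    item = items.get(click["item_id"])
    if item is None:
        return None  # item not yet seen; a real joiner would buffer or retry
    # Copy only the select fields consumers actually need.
    return {**click, "item_name": item["name"], "price": item["price"]}
```

Running this join once, in one service, replaces the identical join every click-stream consumer would otherwise perform.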

Enriched Item Data Is Easier to Use

enriched-data-easier

The contents of the enriched event contain both the item click data, as well as select fields from the detailed item stream.

Denormalizing already-existing streams is just one way to make events easier to use for end users. Another way is to rely on the source database to denormalize and produce the event to the stream directly. For some databases, like a document database, this is a pretty natural step—your data is already denormalized and it’s easy to produce.

Relational databases are also well suited to denormalizing data before a write—after all, fast queries between relationships are pretty much their bread and butter.

This is where the transactional outbox comes into play.

Transactional Outbox Pattern

transactional-outbox-1

In the transactional outbox pattern, we create a dedicated outbox table that acts as a buffer for events to be written to the stream.

transactional-outbox-2

We wrap all the necessary updates inside of a transaction: we write the internal update to the internal table, then write the event to the outbox, and end the transaction. Transactions give us a guarantee that an event will only be written to the outbox if the internal state was also updated. The transactional outbox pattern allows you to both isolate the internal data model and remodel the data for export to the outbox.

transactional-outbox-3

Finally, to get the data out of the outbox and into Kafka, we can use something like a change-data-capture connector. This gets the data from the outbox table, converts it into events, and writes it out to the event stream. We will also need to ensure that the outbox doesn't grow indefinitely: we can either delete the data after it gets captured by CDC, or periodically delete it from the table using a scheduled job.
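Here is a minimal runnable sketch of the pattern using SQLite (table and column names are assumptions for illustration): the internal update and the outbox insert either commit together or not at all.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")

def update_item_name(item_id, new_name):
    # 'with conn' opens a transaction that commits on success and rolls
    # back on error, so the event reaches the outbox only if the
    # internal table update also succeeds.
    with conn:
        conn.execute("UPDATE item SET name = ? WHERE id = ?", (new_name, item_id))
        event = {"item_id": item_id, "name": new_name}  # remodeled external event
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
```

A CDC connector (or a polling job) would then read rows from `outbox`, publish them to the event stream, and delete them.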

Denormalize Upon Change

denormalize-upon-keep

Let’s go back to our item example again to see how this works in practice. Say you have an operation that updates the Item Table. We’re going to rename the item from Anvil to Iron Anvil.

operation-update-item

Upon updating the data in the database, you now need to decide if you're going to update and emit the updated item, complete with the denormalized tax status and brand data.

Produce an Updated Event

produce-updated-event

And so, our denormalization logic would need to execute the Brand and Tax Status Joins to create the new event. Note that the name has been updated to Iron Anvil.

The new enriched item event was triggered by a change to a value in the item table itself. But we also need to consider what happens when values change in either the brand or tax status tables. If consumers are going to rely on the data being eventually consistent, then we need to make sure that changes to those related tables are also reflected downstream.

Update the Brand Name

update-brand-name

Say the ACME company undergoes a rebranding, and comes up with a new brand name—changing from ACME to Rotunda.

Produce Another Updated Event

produce-another-event

We would create another new event, triggering the join logic to capture the new brand name and denormalize it into an enriched item event.

Note that Rotunda may have many hundreds or thousands of items that are also updated by this change - it can cause quite a lot of events to be emitted.
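The fan-out is easy to see in a sketch: when a brand record changes, every item referencing it must be re-emitted as an updated enriched event (names are illustrative assumptions).

```python
def on_brand_update(brand, items_by_brand):
    """Return one updated enriched event per item affected by a brand change.

    items_by_brand maps a BrandId to the list of items referencing it;
    a real implementation would query this from state or the database.
    """
    return [
        {**item, "BrandName": brand["BrandName"]}
        for item in items_by_brand.get(brand["BrandId"], [])
    ]
```

With a thousand items under the brand, one rename produces a thousand new events, which is exactly the burst described above.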

Option 2: Producer-Side Denormalization

There are a number of factors you must consider when choosing to denormalize at the source database.

  • What tables do you join?
  • What relationships do you join on?
  • What is the database load?
  • Have you considered race conditions?

We’ve already taken a look at selecting the tables to join on, and how the relationships may affect the join logic you need to trigger on.

Updates may occur in multiple areas of your internal model, and you must be sure you have accounted for the necessary changes. Querying a production database for the purposes of creating a denormalized event is also not free—you will need to consider the load on the service to determine feasibility, and either reduce requirements or increase the compute resources if they are insufficient.

And finally, databases are a wonder of multi-threaded concurrency - but you may be unable to capture all intermediate states in a series of rapid-fire concurrent updates. Race conditions may occur, causing you to miss certain transitions in the database records - you will need to evaluate your consumers' use cases to make sure the events still meet their needs.

Only Denormalize Where Appropriate

One more important point before we close out this module—only denormalize where it is appropriate. As a set of principles:

  • First, only include data that is commonly used. Exclude any fields that aren’t relevant to downstream consumers, and feel free to leave some streams of less commonly used data in a normalized form, or simply don’t write them at all.
  • Second, beware of denormalizing relationships that lead to data that is either very large, such as a blob of text detailing hundreds of user reviews for an item, or very frequently updated.

Expanded Example – Include Inventory

include-inventory

Inventory is a very good example of a field that is updated frequently, and that can lead to problems if you aren’t careful in evaluating the frequency of updates.

But let’s say we were to include that in our existing Item Model.

inventory-table

So we add an inventory table in our upstream monolithic database—it keeps track of inventory in real time. Every sale results in a deduction from what we have in inventory—so you can imagine that quite a lot of events will be coming out.

inventory-real-time

If we join the inventory data in as part of denormalization, we’re going to end up with a very verbose set of events—one every time the inventory updates.

inventory-update-event

One major downside is that this puts a heavy load on the event broker serving the data, and on each consumer using the data. There will be lots of updates, even if the consumer doesn’t care about inventory levels at all! And while in this example we’re using an external stream processor to join the data together, if you were to try to join each inventory update in a relational database, you’d also be joining in each of the brand, tax status, and item table data as well—resulting in increased load upstream as well as down.

Balance Consumer and Producer Needs

Denormalization makes it easier for consumers to use data, but comes at the expense of more upstream processing and a careful selection of data to include. Consumers may have an easier time building applications and can choose from a wider range of technologies, including those that don’t natively support streaming joins.

Normalized event streams are common when producing data that tightly mirrors a relational model, as is often the case when using connectors and change data capture. The onus of joining the data is pushed downstream to consumers, who typically must use a stream processor like ksqlDB to join it together before applying business logic.

Normalization versus denormalization is all about balancing the needs of consumers, the ability of producer services to join data together, and the unique relationships of your data model. Listen to your consumers' needs and work to build out an external abstraction layer that isolates upstream internal data models.


Dimension 2: Modeling as Normalized vs. Denormalized

Hi, I'm Adam from Confluent. And today we're gonna be talking about modeling events as normalized or denormalized. When designing events, we strive to ensure that all of the necessary data is included within an event. However, data does not exist in isolation, and it may not be possible to reasonably represent an extensive data model inside of a single event. Relational databases often use some degree of normalization in the creation of their data models. In contrast, document databases often lean towards a more denormalized format. We can draw some inspiration from these domains to evaluate the degree of denormalization suitable for event streams. In this module, we'll look at the trade offs between normalization and denormalization of the event data model, and recommend some best practices for you to follow. Normalized tables often lead to normalized event streams. One common reason for this is that many people often start their event driven journey with connectors that pull data directly in from the database and into a mirrored set of event streams. This approach is not ideal, as it creates a strong coupling between the internal database tables and the external event streams. Let's take a look at something a little more concrete. Here is a simple data model that translates well to a relational database. Now, we're gonna be revisiting this model a few times over the remainder of this course. In this model, an item has a unique primary key, as well as some details about the name, price, and description of the item. It also has a related brand, as well as a related tax status. Both the brand and tax status tables are related to the item table by foreign key relationships. Though we only have one item in the item table, you can imagine that there may be actually thousands or millions of items, and that many of those items will have the same brand and same tax status properties.
One common use case is to set up a connector for each table, pulling the data out of the database, reformatting it into events, and writing them each to dedicated event streams. This approach leads to highly normalized event streams. And while this is an easy way to get started, it does result in a significant impact on how your consumers can use this data. For one, they will need to resolve foreign key relationships for some use cases, such as processing some items differently based on brand names or tax rates. While performing joins to resolve the relationships is possible, joins are a more challenging and expensive operation in streaming frameworks than it is within a relational database. In contrast, you may choose to denormalize the data before making it available to consumers. Denormalization can also act as an opportunity to create an abstraction layer between the data on the inside and the data on the outside. Generally speaking, there are two main options for denormalizing data. First, data can be denormalized during event creation time, before it leaves the system boundary. And second, data can also be denormalized after it has left the system boundary, but before the consumer uses it. We'll examine each of these options in the second half of the course, but for now, it's worth examining why relational data is problematic in event streams. It tends to be a lot easier to use and process events when they're modeled similar to a document: flat, with few relationships, and with all of the data present in a single payload. In contrast, the relational model works extremely well when coupled with a database engine purpose built to handle relationships, a database like Postgres or MySQL. Unfortunately, a relational model using some degree of normalization does not lend itself well to the needs of event stream consumers. Let's take a look to see why this is the case. 
Looking back at the item, brand, and tax status relationship, we can see that there are three tables and two foreign key relationships. We then hooked up a Kafka connector to each of these tables and created three topics that mirror the tables one to one, including the foreign key relationships. If an item's brand and tax status prove to be highly utilized by consumers, then we're going to see each consumer service having to resolve the foreign key relationships over and over again. On their own, they don't make for very useful events. And the more normalized your data model, the more fine grained, highly relational event streams you can end up with. Highly normalized streams can incur high costs on consumers, both in engineering time to recreate the join code between applications, as well as processing power and storage requirements. Resolving streaming joins at scale can result in a lot of shuffling of data, which incurs processing power, networking, and storage costs. And this is without even yet having applied any business logic. Furthermore, not all consumer stream processing frameworks support joins. This is especially true for foreign key and non-window joins, which add even more complexity and further reduce consumer framework options. It's important to think about making event streams easy to use for your consumers. Highly relational event streams are typically difficult to use. You either need to rebuild the entire relational model, or you need to denormalize it. And while you can denormalize relational streams with joins using Kafka Streams or ksqlDB, you may want to leverage other frameworks and tools that do not support streaming joins. Aside from the difficulty in joining relational streams together, consumers are also now coupled on the internal model of the source system. Changes to the source system's data model may end up impacting downstream consumers. Let's look at an example. Say we were to refactor an existing model. 
We take the existing item table and refactor it to extract the pricing into its own table. Unfortunately, this has a significant impact downstream. We'd need to create a new connector and a new stream. Consumers coupled on the item stream model must refactor their code to deal with the changes, including deciding if they need to incorporate the new price stream. A simple upstream refactoring can result in a lot of work for downstream consumers, and it all stems from a tight coupling on the internal upstream model. The solution to this problem is to isolate the consumers from the internal model using an abstraction layer. This is a specific and explicit external model for consumers to couple on. This is the data on the outside. Changes to the internal model remain isolated in the source systems, while consumers are provided with an officially supported decoupled data model purpose built for their usage. We can also leverage the abstraction layer to denormalize data and to make it easier to use for downstream consumers. All we need to do is to make sure that we can effectively map the internal model to the external model. There are a few ways you can implement the abstraction layer. One option is to let the streams mirror the internal model, but then reconstruct them into something more useful before the consumers get to use them. A second option is to leverage what's known as the transactional outbox pattern to reconstruct your data at the source in the original database. Let's take a look at each of these now. In this example, the streams are on one side, and our streams mirror the tables that they came from in the database. Next, we create a ksqlDB application that joins the events based on their foreign key relationships and emits a single enriched item stream. We're going to explore this option in depth in the hands-on exercise in the next module. Logically, we're simply taking the existing foreign key relationships and squashing them down into a single row. 
First, let's join in the brand data. The BrandName has now been made part of the enriched item event. We decided to drop the BrandId, as it wasn't necessary anymore, beyond resolving that original join. Next, we'll also join in the tax statuses in the same way. Finally, here's our enriched item event. This denormalized format is much easier to use by end consumers. They don't have to join the data together, and they're also no longer directly exposed to the internal data model of the source. It's also important to note that not all events will be sourced from the same system, and so denormalization at the producer, that we'll see in the next section, won't always be possible. In this example, we have a stream of item click events exhibiting user behavior. Note that this event doesn't contain the richer item information such as the name or the price or the description. The item information is in a separate event stream that is published by another system. Talking to the item click stream consumers, you may discover that the very first thing they do every single time they use this click stream is to join it with the item data. And since you're dealing with an awful lot of click events, you'll discover that it ends up using an awful lot of compute resources. So just like we did with event sourced by connectors, we create a purpose built ksqlDB application to join the item clicks with the detailed item data and emit it to an enriched item click stream. The contents of the enriched event stream contain both the item click data as well as select fields from the detailed item stream. Denormalizing already existing streams is just one of the ways to make events easier to use for your consumers. Another way is to rely on the source database to denormalize and produce the event to the stream directly. For some databases, like a document database, this is a pretty natural step. Your data's already denormalized, and it's relatively easy to produce. 
Relational databases are also pretty well suited to denormalizing data before a write. After all, fast queries between relationships is pretty much its bread and butter. This is where the transactional outbox comes into play. In the transactional outbox pattern, we create a dedicated table that we will use as a buffer for writing events to the stream. We wrap all of the necessary updates inside of a transaction. We write the internal updates to the internal table, and then write the event to the outbox and end the transaction. Transactions give us a guarantee that the event will only be written to the outbox if the internal state was also updated. The transactional outbox pattern allows you to both isolate the internal data model, as well as remodel the data for export to the outbox. Finally, to get the data out of the outbox and into Kafka, we can use something like a change data capture connector. This gets the data from the outbox tables, converts it into events, and writes it to the event stream. We will also need to ensure, though, that the outbox doesn't grow indefinitely. We can either delete the data after it gets captured by change data capture, or we can periodically delete it out of the table using a scheduled job. Let's go back to our item example again to see how this works in practice. Say you have an operation that updates the item table. We're gonna rename the item from Anvil to Iron Anvil. Upon updating the data in the database, we now need to decide if we're going to update and emit the updated item, complete with a denormalized tax status and brand. And so, our denormalization logic would need to execute the brand and tax status joins to create the new event. Note that the name has been updated to Iron Anvil. The enriched item event we created is due to a change of a value in the original item schema itself. But we also need to consider what happens when values change in either the brand or the tax status. 
If consumers are relying on the data to be eventually consistent, then we need to make sure that other changes in the model are also reflected downstream. Say the Acme company undergoes a rebranding and comes up with a new brand name, changing from Acme to Rotunda. We would create another new event, triggering join logic to capture the change of the new brand name and denormalizing it into an enriched item event. Note that Rotunda may also have many hundreds or thousands of other items that will also need to be updated by the change. It can cause quite a lot of events to be emitted. There are a number of factors you must consider when choosing to denormalize at the source database. We've already taken a look at selecting the tables to join on, and how the relationships may affect the join logic you need to trigger on. Updates may occur in multiple code paths, so you'll need to ensure that you've tracked them all down. Querying a production database for the purposes of creating a denormalized event is not free. You will need to consider the load on the service to determine feasibility, and either reduce the amount of work you're doing or increase their compute resources if they're insufficient. And finally, databases are a wonder of multi-threaded concurrency, but you may be unable to capture all intermediate states in a series of rapid fire concurrent updates. Race conditions may occur, causing you to miss certain transitions in the database records. You will need to evaluate your consumer's use cases to make sure that this can meet their needs. One more important point before we close out this module: Only de-normalize where it is appropriate. As a set of principles, first, only include data that is commonly used. Exclude any fields that aren't relevant to downstream consumers, and feel free to leave some streams of less commonly used data in a normalized form, or simply don't write them at all.
Secondly, beware of denormalizing relationships that lead to data that is either very large, such as a blob of text detailing hundreds of user reviews, or very frequently updated, like inventory. Inventory is a very good example of a field that is updated frequently. And this can lead to problems if you aren't careful in evaluating the frequency of updates. But let's say that we were to include this in our existing item model. So we add an inventory table in our upstream database. It keeps track of inventory in real time. Every sale results in a deduction from what you have in inventory. So you can imagine that's quite a lot of events that will be coming out. If we join the inventory data in as part of denormalization, we're going to end up with a very verbose set of events: one event every time the inventory updates. One major downside is that this puts a heavy load on the event broker serving the data, and on each consumer reading the data. There will be lots of updates, even if the consumer doesn't care about inventory levels at all. And while in this example, we're using an external stream processor to join the data together, if you were trying to join each inventory update in a relational database, you'd also be joining in each of the brand, tax status, and item data in that query as well, resulting in a significant load upstream as well as down. Normalized event streams are common when producing data that tightly mirrors a relational model, as is often the case when using connectors and change data capture. Creating an external abstraction layer is very important for isolating the source data model and protecting consumers from excessive change. Denormalizing events makes it easier for consumers to use data, but it does come at the expense of needing dedicated resources to de-normalize that data. The goal is to make it easy for consumers to build applications, including using technologies that may not support streaming joins. 
It all comes down to balancing the needs of the consumers, the ability of upstream services to join data together, and the unique data model relationships. Finally, listen to your consumers. They'll let you know what data they need. Work with them to make sure that it's easy to use.