Staff Software Practice Lead
When building a distributed system, developers are often faced with something known as the dual-write problem. It occurs whenever the system needs to perform individual writes to separate systems that can't be transactionally linked. This situation creates the potential for data loss if the developer isn't careful. However, techniques such as the Transactional Outbox Pattern and Event Sourcing can be used to guard against the potential for data loss while also providing added resilience to the system.
Topics:
I've worked on a variety of large systems and have encountered the dual-write problem on many occasions.
The worst part is, I didn't even know it was a problem for a long time.
But, now that I do, I want to share with you how to solve it.
We'll do that by looking at a concrete example of a bank that has built an event-driven microservice that is encountering the dual-write problem.
Let's start by looking at what was supposed to happen.
Although the fraud detection system has been extracted to a microservice,
most of the banking logic still lives inside a legacy monolith.
When a user issues a transaction , whether it is a payment, a deposit, a purchase, or something else,
the record of that transaction is stored in the monolith's database.
Meanwhile, the system should emit a TransactionRecorded event to Apache Kafka.
The Fraud Detection service will listen for the TransactionRecorded events and process them asynchronously.
At least, that's how it's supposed to happen.
Most of the time, this seems to work fine.
However, once in a while, it looks like those TransactionRecorded events are being dropped.
The transactions are logged in the database correctly, but the events never get to Apache Kafka.
This happens because of a flaw in the code.
When a transaction occurs they need to perform two separate writes.
One goes to the database while the other goes to Kafka.
The problem occurs when one write succeeds, but the other fails.
In other words, the write to the database goes through just fine, but the write to Kafka fails.
There are many reasons why this could happen ranging from network or hardware failures to software exceptions and system upgrades.
When it happens, the events are lost.
This is known as the dual-write problem.
It occurs anytime you write to multiple systems that aren't transactionally linked.
If you aren't familiar with the dual-write problem, check out the video linked in the description for more details.
Now, the first thought might be to solve this problem with something like a database transaction.
Unfortunately, there is no way to create a transaction that covers both the database and Apache Kafka.
No matter how we arrange the code there will always be a possibility that one operation will succeed, while the other fails.
And that will leave us in an inconsistent state.
However, there is actually a way to solve this problem using a transaction, just not one that spans the database and Kafka .
The Transactional Outbox Pattern is designed for exactly these situations.
To use it, Tributary can start by creating an "outbox" table.
This is a simple database table containing the events that are sent to Kafka, including event IDs and any other relevant metadata.
When a new financial transaction occurs, it is written to the database and within the same database transaction , an event is written to the outbox table.
Because this happens in a database transaction, it's guaranteed to be atomic.
Either both writes succeed or they both fail.
Once the database transaction is complete, the system can respond to the user to indicate success.
There is no need to wait for the event to be sent to Kafka.
Next, a separate process can
scan the outbox table
and emit the events it finds to Kafka.
This process can be written by the team, but they could also leverage a Change Data Capture or CDC system if the database supports it.
It is critical that this process operates with an at-least-once delivery guarantee.
If an event can't be sent, then it needs to be retried until it succeeds.
In some circumstances, retries could result in duplicate messages.
Therefore, it's important that downstream consumers operate idempotently.
However, that's generally a best practice for Kafka consumers anyway.
This is where metadata such as the event ID can be helpful because it allows the consumer to deduplicate.
Once an event has been successfully emitted to Kafka it can be marked as complete.
Alternatively, the event could be deleted from the outbox table.
However, banks are not usually fond of deleting data so they might keep it around.
The Transactional Outbox Pattern is powerful and would get the job done.
However, let's take a moment and think about how banks operate.
Each financial transaction is recorded as a separate entry in the database.
A bank doesn't just store the balance of your account, they store all of the transactions that led to that balance.
Essentially, bank accounts are event-sourced.
Because data is stored in an event-sourced fashion, they could potentially simplify the process and eliminate the outbox table.
Whenever a new financial transaction is written to the database, a separate process could scan those transactions
and emit them to Kafka.
All of the information needed to construct the event is hopefully available in the individual database records.
That makes this a simple scan, transform, and emit process.
There's no need to create a separate outbox table because the database records are the events.
There is an added benefit to implementing either a transactional outbox or event-sourcing.
Whenever we write to an external system, there is a possibility of failure.
That's true whether we are dealing with a database,
Apache Kafka,
a separate microservice, or anything else.
By storing the data and doing the write asynchronously, we isolate the system from potential failures.
If a failure occurs while trying to write the events to Kafka, that's ok.
As soon the system recovers, the process can resume and since all of the events are stored in the database, there is no risk of data loss.
These failures result in delays, however, from an end-user perspective, a delay is probably better than the system refusing the transaction.
The dual-write problem is tricky, but if we take a moment and ask the right questions, it's not necessarily hard to solve.
But, we shouldn't ignore it because doing so will result in data loss and that's rarely acceptable.
We need to consider the consequences of losing data.
In the case of a financial institution such as Tributary Bank, it's unacceptable.
It could result in fairly harsh consequences not just from customers, but potentially regulators as well.
The transactional nature of banking makes this easier than it might be in other domains.
However, techniques like the transactional outbox and event sourcing can be applied no matter what domain you are working in.
Thank you for joining me today.
Now, don't forget to drop some events into your own outbox.
You can do that with a like, share, or subscribe.
Or even better, send me a question or a comment and I'll do my best to respond.
Otherwise, I'll see you next time.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.