Software Practice Lead
Dive into Kafka and Confluent Cloud's resilience and disaster recovery strategies to safeguard your data and ensure business continuity.
Now, there’s an elephant in the room that we need to discuss. Let me explain.
You know that Kafka has been built to be resilient.
In case of a broker failure, the data is not lost because the remaining nodes have copies of that data.
New leaders would be elected for the affected partitions, and applications would rebalance and get the data from another broker.
When the failing broker comes back online, everything would be back to normal.
But what would happen if the whole cluster failed?
In this module, we’ll see how to protect your business from disasters with Confluent Cloud.
Even though cloud service providers invest heavily in infrastructure redundancy, fault tolerance, and disaster recovery mechanisms to minimize the impact of outages and maintain high availability, disasters can still strike at any time.
Consider the following scenarios:
The datacenter hosting your cluster is disrupted, maybe by a power outage.
Or the availability zone with all your clusters is disrupted, maybe by a fire or a flood.
Or all availability zones hosting your clusters in the region are disrupted because of software bugs, an incorrect network configuration change, or lightning strikes.
If you’ve recognized a situation where your business could be at risk, then you must assess the business impact to consider ways of mitigating such a disruption.
This impact can be broken down into the following categories:
Direct and indirect impact from data loss or unavailability, revenue, compliance considerations, reputational impact, and finally, loss of customer lifetime value.
With Confluent Cloud, it’s easy to scale across multiple data centers, as it’s been designed to leverage the concept of availability zones.
You can create clusters that stretch across 3 zones. Create your topics with a replication factor of 3 and set the ‘min.insync.replicas’ config to 2.
It means that each topic partition has 3 replicas and that the data will not be committed unless at least 2 replicas are considered in-sync.
Both represent safeguards to prevent data loss.
When using Confluent Cloud Standard clusters across multiple zones, you get an uptime SLA of four nines (99.99%)!
As a heads up, that means less than 53 minutes of downtime per year.
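As a quick back-of-the-envelope check, here is a minimal Python sketch of the downtime budget that an availability percentage implies. The function name is only illustrative, and the calculation assumes a 365-day year.

```python
# Downtime budget implied by an availability SLA (simple arithmetic, 365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% -> {downtime_minutes_per_year(sla):.1f} minutes per year")
```

For four nines this works out to roughly 52.6 minutes per year.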
If one availability zone goes down, you can still perform writes.
If 2 availability zones are down, you can still read data.
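To make the read and write behavior concrete, here is a small toy model in Python. It is only a sketch under simplifying assumptions: one replica per availability zone, producers using acks=all, and a surviving replica that is in sync and can serve reads. The class and method names are made up for illustration and are not part of any Kafka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Toy model of one partition with one replica per availability zone."""
    replication_factor: int
    min_insync_replicas: int
    zones_down: int

    @property
    def in_sync_replicas(self) -> int:
        # Assumption: every replica in a healthy zone is in sync.
        return max(self.replication_factor - self.zones_down, 0)

    def can_write(self) -> bool:
        # With acks=all, a write is accepted only while the number of
        # in-sync replicas is at least min.insync.replicas.
        return self.in_sync_replicas >= self.min_insync_replicas

    def can_read(self) -> bool:
        # Reads only need one surviving replica to act as leader.
        return self.in_sync_replicas >= 1


for down in range(4):
    state = PartitionState(replication_factor=3, min_insync_replicas=2, zones_down=down)
    print(f"zones down={down}: write={state.can_write()} read={state.can_read()}")
```

With a replication factor of 3 and min.insync.replicas set to 2, the model accepts writes with zero or one zone down, and still serves reads with two zones down.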
Spreading your nodes across several availability zones enables seamless maintenance.
Disasters can sometimes hit an availability zone but they can also bring down a whole region.
If you’re building a mission-critical system, you might want it to be resilient across multiple cloud regions.
The preferred way to achieve that is to have a separate Confluent Cloud cluster in different regions.
This additional cluster needs to be a perfect replica of your data, schemas and consumer offsets.
Cluster Linking is a technology created by Confluent to replicate the data in your topics across clusters.
It’s “geo-replication” re-engineered for the cloud. You set it up between a source and a destination cluster.
It copies data asynchronously and creates a perfect mirror. We call it byte-for-byte replication.
It’s built right into the broker, so there’s no need to run and manage an extra system such as Kafka Connect, as you would with Apache Kafka. It scales automatically as brokers are added or removed.
With Cluster Linking, an Apache Kafka cluster can be a source but not a destination.
And best of all, it can replicate consumer offsets too, so your applications can pick up right where they left off when you point them to the destination cluster.
Now, Schema Linking is yet another technology created by Confluent to replicate the schemas across two clusters and keep them in sync.
It’s a really good complement to Cluster Linking but it also can be used independently.
It can also do bidirectional syncing.
If you wish, it can copy schemas from a specific Schema Context, which serves as a namespacing mechanism within the Schema Registry.
Finally, Schema Linking supports scenarios such as aggregation, backup, and migration of schemas.
The active/passive pattern is the go-to strategy for cross-region Disaster Recovery with Confluent Cloud because it’s easy to understand and to set up.
And also because it supports a fast failover with minimal data loss. The key benefit is that it maintains message ordering.
The pattern is also straightforward to implement and execute from a DevOps perspective.
This disaster recovery strategy works best when the Kafka clients and the cluster are in the same region, which is very much recommended for latency anyway.
The active/passive pattern is especially useful for use cases that 'fail forward', meaning they recover by keeping the processing on the DR cluster indefinitely instead of going back to the original primary cluster.
It is also suitable for scenarios that necessitate failing back to the original region within several days or weeks following the outage.
In summary, the active/passive architecture offers a practical, easy-to-implement solution for disaster recovery with Confluent Cloud.
Let’s see how it works.
Here’s what your steady state looks like.
On the left is your primary cluster, and on the right, the DR cluster.
Between the two, a cluster link is in operation, copying the data and the consumer offsets.
There’s also a Schema Link which copies the schemas from the primary to DR.
Note that the Kafka Clients don’t have the broker endpoints or the API credentials hardcoded.
Instead they’ve been injected via a service registry and a secrets manager.
Another option is to use Infrastructure as Code to quickly move the applications and point them to the right cluster. We’ll see how it’s done later in the “automate the road to production” module.
After detecting that a disaster has occurred, an operator would run the failover command, confluent kafka mirror failover, or call the equivalent REST API, to convert the mirror topics in the DR region into normal topics.
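Since it’s a good idea to have this step scripted ahead of time, here is a hedged Python sketch that only assembles the failover command line, without running it. The link name and topic names are hypothetical, and the --link flag reflects my understanding of the Confluent CLI; please check the exact syntax against the CLI reference for your version before relying on it.

```python
import shlex


def build_failover_command(link_name: str, mirror_topics: list[str]) -> list[str]:
    """Assemble (but do not run) the CLI call that fails over mirror topics.

    Assumption: the CLI takes the mirror topic names as positional arguments
    and the cluster link name via --link; verify this against your CLI version.
    """
    if not mirror_topics:
        raise ValueError("at least one mirror topic is required")
    return ["confluent", "kafka", "mirror", "failover", *mirror_topics, "--link", link_name]


# Hypothetical link and topic names, for illustration only.
command = build_failover_command("dr-link", ["orders", "payments"])
print(shlex.join(command))
```

An operator could review the printed command and then run it by hand, or pass the list to subprocess.run once the script has been tested in a drill.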
When the primary region recovers, the cluster link will not write data to the DR region anymore since the failover command has been executed.
Upon failover, the DR Schema Registry will serve schemas for the DR cluster and applications.
You will need to change the DR Schema Registry from IMPORT mode to READWRITE mode so that new schemas can be registered as needed.
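The mode change can be scripted too. Here is a minimal sketch that builds the HTTP request for the Schema Registry mode endpoint using only the Python standard library. The URL, API key and secret are placeholders, and the sketch assumes the mode is set globally; with Schema Linking and contexts you may need to set it at a narrower scope, so treat this as a starting point rather than the definitive call.

```python
import base64
import json
from urllib.request import Request

VALID_MODES = {"IMPORT", "READONLY", "READWRITE"}


def build_mode_request(sr_url: str, api_key: str, api_secret: str,
                       mode: str = "READWRITE") -> Request:
    """Build (but do not send) a PUT /mode request for a Schema Registry."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return Request(
        url=f"{sr_url.rstrip('/')}/mode",
        data=json.dumps({"mode": mode}).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/vnd.schemaregistry.v1+json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder endpoint and credentials, for illustration only.
request = build_mode_request("https://dr-schema-registry.example.com", "SR_KEY", "SR_SECRET")
print(request.get_method(), request.full_url, request.data.decode())
```

Sending it is a matter of passing the request to urllib.request.urlopen once real credentials have been injected.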
Finally, point the clients in the primary region to the DR region.
Since the cluster link is configured to replicate consumer offsets, the consumer groups will pick up where they left off.
Remember, the replication is asynchronous so the consumers must be able to tolerate a few duplicates if some consumer offsets were not copied over when the disaster hit.
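To see where those duplicates come from, here is a toy Python model. It assumes the consumer group had committed up to some offset on the primary cluster, while the last offset synced to the DR cluster lagged behind. The function name and the numbers are purely illustrative.

```python
def offsets_reprocessed(committed_on_primary: int, synced_to_dr: int) -> range:
    """Offsets a consumer group reads a second time after failing over.

    The group resumes on the DR cluster from the last synced offset, so every
    record between that offset and the offset it had really committed on the
    primary cluster is processed again.
    """
    start = min(synced_to_dr, committed_on_primary)
    return range(start, committed_on_primary)


# Illustrative numbers: the group had committed offset 1200 on the primary,
# but only offset 1150 had been synced to DR when the disaster hit.
duplicates = offsets_reprocessed(committed_on_primary=1200, synced_to_dr=1150)
print(f"{len(duplicates)} records will be seen twice, starting at offset {duplicates.start}")
```

The smaller the offset sync lag, the fewer duplicates there are, but with asynchronous replication it cannot be guaranteed to be zero.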
When this step is complete, clients can read from and write to the topics in the DR region. Your data streaming applications are restored, and your business has successfully avoided this outage!
When the primary region recovers from the disaster, there are two options for how to proceed: Fail Forward or Fail Back.
The preferred and most cloud native strategy is to fail forward, also known as fail-and-stay, because public cloud regions are interchangeable, and failing back introduces risk and operational toil.
In Fail Forward, operations continue permanently in the DR region, which becomes the new “primary” region.
In a fail forward scenario, you can choose to set up a brand new DR region, but you can also reuse the original primary Confluent Cloud environment and clusters after the region recovers.
You just have to delete the original Schema Registry and topics, create a new DR Schema Registry, then create a cluster link and a schema link in the opposite direction.
Now, in certain cases there may be a latency or cost benefit from operating in a specific cloud region.
If so, then the operations can be failed back to the original region.
However, failing back introduces extra engineering effort, risk and cost.
It should be avoided if there is no business value to be gained.
If you need to fail back to the primary region, delete the topics in the primary cluster and create a cluster link from the DR cluster to the primary cluster.
If you registered new schemas during the outage, then the primary Schema Registry in the original environment needs to be caught up with new schemas that were registered to the DR Schema Registry.
There’s a third option which is the failback after a planned failover, an upcoming feature of Confluent Cloud due in Q3 2023.
In highly regulated industries, companies have to be able to do both unplanned and planned failovers.
A planned failover is essentially one where they go through the exercise of cutting over all the client applications from the primary DC to the secondary DC for audit and compliance reasons.
Once this exercise is over, customers need to have a way to fail back to their original primary cluster so they can go back to their original configuration.
With this approach you no longer have to delete the original topics as you do in the failback scenario, which saves money and engineering time.
Let’s have a look at a few tips for Disaster Recovery.
Automation and preparation are key to implementing a successful Business Continuity strategy.
Having a bunch of people with different skills who can champion the setup, the failover process and the testing is really important.
You should have scripts ready to execute the failover and failback process, but run them manually instead of relying on some metrics which could trigger false positives.
Also, test your disaster recovery process thoroughly by simulating failures in the primary cluster.
Measure the time it takes to resume service and run retrospectives to try to improve on that.
Using a fail forward strategy will reduce risk, engineering effort and costs.
Let’s now see a few tips for your applications when implementing a disaster recovery process.
First, try to make your Kafka consumers idempotent.
Doing this will ensure that even in scenarios where events are duplicated, the system remains consistent.
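As a rough illustration of the idea, here is a minimal Python sketch of an idempotent handler that skips events it has already processed. It assumes every event carries a unique ID, which is not something Kafka provides for you. The class and field names are hypothetical, and a real application would keep the set of processed IDs in a durable store, updated together with the side effect.

```python
class IdempotentHandler:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        self._seen_ids: set[str] = set()  # in-memory only, for illustration
        self.balance = 0

    def handle(self, event_id: str, amount: int) -> bool:
        """Apply the event and return True, or return False for a duplicate."""
        if event_id in self._seen_ids:
            return False
        self._seen_ids.add(event_id)
        self.balance += amount
        return True


handler = IdempotentHandler()
# The second "evt-2" simulates a record delivered again after a failover.
for event_id, amount in [("evt-1", 10), ("evt-2", 5), ("evt-2", 5)]:
    applied = handler.handle(event_id, amount)
    print(f"{event_id}: {'applied' if applied else 'skipped duplicate'}")
print("balance:", handler.balance)
```

The duplicate is ignored, so the balance ends at 15 rather than 20.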
We will see in the “Productionize Applications” module how to do that.
Due to the asynchronous nature of Cluster Linking, your producers and consumers will need to tolerate a small Recovery Point Objective, which means potentially losing a few messages.
If you want zero-RPO, you should use a multi-zone Confluent Cloud cluster instead.
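To put a number on that Recovery Point Objective, here is a small sketch that compares the end offsets of a source topic with those of its mirror, partition by partition. The offsets are made-up inputs; in practice they would come from your monitoring, and the function name is only illustrative.

```python
def unreplicated_messages(source_end_offsets: dict[int, int],
                          mirror_end_offsets: dict[int, int]) -> dict[int, int]:
    """Per partition, how many source records have not reached the mirror yet."""
    return {
        partition: max(end - mirror_end_offsets.get(partition, 0), 0)
        for partition, end in source_end_offsets.items()
    }


# Made-up end offsets at the moment the primary region is lost.
source = {0: 5000, 1: 4800, 2: 5100}
mirror = {0: 4990, 1: 4800, 2: 5075}

lag = unreplicated_messages(source, mirror)
print(lag)
print("records at risk:", sum(lag.values()))
```

Whatever is still in flight when the disaster hits is what you stand to lose, which is why this lag is worth watching during normal operation.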
Due to the asynchronous replication, your clients must also be able to tolerate a few duplicate messages.
You will learn in the “Productionize Applications” module how to deal with those duplicate messages.
It’s a good practice to inject at runtime both the bootstrap server and API keys.
You can use service discovery tools like Kubernetes or Consul for the bootstrap server and Vault or Secrets Manager for the keys.
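Here is a minimal Python sketch of what that injection can look like on the client side, with the settings read from the environment at startup. The environment variable names are hypothetical. The configuration keys are the ones commonly used with librdkafka-based clients such as confluent-kafka-python; other clients use different property names.

```python
import os
from typing import Mapping


def kafka_client_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build client settings from values injected at runtime.

    Nothing is hardcoded: repointing the application at the DR cluster only
    requires changing what the environment (or secrets manager) provides.
    """
    return {
        "bootstrap.servers": env["KAFKA_BOOTSTRAP_SERVERS"],  # hypothetical variable name
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": env["KAFKA_API_KEY"],      # hypothetical variable name
        "sasl.password": env["KAFKA_API_SECRET"],   # hypothetical variable name
    }


# Example with placeholder values standing in for the injected environment.
injected = {
    "KAFKA_BOOTSTRAP_SERVERS": "dr-cluster.example.com:9092",
    "KAFKA_API_KEY": "DR_KEY",
    "KAFKA_API_SECRET": "DR_SECRET",
}
print(kafka_client_config(injected)["bootstrap.servers"])
```

After a failover, the same application binary can be restarted with the DR values and nothing in the code needs to change.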
Whenever a disaster occurs, don’t forget to restart, in the DR region, all the applications that were connected to the failed region.
For ksqlDB and Kafka Streams applications, you should only mirror input topics, not internal topics like changelog or repartition topics.
The reason is that changelogs and output topics may be out of sync with each other since they are replicated asynchronously.
If you need to get the system back on its feet as quickly as possible, you should run a standby Kafka Streams or ksqlDB application in the DR cluster, reading from the mirrored input topic.
The alternative is to re-run the application against the DR cluster after failover, and reprocess to rebuild state, but this is a much slower recovery process as rebuilding state can take time.
Check out the Cluster Linking Disaster Recovery and failover documentation on docs.confluent.io if you want to know more.
If you aren’t already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources.