High Availability and Disaster Recovery are some of the most important, but often overlooked, aspects of setting up and maintaining a modern data architecture. Cluster Linking is one of the easiest ways to set up and maintain a disaster recovery solution. This module walks through some of the benefits and considerations, as well as the steps needed to set up and configure your Disaster Recovery cluster.
High Availability and Disaster Recovery are among the most important, but often overlooked, aspects of setting up and maintaining a modern data architecture.
Everyone agrees that having a plan is important, but with all the different systems involved, both on-premises and in the cloud, it can be difficult, sometimes nearly impossible, to capture every system you care about and that should be included.
In a disaster situation, the simpler or more automated the plan, the more likely it is to be followed.
While it isn’t covered in this course, if you are looking for High Availability within a single region, you should consider a Confluent Cloud multi-zone cluster.
A multi-zone cluster can tolerate availability zone and node failures and comes with a 99.99% uptime SLA.
Cluster Linking comes into play when you want to create a multi-region Disaster Recovery plan.
But before we get into the details of using Cluster Linking for High Availability and Disaster Recovery, we need to talk about some of the high-level steps to creating a robust Disaster Recovery plan.
Step 1 is to determine whether an outage has occurred. A Disaster Recovery plan can only be executed if a disaster is known to have taken place.
This includes steps to verify that a failure isn’t a false positive.
Additionally, you’ll want to decide how your plan is executed and who will be monitoring the process.
While you can set up automation to monitor for an outage and trigger the failover, doing so can often expose you to false positives.
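To make that concrete, here is a minimal sketch (not part of the course material) of a health probe built with the confluent-kafka Python client. It only raises an alert for a human operator after several consecutive failed metadata fetches, rather than triggering the failover automatically; the bootstrap address, credentials, and thresholds are placeholders.

```python
# Minimal health-probe sketch: repeatedly checks whether the primary cluster's
# metadata can be fetched, and alerts a human instead of failing over
# automatically, to guard against transient blips and false positives.
import time
from confluent_kafka.admin import AdminClient

PRIMARY = {
    "bootstrap.servers": "pkc-primary.example.confluent.cloud:9092",  # placeholder
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "PRIMARY_API_KEY",     # placeholder
    "sasl.password": "PRIMARY_API_SECRET",  # placeholder
}

CONSECUTIVE_FAILURES_BEFORE_ALERT = 3  # placeholder threshold

def cluster_reachable(conf, timeout_s=10):
    try:
        AdminClient(conf).list_topics(timeout=timeout_s)
        return True
    except Exception:
        return False

failures = 0
while True:
    if cluster_reachable(PRIMARY):
        failures = 0
    else:
        failures += 1
        if failures >= CONSECUTIVE_FAILURES_BEFORE_ALERT:
            # Page the on-call operator; a human verifies the outage and
            # decides whether to execute the Disaster Recovery plan.
            print("ALERT: primary cluster unreachable; possible outage")
    time.sleep(30)
```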
Step 2 is determining and assigning an RPO and RTO for your data.
RPO stands for Recovery Point Objective while RTO stands for Recovery Time Objective.
RPO is primarily concerned with the amount of data that you can afford to lose when a disaster happens.
Whereas RTO is primarily concerned with the acceptable time a system or service is down after a disaster.
In other words, how much data can be lost during a failure? In order to have zero RPO, synchronous replication is required.
Whereas RTO is concerned with how much time can go by until your applications fail over.
In other words, how long can a failover take? In order to have zero RTO, seamless client failover is required.
As you determine the RPO and RTO remember that each use case or scenario can have a different set of requirements.
When thinking about RPO, it is important to consider the difference between data lost due to an outage and permanent data loss.
RPO refers to the set of data that has not yet been replicated to the Disaster Recovery cluster.
This data will be missing during the outage but is not necessarily lost forever.
Permanent data loss will only occur when the outage causes irrevocable damage to all three availability zones in the region, such as a large-scale natural disaster.
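As a rough, illustrative calculation (the numbers below are made up), your RPO exposure at any moment is approximately your produce throughput multiplied by the current mirroring lag:

```python
# Back-of-the-envelope RPO sketch: data not yet replicated to the
# Disaster Recovery cluster ~= produce rate x mirroring lag.
produce_rate_mb_per_s = 20   # illustrative average produce throughput
mirroring_lag_s = 5          # illustrative cluster link mirroring lag

data_at_risk_mb = produce_rate_mb_per_s * mirroring_lag_s
print(f"~{data_at_risk_mb} MB may be missing on the DR cluster at failover time")
# That data is unavailable during the outage, but it is only lost permanently
# if the source region itself suffers unrecoverable damage.
```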
Third, are you running an Active/Passive or Active/Active architecture? Active/Passive refers to having your clients produce to one cluster, the primary cluster, and then having the data replicated to another cluster, often referred to as the Disaster Recovery cluster.
Active/Active has both clusters configured with producers and consumers, and therefore both are considered primary clusters.
The last step is making sure that you’ve configured your clients to fail over to the other cluster so they can resume activity.
Once an outage has occurred you need a way for your clients to switch over to your disaster recovery cluster.
To do this successfully there are three steps that need to take place.
Step 1: Configure your clients to connect and dynamically pull the bootstrap server from a Service Discovery tool and the necessary credentials from a key manager.
This way every time your client is restarted, or you need to change your configuration, the client will automatically pull the correct bootstrap server and credentials to log in.
Step 2: Update the bootstrap configuration to point to the Disaster Recovery cluster.
Step 3: Reboot your clients once an outage has occurred.
As long as the client keeps running, it will continue to use the bootstrap server and credentials it obtained before the disaster happened.
When you restart your clients, they will pull the new information and be able to connect to your Disaster Recovery cluster.
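Here is a minimal sketch of that pattern using the confluent-kafka Python producer. Environment variables stand in for the service-discovery tool and key manager, and the variable names and topic are just examples; once the bootstrap entry is repointed at the Disaster Recovery cluster, a restart is all the client needs.

```python
# Sketch of the failover pattern described above: the bootstrap server and
# credentials are resolved at startup from an external store (environment
# variables stand in for service discovery and a key manager here), so a
# restart after the bootstrap entry is repointed connects the client to the
# Disaster Recovery cluster.
import os
from confluent_kafka import Producer

def load_kafka_config():
    # In a real deployment these lookups would hit your service-discovery
    # tool and secrets manager instead of the process environment.
    return {
        "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],  # repointed to DR on failover
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": os.environ["KAFKA_API_KEY"],
        "sasl.password": os.environ["KAFKA_API_SECRET"],
    }

producer = Producer(load_kafka_config())
producer.produce("orders", key="order-42", value="example payload")  # example topic
producer.flush()
```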
With Cluster Linking’s offset sync we naturally get a very low RTO, which means your consumers can fail over and restart very close to the point where they left off.
However, this is an asynchronous process.
This means the data is committed, but the offset might not have been committed before the disaster occurred.
These offsets are written to the destination cluster at an interval controlled by the consumer.offset.sync.ms parameter, which can be configured as low as 1 second.
Because of this, you could see a discrepancy between your messages and your committed offsets when your consumers connect to the Disaster Recovery cluster.
Keep in mind that if you configure the offset sync interval lower than the default of 30 seconds, it will consume more bandwidth and throughput during regular operation.
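One way to build that tolerance into a consumer is sketched below with the confluent-kafka Python client. The group ID, topic, and deduplication approach are illustrative: the idea is that a few already-processed records may be redelivered after failover, so processing should be idempotent, and auto.offset.reset covers partitions whose offsets were never synced.

```python
# Sketch of a consumer that tolerates the small offset discrepancy described
# above: after failover it may re-receive a few already-processed records, so
# processing is deduplicated by record key, and auto.offset.reset handles
# partitions with no synced offset on the Disaster Recovery cluster.
import os
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": os.environ["KAFKA_API_KEY"],
    "sasl.password": os.environ["KAFKA_API_SECRET"],
    "group.id": "orders-processor",     # example group
    "auto.offset.reset": "earliest",    # used when no synced offset exists
})
consumer.subscribe(["orders"])          # example topic

seen_keys = set()  # stand-in for a durable idempotency store
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    key = msg.key()
    if key in seen_keys:
        continue  # duplicate delivered after failover; skip it
    # ... process the record ...
    seen_keys.add(key)
```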
Last, your consumers and producers need to be tolerant of a small RPO.
The whole point of a disaster recovery cluster is to have the data replicated from the source.
However, Cluster Linking is an asynchronous process.
When producers produce messages to the source cluster, they get an acknowledgment or an “ack” from the source cluster independent of when, and often before, the cluster link replicates those messages to the Disaster Recovery cluster.
Therefore, when an outage occurs and a failover is triggered, there may be a small number of messages that have not been replicated.
Producer and consumer applications must be tolerant of this.
You can monitor your exposure to this via the mirroring lag in the Metrics API, CLI, REST API, and Confluent Cloud Console.
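In addition to watching mirroring lag, the sketch below shows one producer-side way to keep track of that exposure: a delivery callback records the last offset the source cluster acknowledged per partition, which you could compare against the Disaster Recovery cluster after a failover. The configuration values and topic name are placeholders.

```python
# Sketch of producer-side RPO awareness: a delivery callback records the last
# offset acknowledged by the source cluster per partition, giving you a
# reference point for what may not have been mirrored when a failover occurs.
import os
from confluent_kafka import Producer

last_acked = {}  # (topic, partition) -> last offset acknowledged by the source

def on_delivery(err, msg):
    if err is None:
        last_acked[(msg.topic(), msg.partition())] = msg.offset()

producer = Producer({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": os.environ["KAFKA_API_KEY"],
    "sasl.password": os.environ["KAFKA_API_SECRET"],
    "acks": "all",
})

producer.produce("orders", key="order-42", value="example payload",
                 on_delivery=on_delivery)
producer.flush()
print(last_acked)
```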
Now we are ready to set up and configure our Disaster Recovery cluster.
Since we are going to be using Cluster Linking the setup is very straightforward.
There are four steps in setting up a Disaster Recovery solution using Cluster Linking.
First, you need to have or create a Dedicated Confluent Cloud cluster with public internet networking in a region different from that of the source cluster.
Second, you need to create a cluster link from the source cluster.
This link needs to have consumer offset sync enabled and ACL sync enabled.
If you plan on failing over only a subset of your consumer groups, you’ll need to use a filter to select just those consumer group names; otherwise, you’ll want to sync everything.
The same goes for your ACLs.
If you plan to fail over only a subset of the Kafka clients to the Disaster Recovery cluster, then use a filter to select only those clients.
Otherwise, sync all ACLs.
Now you can create a mirror topic on the Disaster Recovery cluster for each of the primary cluster’s topics.
If you only want Disaster Recovery for a subset of the topics, then only create a mirror topic for that subset.
The last step is enabling auto-create mirror topics, which will automatically create a mirror topic on the Disaster Recovery cluster as new topics are created on the source.
If you only need Disaster Recovery for a subset of topics, you can scope auto-create mirror topics by topic prefixes or specific topic names.
That’s it. You now have a disaster recovery cluster that mirrors your main cluster.
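For reference, the link configuration those steps describe might be assembled like this. The key names follow the Cluster Linking documentation at the time of writing, and both the keys and the exact Confluent CLI commands should be verified against the current docs before use.

```python
# Sketch of a cluster link configuration written out as a properties file to
# pass when creating the link on the destination (Disaster Recovery) cluster,
# for example with the Confluent CLI or REST API.
link_config = {
    "consumer.offset.sync.enable": "true",       # step 2: sync consumer offsets
    "consumer.offset.sync.ms": "25000",          # optional: sync more often than the 30 s default
    "acl.sync.enable": "true",                   # step 2: sync ACLs
    "auto.create.mirror.topics.enable": "true",  # step 4: mirror new topics automatically
    # To scope syncing to a subset, add consumer group, ACL, and topic filters
    # here using the JSON formats shown in the documentation.
}

with open("dr-link.properties", "w") as f:
    for key, value in link_config.items():
        f.write(f"{key}={value}\n")
```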
When deleting a cluster link, first check that all mirror topics are in the STOPPED state.
When a cluster link is deleted, so is the history of any STOPPED topics.
If you need the Last Source Fetch Offset or the Status Time of your promoted or failed-over mirror topics, make sure to save those before you delete the cluster link.
You cannot delete a cluster link that still has mirror topics on it; the delete operation will fail. If you are using Confluent for Kubernetes and you delete your cluster link resource, any mirror topics still attached to that cluster link will be forcibly converted to regular topics via the failover API.
Use the link on the screen, or in the course guide below to find out more.
While there isn’t a walkthrough in this course on creating and setting up a Disaster Recovery cluster, I highly recommend you take a look at the Confluent documentation. It covers the same information as this module and also includes steps for monitoring your Disaster Recovery cluster, along with some failover considerations.
You can also follow the tutorial, which will walk you through setting up your Disaster Recovery cluster and failing over to it.
As was mentioned at the beginning of this module, High Availability and Disaster Recovery are among the most important aspects of maintaining your modern data architecture.
Cluster Linking provides one of the easiest methods of disaster recovery to configure and maintain, allowing you and your team to focus on other areas of delivering a robust and performant Kafka cluster.
If you aren’t already on Confluent Developer, head there now using the link in the video description to access the rest of this course, the hands-on exercises, and additional resources.