In Apache Kafka®, replication is how data remains durable and fault-tolerant. Each partition in a topic is copied across multiple brokers to protect against data loss if a broker fails. This is controlled by a replication factor, which defines the number of copies (replicas) Kafka maintains for each partition. For example, if the replication factor is set to three, Kafka will store three copies of each partition across different brokers.
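For example, here is a minimal sketch of creating a topic with a replication factor of three using Kafka's Java AdminClient; the topic name, partition count, and broker address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, each replicated on 3 brokers (1 leader + 2 followers)
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```

The cluster needs at least three brokers for this to succeed, since each replica of a partition must live on a different broker.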
Among these replicas, one is designated as the leader, while the others are followers. By default, all writes and reads are directed to the leader. The followers continuously sync data from the leader, ensuring they stay up to date. If the leader's broker fails, Kafka automatically elects a new leader from the remaining followers, keeping the system running without data loss. The cluster also restores the number of replicas by creating new copies as needed.
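You can see the leader and follower assignments for each partition by describing the topic. A rough sketch, again using the hypothetical orders topic:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ShowReplicaAssignments {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("orders"))
                    .all().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader():   the replica serving writes (and, by default, reads)
                // replicas(): all copies of the partition, leader included
                // isr():      the "in-sync replicas" currently caught up with the leader
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```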
This replication mechanism guarantees fault tolerance and high availability. Even if a broker crashes, the data remains accessible from its replicas. Kafka also supports follower reads for improved latency: you can configure clients to fetch from the nearest replica when that reduces network delay.
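Here is a minimal sketch of the consumer side of follower reads, assuming the brokers have been configured for rack-aware replica selection; the rack labels (availability zone names) are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class NearestReplicaConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Tell the cluster which "rack" (e.g. cloud availability zone) this client is in,
        // so fetches can be served by the nearest in-sync replica instead of the leader.
        // Brokers must opt in on their side, e.g. with broker.rack set per broker and
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector.
        props.put("client.rack", "us-east-1a"); // placeholder zone

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.value()));
        }
    }
}
```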
While replication is mostly managed for you in cloud-native services like Confluent Cloud, understanding how it works is crucial if you're running your own Kafka cluster. It’s a cornerstone of Kafka's reliability, enabling seamless failover, load balancing, and strong data durability—all built into the platform.
This is the introduction to replication in Apache Kafka. It would be no good at all if we stored each partition on only one broker. Whether these are bare-metal servers or instances in the cloud, disk and server failures can and will happen. You can't count on these things to live forever. And so we need to copy partition data to several other brokers to keep it safe.
Now, that copying Kafka does is configurable. You have to choose the replication factor, which here is three. As you can see, each partition is going to have three copies in total. You see how one of them is a little darker and the others are sort of ghosted out? The darker one is the lead replica; the others are the followers. So there's always one leader, and with a replication factor of n, the other n minus one are followers. And when an application is writing or reading, it's almost always writing to or reading from that leader.
And what the followers are doing is keeping up as new messages are written into the lead replica. The followers are pulling them off, trying to make sure they have everything replicated as quickly as possible. Now, thanks to replication, if one broker fails, you still have two replicas of each of the lost partitions. And if the broker that failed was hosting the leader of one of those partitions, the cluster will elect a new leader for that partition from among the remaining replicas.
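This is also where producer durability settings come in. As a rough sketch, assuming the same hypothetical orders topic and a local broker address, a producer can set acks=all so the leader only acknowledges a write once the in-sync followers have replicated it (often paired with a topic-level min.insync.replicas of 2):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=all: the leader waits for the in-sync replicas to have the record
        // before acknowledging, so an acknowledged write survives a leader failure.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.flush();
        }
    }
}
```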
And of course, we need to make new copies now. Something is going to have to make sure that we still have three replicas of each partition, so there's more cleanup to do, or the failed broker gets restored. We're going to restore this to health in some way, but for our discussion of replication, the point is that we elect a new leader and we haven't lost any data. We're still up and available and moving along.
As I said, when we're writing data, we write to the leader. That's always the case: if you're writing, you write to whatever the current leader is. And when you read, by default you're typically reading from those lead replicas as well. In fairly recent versions of Kafka (in the last couple of years, as of this recording), you can configure the client application that's reading to read from the nearest replica. If a certain broker has a closer, lower-latency network connection, you can say: look, I know some of these are going to be followers, but I want to read from these followers because it's faster. So if you're really pressed for performance, that's a thing you can do.
Replication is another one of those features that matters a lot if you're operating Kafka yourself, running the servers and managing the cluster. If you're using a cloud service, it's a thing that happens and you sort of don't need to know how or why. But it is, as a feature, a cornerstone of Kafka's reliability. This is how we ensure fault tolerance, load balancing, and basically healthy performance of a cluster at scale. And it's all built-in functionality. Super cool stuff.