Get Started Free
Jun Rao

Jun Rao

Co-Founder, Confluent (Presenter)

Cluster Scaling

cluster-scaling

Kafka is designed to be scalable. As our needs change we can scale out by adding brokers to a cluster or scale in by removing brokers. In either case, data needs to be redistributed across the brokers to maintain balance.

Unbalanced Data Distribution

unbalanced-data-distribution

We can also end up needing to rebalance data across brokers due to certain topics or partitions being more heavily used than the average.

Automatic Cluster Scaling with Confluent Cloud

automatic-cluster-scaling-with-confluent-cloud

When using a fully managed service, such as Confluent Cloud, this issue goes away. With Confluent Cloud we don’t need to worry about balancing brokers. In fact, we don’t need to worry about brokers at all. We just request the capacity we need for the topics and partitions we are using and the rest is handled for us. Pretty sweet, huh?

Kafka Reassign Partitions

kafka-reassign-partitions

For self-managed Kafka clusters, we can use the command line utility, kafka-reassign-partitions.sh. First we create a JSON document that lays out how we want our topics and partitions distributed. Then we can pass that document to this command line tool, and if we include the --execute flag, it will go to work moving things around to arrive at our desired state. We can also leave the --execute flag off for a dry run.

This works well, but it is a very manual process and for a large cluster, building that JSON document could be quite the challenge.

Confluent Auto Data Balancer

confluent-auto-data-balancer

Confluent’s Auto Data Balancer takes this up a notch by generating the redistribution plan for us, based on cluster metrics. We use the confluent-rebalancer command to do these tasks, so it still takes some manual intervention, but it does a lot of the work for us.

Confluent Self-Balancing Clusters (SBC)

confluent-self-balancing-clusters

As cool as Auto Data Balancer is, it pales in comparison to Confluent Self-Balancing Clusters! As the name implies, self-balancing clusters maintain a balanced data distribution without us doing a thing. Here are some specific benefits:

  1. Cluster balance is monitored continuously and rebalances are run as needed.
  2. Broker failure conditions are detected and addressed automatically.
  3. No additional tools to run, it’s all built in.
  4. Works with Confluent Control Center and offers a convenient REST API.
  5. Much faster rebalancing.

SBC Metrics Collection and Processing

sbc-metrics-collection-and-processing

When SBC is enabled, each broker will run a small agent which will collect metrics and write them to an internal topic called _confluent-telemetry-metrics. The controller node will aggregate these metrics, use them to generate a redistribution plan, execute that plan in the background, and expose relevant monitoring data to Control Center.

SBC Rebalance Triggers

sbc-rebalance-triggers

SBC has two options for triggering a rebalance. The first one is meant only for scaling and will trigger when adding or removing brokers. The other option is any uneven load. In this case, the rebalance will be based on the metrics that have been collected and will take into consideration things like disk and network usage, number of partitions and replicas per broker, and rack awareness. This is obviously the most foolproof method, but it will use more resources. SBC will throttle replication during a rebalance to minimize the impact to ongoing client workloads.

Fast Rebalancing with Tiered Storage

fast-rebalancing-wtih-tiered-storage

When combined with Tiered Storage, the rebalancing process is much faster and less resource intensive, since only the hotset data and remote store metadata need to be moved.

JBOD vs. RAID

jbod-vs-raid

Sometimes, we may need multiple disks for a broker. In this situation we need to choose between a collection of individual disks (JBOD) or RAID. There are a lot of factors to consider here and they will be different in every situation, but all things being equal RAID 10 is the recommended approach.

Use the promo code INTERNALS101 to get $25 of free Confluent Cloud usage

Disagree? If you believe that any of these rules do not necessarily support our goal of serving the Apache Kafka community, feel free to reach out to your direct community contact in the group or community@confluent.io

Be the first to get updates and new content

We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.