Apache Kafka® Performance

Benchmark testing and results for Apache Kafka’s performance on the latest hardware in the cloud

Apart from its other technical merits, Apache Kafka® is known for its scalability and performance. Given differing production environments and workloads, many users like to run benchmarking tests for purposes such as optimizing for throughput or for capacity planning.

This page describes how to benchmark Kafka’s performance on the latest hardware in the cloud, in a repeatable and fully automated manner, and it documents the results from running these tests. The experiments focus on system throughput and system latency, as these are the primary performance metrics for event streaming systems in production. In particular, the throughput test measures how efficient Kafka is in utilizing the hardware, specifically the disks and the CPU. The latency test measures how close Kafka is to delivering real-time messaging including tail latencies of up to p99.9th percentile, a key requirement for real-time and mission-critical applications as well as microservices architectures.

Synopsis of Testing

We conducted benchmarking tests on Kafka and share the information here. The following sections first walk you through a summary of throughput and latency results, then describe the benchmarking framework, testbed, and the workloads. We will finish with a deeper explanation of the results using the various system and application metrics. All of these are open source, so curious readers can reproduce the results for themselves or dig deeper into the collected Prometheus metrics. It is always recommended to compare using one’s own workloads and setups, to understand how the results below translate to production deployments.

First, here is a summary of the conducted experiments.

Kafka Throughput Kafka Latency

Figures: Summary of results

Result
Peak Throughput605 MB/s
p99 Latency5 ms (200 MB/s load)

Benchmarking Setup

The benchmarks were performed using this benchmarking code. All tests deploy four worker instances to drive workload, three broker/server instances, a three-instance Apache ZooKeeper cluster for Kafka, and one instance that runs the monitoring setup (e.g., Prometheus). The network/storage-optimized class of Amazon EC2 instances was selected with enough CPU cores and network bandwidth to support disk I/O bound workloads. In the sections below, any changes made to the baseline configurations are called out.

Disks

The i3en.2xlarge instance type (with 8 vCores, 64 GB RAM, 2 x 2,500 GB NVMe SSDs) was selected for its high 25 Gbps network transfer limit that ensures that the test setup is not network bound. This means that the tests measure the respective maximum server performance measures, not simply how fast the network is. i3en.2xlarge instances support up to ~655 MB/s of write throughput across two disks, which is plenty to stress the servers. See the full instance type definition for details.

Kafka Performance: Disk IO

Figure 1. Establishing the maximum disk bandwidth of i3en.2xlarge instances across two disks, tested using the dd Linux command, to serve as a north star for throughput tests

The maximum disk bandwidth of the selected instances was established with the following dd Linux commands, which is a metric of the environment setup that we need for e.g. interpreting the benchmarking results:

Disk 1
dd if=/dev/zero of=/mnt/data-1/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 210.278 s, 327 MB/s

Disk 2
dd if=/dev/zero of=/mnt/data-2/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 209.594 s, 328 MB/s

OS tuning

Similar to what many system administrators do for Kafka production environments, we optimized several OS settings. Specifically, the OS was tuned for better latency performance using tuned-adm’s latency performance profile, which disables any dynamic tuning mechanisms for disk and network schedulers and uses the performance governor for CPU frequency tuning. It pegs the p-states at the highest possible frequency for each core, and it sets the I/O scheduler to the deadline to offer a predictable upper bound on disk request latency. Finally, it also tunes the power management quality of service (QoS) in the kernel for performance over power savings.

Throughput test

The first measurement was to determine the peak stable throughput that Kafka could achieve. Peak stable throughput is defined as the highest average producer throughput at which consumers can keep up without an ever-growing backlog.

The experiment was designed according to the following principles and expected guarantees:

  • Kafka was configured with 100 partitions across one topic.
  • Messages are replicated 3x for fault tolerance (see below for specific configs).
  • Batching was enabled to optimize for throughput. Up to 1 MB of data is batched for a maximum of 10 ms.

Kafka delivered high throughput, near-saturating the disk I/O available in the testbed. Given its design, each byte produced was written just once onto disk on a code path that has been optimized for almost a decade by thousands of organizations across the world.

Kafka Throughput

Figure 2. Peak stable throughput of Kafka: 100 topic partitions with 1 KB messages, using four producers and four consumers against a three-node Kafka cluster

Kafka was configured to use batch.size=1MB and linger.ms=10 for the producer to effectively batch writes sent to the brokers. In addition, acks=all was configured in the producer along with min.insync.replicas=2 to ensure every message was replicated to at least two brokers before acknowledging it back to the producer. Kafka was able to efficiently max out both the disks on each of the brokers—the ideal outcome for a storage system. See Kafka's driver configuration for details.

Kafka Producer/Consumer Throughput

Figure 3. The graph shows I/O utilization on Kafka brokers and the corresponding producer/consumer throughput (source: Prometheus node metrics). See raw results for details.

Latency test

Given the ever-growing popularity of stream processing and event-driven architectures, another key aspect of messaging systems is end-to-end latency for a message to traverse the pipeline from the producer through the system to the consumer. A useful experiment is therefore to measure the highest stable throughput that Kafka can sustain without showing any signs of over-utilization.

To optimize for latency, the producer configuration was configured to batch messages up to a maximum of 1 ms only (versus the 10 ms used for throughput tests), and to also leave Kafka at its default recommended configuration while ensuring high availability. Kafka was configured to use its default fsync settings (i.e., fsync off). Based on repeated runs, it was decided to measure Kafka’s latency at 200K messages/s or 200 MB/s, which is below the single disk throughput limit of 300 MB/s on this testbed.

Kafka Latency

Figure 4. End-to-end latency for Kafka, measured at 200K messages/s (1 KB message size). See the raw results for details. Note: Latency (ms)—lower is better.

Kafka consistently delivered low latencies. Since the experiment was deliberately set up so that consumers were always able to keep up with the producers, all of the reads were served off of the cache.

Much of Kafka’s performance can be attributed to a heavily optimized read implementation for consumers, built on efficient data organization, without any additional overheads like data skipping. Kafka deeply leverages the Linux page cache and zero-copy mechanism to avoid copying data into user space. Typically many systems, like databases, have built out application-level caches that give them more flexibility to support random read/write workloads. However, for a messaging system, relying on the page cache is a great choice because typical workloads do sequential read/writes. The Linux kernel has been optimized for many years to be smart about detecting these patterns and employ techniques like readahead to vastly improve read performance. Similarly, building on top of the page cache allows Kafka to employ sendfile-based network transfers that avoid additional data copies.

Where To Go From Here & Further Resources

If you are interested in running your own benchmarks in a repeatable and fully automated manner, the benchmarking code is publicly available and can be used to run the performance experiments described in the previous sections.

To learn more about Kafka’s performance, benchmarking, and tuning:

To learn more about other types of testing for your Kafka applications and the ecosystem of related tools, see Testing Apache Kafka®.

Be the first to get updates and new content

We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.