Now that you have a better idea of Kafka internals at a high level, let’s dive into some performance tuning! In this exercise, we’ll be producing events to a Kafka topic in a Confluent Cloud cluster, tweaking client settings, and observing the impact they have on event throughput and latency.
Before we get into the nitty-gritty, there are a couple of environment preparation steps that you’ll have to complete.
Complete the following steps to set up the environment used for this exercise. Prior to doing so, you will need to sign up for Confluent Cloud.
Let’s start by creating the cluster that we’ll use for this exercise.
Open URL https://confluent.cloud and log in to the Confluent Cloud console.
Navigate to the default environment.
NOTE: If the Get started with Confluent Cloud tutorial appears in the right side of the console, click Leave tutorial.
Create a basic cluster named perf-test-cluster.
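If you already have the Confluent CLI installed (installation is covered later in this setup), the cluster can also be created from a terminal. This is only a sketch; the cloud provider and region below are examples, so substitute whichever you prefer, and note that recent CLI versions create a Basic cluster by default:

confluent kafka cluster create perf-test-cluster \
--cloud aws \
--region us-west-2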
NOTE: If multiple people within your organization will be going through this exercise, append a unique string, e.g., <last name>, to the cluster name to prevent your exercise from conflicting with others.
Now that we have our cluster, let’s create the perf-test-topic topic that we will be producing events to.
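If you prefer the CLI here as well, the topic can be created with a command along these lines, where the cluster ID placeholder is whatever ID Confluent Cloud assigned to perf-test-cluster:

confluent kafka topic create perf-test-topic \
--cluster <cluster ID>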
We also need to create a client configuration file for the cluster that will be needed by the kafka-producer-perf-test command during the exercise.
Since this command uses an embedded Java producer to write events to the topic, we need to select the Java client type.
Here we see the client connection configs which we can copy using the provided button. As you can see, one of the parameters needs to be updated so that it includes a valid cluster API key and secret value. We can create these using the option that is conveniently available on this page.
As you can now see, clicking this option opens a new window that contains the new API key and secret.
After creating the API key and secret, notice the client connection configs have been updated to include these values. We can now copy these configs and use them to create the needed client configuration file on our Java client machine.
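As an aside, the same API key and secret can be created from a terminal instead of the console. A sketch of the equivalent Confluent CLI command, assuming your cluster’s resource ID, looks like this:

confluent api-key create --resource <cluster ID>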
We will now create the local client configuration file containing the Confluent Cloud connection settings.
The sample client configuration settings include properties that are only needed if the client uses Schema Registry, as well as a couple of other properties needed by earlier versions of the Java client. We won’t be using these properties, so we will delete them.
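After trimming those properties, the resulting java.config should look roughly like the following sketch. The endpoint, API key, and secret are placeholders; use the values you copied from the console:

bootstrap.servers=<your cluster endpoint>:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='<cluster API key>' password='<cluster API secret>';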
We will use kafka-producer-perf-test and the Confluent CLI during the exercise. Both are included when you download the Confluent Platform Community components.
Download Confluent Platform:
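For example, the Community tarball can be downloaded and extracted from a terminal. The version number below is illustrative only; substitute the current release, and adjust the destination directory to match your environment:

curl -O https://packages.confluent.io/archive/7.4/confluent-community-7.4.0.tar.gz
tar -xzf confluent-community-7.4.0.tar.gz -C /home/training/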
Run the following command to add the Confluent Platform bin directory to your path, then open a new terminal (or run source ~/.bashrc) so the change takes effect:
echo 'export PATH=$PATH:/home/training/confluent-<version>/bin/' >> ~/.bashrc
If you are a Confluent Platform customer, it's best to keep your version of the Confluent CLI which is packaged with your Confluent Platform distribution. Check the compatibility range before updating the CLI to the latest version. Otherwise, run the following command and complete the prompts that appear:
confluent update --major
This concludes the exercise setup.
We’ll be using the kafka-producer-perf-test script to test different producer configuration settings and see how they impact event throughput and latency. Before starting, let’s make sure the client configuration file that the performance test script uses to establish a connection with the cluster is available. This file contains both the cluster endpoint and the authentication settings, so it’s pretty important to have it set up properly. Let’s check out the config file and make sure all the right details are in it.
Run command:
cat java.config
You should see the cluster endpoint as well as your authentication settings. If your config file doesn’t quite look like what you saw in the exercise video, revisit the exercise setup instructions before continuing.
If everything looks good, we can now begin testing the producer client performance. To start off, we’ll just use the default producer configuration values. Two of the most important producer settings related to event throughput and latency are linger.ms and batch.size. The linger setting is the maximum amount of time that the producer will wait while adding events to a record batch before the batch is flushed. The batch size setting, of course, is the maximum size a batch can reach before it’s flushed. The defaults for these settings are 0 ms and 16384 bytes (16 KB), respectively. In the next few tests, we’ll alter the values of these settings and observe how they impact average throughput and latency.
A quick thing to note before we start running our tests is that we’ll be throttling the record output of the script to make it easier to observe throughput and latency variations.
With that, let’s kick off a test with the default settings.
You’ll notice a couple of key options in the performance test command. We set the record size to 1000 bytes and specify that the test run through 3000 records, throttled to a throughput of 200 records per second. But, of course, the most important option is --producer-props, where we set linger.ms and batch.size. We’ll use the --print-metrics flag to output metrics, limiting the output with grep.
Run command:
kafka-producer-perf-test \
--producer.config /home/training/java.config \
--throughput 200 \
--record-size 1000 \
--num-records 3000 \
--topic perf-test-topic \
--producer-props linger.ms=0 batch.size=16384 \
--print-metrics | grep \
"3000 records sent\|\
producer-metrics:outgoing-byte-rate\|\
producer-metrics:bufferpool-wait-ratio\|\
producer-metrics:record-queue-time-avg\|\
producer-metrics:request-latency-avg\|\
producer-metrics:batch-size-avg"
You should see output similar to the following:
| Metric | linger.ms=0, batch.size=16384 | linger.ms=100, batch.size=16384 | linger.ms=100, batch.size=300000 | linger.ms=1500, batch.size=300000 |
|---|---|---|---|---|
| batch-size-avg | 1215.030 | | | |
| bufferpool-wait-ratio | 0.000 | | | |
| outgoing-byte-rate | 75594.490 | | | |
| record-queue-time-avg | 4.229 | | | |
| request-latency-avg | 43.292 | | | |
The output you see could be a bit different than shown here, but you should see something similar. In this case, the average batch size was about 1215 bytes, much lower than the default batch.size of 16384. This is easily explained by the linger value, which was set to 0. Since linger was so low, each batch is flushed almost as soon as the first record is added. Doing a bit of math, since our record size is 1000 bytes, it appears that roughly one record (plus a small amount of batch and record overhead) is being added to each batch before linger triggers the flush.
Let’s quickly review the remaining output values. The bufferpool-wait-ratio value of 0 indicates that the producer never has to wait for buffer space to be freed by previously sent requests, meaning that the Confluent Cloud brokers are able to process requests as fast as we are sending them. Pretty cool!
The outgoing-byte-rate and request-latency-avg metrics show the throughput and latency results. And finally, record-queue-time-avg shows how long records remain in a batch before being flushed.
Now that we have a good understanding of the output metrics, let’s run the same test, but this time we’ll increase linger from 0 to 100 ms, giving records more time to be added to a batch before it’s flushed.
Run command:
kafka-producer-perf-test \
--producer.config /home/training/java.config \
--throughput 200 \
--record-size 1000 \
--num-records 3000 \
--topic perf-test-topic \
--producer-props linger.ms=100 batch.size=16384 \
--print-metrics | grep \
"3000 records sent\|\
producer-metrics:outgoing-byte-rate\|\
producer-metrics:bufferpool-wait-ratio\|\
producer-metrics:record-queue-time-avg\|\
producer-metrics:request-latency-avg\|\
producer-metrics:batch-size-avg"
You should see output similar to the following:
| Metric | linger.ms=0, batch.size=16384 | linger.ms=100, batch.size=16384 | linger.ms=100, batch.size=300000 | linger.ms=1500, batch.size=300000 |
|---|---|---|---|---|
| batch-size-avg | 1215.030 | 16164.670 | | |
| bufferpool-wait-ratio | 0.000 | 0.000 | | |
| outgoing-byte-rate | 75594.490 | 68866.596 | | |
| record-queue-time-avg | 4.229 | 95.457 | | |
| request-latency-avg | 43.292 | 53.649 | | |
Let’s analyze these results. With linger set to 100 ms, you’ll first notice that the average batch size is much higher; it’s just under the 16384-byte batch.size we specified in the properties. This indicates that batch size is now what’s triggering batches to be flushed. Notice also that throughput has decreased and latency has increased compared to the previous test. That’s not what we want.
Let’s rerun the test, increasing batch size to 300000.
Run command:
kafka-producer-perf-test \
--producer.config /home/training/java.config \
--throughput 200 \
--record-size 1000 \
--num-records 3000 \
--topic perf-test-topic \
--producer-props linger.ms=100 batch.size=300000 \
--print-metrics | grep \
"3000 records sent\|\
producer-metrics:outgoing-byte-rate\|\
producer-metrics:bufferpool-wait-ratio\|\
producer-metrics:record-queue-time-avg\|\
producer-metrics:request-latency-avg\|\
producer-metrics:batch-size-avg"
You should see output similar to the following:
| Metric | linger.ms=0, batch.size=16384 | linger.ms=100, batch.size=16384 | linger.ms=100, batch.size=300000 | linger.ms=1500, batch.size=300000 |
|---|---|---|---|---|
| batch-size-avg | 1215.030 | 16164.670 | 23720.164 | |
| bufferpool-wait-ratio | 0.000 | 0.000 | 0.000 | |
| outgoing-byte-rate | 75594.490 | 68866.596 | 68720.642 | |
| record-queue-time-avg | 4.229 | 95.457 | 109.070 | |
| request-latency-avg | 43.292 | 53.649 | 63.836 | |
Looking at the output results, we see that the average batch size has increased, but not by that much. This indicates that the linger setting isn’t allowing enough time for batches to reach the limit set by the batch size parameter. Throughput and latency are still moving in the wrong direction.
The next obvious thing to do is increase the linger time. Let’s see how this affects the output.
Run command:
kafka-producer-perf-test \
--producer.config /home/training/java.config \
--throughput 200 \
--record-size 1000 \
--num-records 3000 \
--topic perf-test-topic \
--producer-props linger.ms=1500 batch.size=300000 \
--print-metrics | grep \
"3000 records sent\|\
producer-metrics:outgoing-byte-rate\|\
producer-metrics:bufferpool-wait-ratio\|\
producer-metrics:record-queue-time-avg\|\
producer-metrics:request-latency-avg\|\
producer-metrics:batch-size-avg"
You should see output similar to the following:
| Metric | linger.ms=0, batch.size=16384 | linger.ms=100, batch.size=16384 | linger.ms=100, batch.size=300000 | linger.ms=1500, batch.size=300000 |
|---|---|---|---|---|
| batch-size-avg | 1215.030 | 16164.670 | 23720.164 | 275700.182 |
| bufferpool-wait-ratio | 0.000 | 0.000 | 0.000 | 0.000 |
| outgoing-byte-rate | 75594.490 | 68866.596 | 68720.642 | 68246.401 |
| record-queue-time-avg | 4.229 | 95.457 | 109.070 | 1406.273 |
| request-latency-avg | 43.292 | 53.649 | 63.836 | 226.091 |
With linger set to 1500 ms, we can see that the batch sizes are pretty close to the batch size limit. But we’re still losing ground in throughput and latency. It appears that the producer defaults for linger and batch size are actually the best choice for an application that’s producing 200 records per second of 1000 bytes each.
In this exercise, we got a pretty good feel for producer settings and saw how altering these configurations can affect throughput and latency. Although we only checked these settings for a specific workload, defined by the number of records and the record size, we encourage you to continue playing around with these parameters and see how the producer configs impact an application you’re working on. A couple of things you might try are:
- Remove the throttle by setting --throughput to -1.
- Alter --num-records and --record-size to reflect an application you’re currently working on.
Then observe the results, paying close attention to all of the metrics. An example of such a run is sketched below.
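For example, an unthrottled run with the default producer settings might look like the following; the record count and size here are purely illustrative, so substitute values that resemble your own application:

kafka-producer-perf-test \
--producer.config /home/training/java.config \
--throughput -1 \
--record-size 1000 \
--num-records 100000 \
--topic perf-test-topic \
--producer-props linger.ms=0 batch.size=16384 \
--print-metrics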
When you’re finished running this exercise on your own, you’ll need to tear down the exercise environment, deleting the Confluent Cloud cluster to prevent it from unnecessarily accruing cost and exhausting your promotional credits.
From a terminal window, log into the CLI using the --save flag so that your credentials are saved for use in later commands.
Run command:
confluent login --save
Enter email and password.
List the environments available to you.
Run command:
confluent environment list
With the environment ID in hand, you can now list the available clusters.
List the clusters in the environment and their IDs:
confluent kafka cluster list \
--environment <environment ID>
We used the performance test cluster in this exercise, so we’ll use that ID along with the environment ID to run the delete cluster command.
Delete the perf-test-cluster cluster:
confluent kafka cluster delete <perf-test-cluster ID> \
--environment <environment ID>
As a sanity check, we’ll list the environment clusters again to confirm that the performance test cluster no longer exists.
Confirm the perf-test-cluster cluster no longer exists in the environment:
confluent kafka cluster list \
--environment <environment ID>
The cluster is no longer listed, so the environment teardown is complete. And with that, we’ve successfully wrapped up our first Kafka internals exercise! You should now have a pretty good idea of how producer configuration settings can impact your application and how to fine-tune these settings to optimize throughput and latency.
I hope you found this interesting and that you'll join me in the following exercises as we continue to dive deeper into Kafka Internals. See ya there.