May 25, 2021 | Episode 160

Running Apache Kafka Efficiently on the Cloud ft. Adithya Chandra


Focused on optimizing Apache Kafka® performance with maximized efficiency, Confluent’s Product Infrastructure team has been actively exploring opportunities for scaling out Kafka clusters. They can now run Kafka workloads with half the typical memory usage while saving infrastructure costs, a change they have tested and safely rolled out across Confluent Cloud.

After spending seven years at Amazon Web Services (AWS) working on search services and Amazon Aurora as a software engineer, Adithya Chandra decided to apply his expertise in cluster management, load balancing, elasticity, and performance of search and storage clusters to the Confluent team.

Last year, Confluent shipped Tiered Storage, which moves eligible data from a Kafka broker to remote storage. As most of the data moves to remote storage, we can upgrade to better storage volumes backed by solid-state drives (SSDs). Compared to hard disk drives (HDDs), SSDs offer higher throughput and fast random IO, but they are more expensive per provisioned gigabyte. Given that SSDs handle random IO well and support higher throughput, Confluent started investigating whether it was possible to run Kafka with less RAM, which is much more expensive per gigabyte than SSD storage. Cloud instance types with the same CPU but half the memory were 20% cheaper.

In this episode, Adithya covers how to run Kafka more efficiently on Confluent Cloud and dives into the following:

  • Memory allocation on an instance running Kafka
  • What is a JVM heap? Why should it be sized? How much is enough? What are the downsides of a small heap?
  • Memory usage of Datadog, Kubernetes, and other processes, and allocating memory correctly
  • What is the ideal page cache size? What is a page cache used for? Are there any parameters that can be tuned? How does Kafka use the page cache?
  • Testing via the simulation of a variety of workloads using Trogdor
  • High-throughput, high-connection, and high-partition tests and their results
  • Available cloud hardware and finding the best fit, including choosing the number of instance types, migrating from one instance to another, and using nodepools to migrate brokers safely, one by one
  • What do you do when your preferred hardware is not available? Can you run hybrid Kafka clusters if the preferred instance is not widely available?
  • Building infrastructure that allows you to perform testing easily and that can support newer hardware faster (ARM processors, SSDs, etc.)
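
The memory allocation topics above can be made concrete with a small budgeting sketch. It is only an illustration: the instance sizes, heap size, and overhead figures below are hypothetical placeholders, not numbers from the episode or from Confluent Cloud. It assumes that whatever memory is not claimed by the JVM heap and by other processes (monitoring agents, Kubernetes components, the OS) is what remains for the page cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBudget:
    """Rough memory split on a broker instance (all values hypothetical)."""

    total_gib: float     # RAM on the instance
    jvm_heap_gib: float  # heap given to the Kafka broker JVM
    other_gib: float     # monitoring agents, Kubernetes components, OS overhead

    def page_cache_gib(self) -> float:
        """Memory left over for the OS page cache."""
        remaining = self.total_gib - self.jvm_heap_gib - self.other_gib
        if remaining < 0:
            raise ValueError("fixed allocations exceed instance memory")
        return remaining


# Two hypothetical instance types: same CPU, half the memory.
full = MemoryBudget(total_gib=64.0, jvm_heap_gib=6.0, other_gib=4.0)
half = MemoryBudget(total_gib=32.0, jvm_heap_gib=6.0, other_gib=4.0)

for name, budget in (("full", full), ("half", half)):
    print(f"{name}: {budget.page_cache_gib():.0f} GiB available for page cache")

# Fraction of the original page cache that survives the move.
ratio = half.page_cache_gib() / full.page_cache_gib()
print(f"page cache ratio: {ratio:.2f}")
```

Because the heap and the other processes take a roughly fixed amount of memory in this sketch, halving the instance memory shrinks the page cache by more than half, which is one reason sizing the heap and the other processes carefully matters on a smaller instance.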

Continue Listening

Episode 161 | June 8, 2021 | 32 min

Adopting OpenTelemetry in Confluent and Beyond ft. Xavier Léauté

Collecting internal, operational telemetry from Confluent Cloud services and thousands of clusters is no small feat. Traditionally, this data needs to be collected in multiple ways to satisfy all the different requirements. However, this sometimes leads to discrepancies between various systems. With OpenTelemetry, we can collect data in a vendor-agnostic way. Many vendors already integrate with OpenTelemetry, which gives us the flexibility to try out different observability solutions with minimal effort, without the need to rewrite applications or deploy new agents.

Episode 162 | June 10, 2021 | 9 min

Confluent Platform 6.2 | What’s New in This Release + Updates

Based on Apache Kafka® 2.8, Confluent Platform 6.2 introduces Health+, which offers intelligent alerting, cloud-based monitoring tools, and accelerated support so that you can get notified of potential issues before they manifest as critical problems that lead to downtime and business disruption.

Episode 163 | June 15, 2021 | 25 min

Boosting Security for Apache Kafka with Confluent Cloud Private Link ft. Dan LaMotte

Enabling private links on the cloud is increasingly important for security across networks and even the reliability of stream processing. Staff Software Engineer II Dan LaMotte and his team focus on making secure connections for customers to utilize Confluent Cloud. With the option of private links, you can now also build microservices that use new functionality that wasn’t available in the past. You no longer need to segment your workflow, thanks to completely secure connections between teams that are otherwise disconnected from one another.
