Apache Kafka® FAQs

Here are some of the questions that you may have about Apache Kafka and its surrounding ecosystem.

If you’ve got a question that isn’t answered here then please do ask the community.

Getting Started

Apache Kafka® is an open source, event streaming platform. It provides the ability to durably write and store streams of events and process them in real time or retrospectively. Kafka is a distributed system of servers and clients that provide reliable and scalable performance.

Learn more about what Kafka is in this free Kafka 101 training course.

To get started with Kafka check out the free Kafka 101 training course, join the community, try the quick start, and attend a meetup.

After that, explore all of the other free Apache Kafka training courses and resources on Confluent Developer, and check out the documentation.

Kafka's performance at scale can be attributed to the following design characteristics:

  • Writing to and reading from Kafka is linear and avoids random-access seek times
  • Brokers rely heavily on the filesystem page cache, so access to data is often from memory
  • Kafka uses zero copy for efficient internal passing of data
  • It can scale horizontally, allowing multiple instances of an application to read data in parallel
  • Batching and compression can be highly tuned to further improve performance based on specific application needs
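
For example, batching and compression are controlled by a handful of producer settings. A minimal sketch in Java (the values shown are illustrative starting points, not recommendations):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// Wait up to 20 ms to fill larger batches before sending.
props.put("linger.ms", "20");
// Allow batches of up to 64 KB per partition.
props.put("batch.size", "65536");
// Compress whole batches on the wire and on disk.
props.put("compression.type", "lz4");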

You can learn more about some of the benchmark tests for Kafka performance, as well as the design principles that make Kafka so fast.

Kafka is used widely for many purposes, including messaging, website activity tracking, log aggregation, stream processing, and event-driven microservices.

You can see a few of the many thousands of companies who use Kafka in this list.

As an event streaming platform, Kafka is a great fit when you want to build event-driven systems.

Kafka naturally provides an architecture in which components are decoupled using asynchronous messaging. This design reduces the amount of point-to-point connectivity between applications, which can be important in avoiding the infamous "big ball of mud" architecture that many companies end up with.

Kafka can scale to handle large volumes of data, and its broad ecosystem supports integration with many existing technologies. This makes Kafka a good foundation for analytical systems that need to provide low latency and accurate information.

Event streaming has been applied to a wide variety of use cases, enabling software components to reliably work together in real time.

Kafka has become a popular event streaming platform for several reasons, including its durability, speed, scalability, large ecosystem, and stream processing integrations.

Kafka is an event streaming system that may be a good fit anywhere you use a message bus, queuing system, or database. It excels at the real-time processing of data, so it may be an especially good match if all your data matters, but the latest data is particularly important. For example, instant messaging, order processing, warehouse notifications and transportation all need to store and process large amounts of data, and handling the latest data swiftly is essential. (See more example use cases.)

You can use Kafka with nearly any programming language, and there are step-by-step getting started guides for the most popular languages, as well as quick examples on this page.

For more on how Kafka works, see our Kafka 101 course, and to understand how event-driven systems work (and why they work so well), see our guide to Thinking in Events.

The Kafka broker is written in Java and Scala.

Client libraries are available in many languages and frameworks including Java, Python, .NET, Go, Node.js, C/C++, and Spring Boot. There is also a REST API for interacting with Kafka. Many other languages are supported through community-built libraries.

All technologies have tradeoffs, but in recent years Kafka has seen tremendous adoption as people move away from traditional messaging queues such as IBM MQ, RabbitMQ, and ActiveMQ. Common reasons why people move to Kafka are for its durability, speed, scalability, large ecosystem, and event-stream processing integrations.

You can read more about Kafka’s benefits over traditional messaging middleware in this white paper, and this article that benchmarks Kafka against other systems.

Kafka has five core APIs for JVM-based languages:

  • The Admin API allows you to manage topics, brokers, ACLs and other objects (see the sketch after this list)
  • The Producer API allows client applications to send events to topics in a Kafka cluster (docs)
  • The Consumer API allows client applications to read streams of event data from topics in a Kafka cluster (docs)
  • The Streams API is used to create stream-processing applications
  • The Connect API is used to implement integrations with systems external to Kafka
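
To illustrate the Admin API mentioned in the list above, here's a minimal sketch that creates a topic and then lists the topics in a cluster (the topic name, partition count, and replication factor are illustrative):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // Create a topic with 1 partition and a replication factor of 1.
  admin.createTopics(Collections.singleton(new NewTopic("my-topic", 1, (short) 1)))
       .all().get();

  // List the topics in the cluster.
  admin.listTopics().names().get().forEach(System.out::println);
}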

Apache Kafka® Concepts

Kafka runs as a cluster of one or more servers, known as brokers.

Brokers store event data on disk and provide redundancy through replication. Kafka scales horizontally, which gives resilience along with performance benefits, and allows clients to process data in parallel.

Clients interact with the cluster using APIs defined in binary protocols, communicated over TCP/IP. Clients produce or consume event data, typically using language-based libraries.

Client libraries are available in many languages and frameworks, including Java, C/C++, C#, Python, Go, Node.js, and Spring Boot. There is also a REST API for interacting with Kafka.

Kafka runs in a cluster of nodes, called brokers. Brokers are designed to be simple. They do not process data themselves and they only view event data as opaque arrays of bytes. A broker's main job is to provide the core functionality of storing, replicating, and serving data to clients.

Brokers replicate data amongst other brokers in the cluster for the purposes of resilience and availability.

A producer application connects to a set of initial brokers in the cluster, and requests metadata information about the cluster. Using this information, the producer determines which broker to send events to, based on a partitioning strategy.

Consumer applications often work in coordination, called a consumer group. One of the brokers in a cluster has the additional responsibility of coordinating applications that are part of a consumer group.

For a full introduction to Kafka design, see this free Kafka 101 course.

Kafka brokers store data on local disk. Broker behavior is controlled by configuration parameters, and the directory that Kafka stores data in is configured by the log.dir value.

Moving "warm" data to other storage types yet keeping it accessible is possible using Tiered Storage, implemented in both Confluent Cloud and Confluent Platform. There are also community efforts underway as part of KIP-405.

You can read more about how Kafka stores data here.

In Kafka, events are produced and consumed in batches. The maximum size for a batch of events is controlled by the message.max.bytes configuration on the broker. This can be overridden on a per-topic basis by setting the max.message.bytes configuration for the topic.

Somewhat confusingly, these two parameters are indeed different on the broker (message.max.bytes) and topic (max.message.bytes)!
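
As an illustration of the per-topic override, here's a hedged sketch that sets max.message.bytes on a single topic using the Admin API (the topic name and size are made up for the example):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
  AlterConfigOp setMaxMessageBytes = new AlterConfigOp(
      new ConfigEntry("max.message.bytes", "2097152"),  // 2 MB
      AlterConfigOp.OpType.SET);

  // Apply the per-topic override.
  Map<ConfigResource, Collection<AlterConfigOp>> updates =
      Collections.singletonMap(topic, Collections.singletonList(setMaxMessageBytes));
  admin.incrementalAlterConfigs(updates).all().get();
}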

For more details see the configuration documentation. Be sure to check out the free Kafka 101 course on Confluent Developer.

A consumer group in Kafka is a single logical consumer implemented with multiple physical consumers for reasons of throughput and resilience.

When a single consumer cannot keep up with the throughput of messages that Kafka is providing, you can horizontally scale that consumer by creating additional instances of it.

The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are reassigned so that each member receives its proportional share of partitions. This is known as rebalancing the group. Kafka keeps track of the members of a consumer group and allocates data to them.

To utilize all of the consumers in a consumer group, there must be as many partitions as consumer group members. If there are fewer partitions than consumers in the group, then there will be idle consumers. If there are more partitions than consumers in the group, then consumers will read from more than one partition.

You can learn more about consumers in this free Apache Kafka 101 course.

Kafka consumers may be subscribed to a set of topics as part of a consumer group. When consumers join or leave the group, a broker, acting as the coordinator, assigns partitions to the consumers in the group as a way of evenly spreading the partition assignments across all members. This is known as rebalancing the group. You can read more about it in this blog.
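
If your application needs to react when partitions are assigned or revoked (for example, to commit offsets or flush state), you can pass a ConsumerRebalanceListener when subscribing. A minimal sketch, assuming a consumer configured as shown elsewhere on this page:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

import java.util.Arrays;
import java.util.Collection;

consumer.subscribe(Arrays.asList("mytopic"), new ConsumerRebalanceListener() {
  @Override
  public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    // Called before a rebalance takes partitions away from this consumer;
    // a good place to commit offsets or flush any in-flight work.
    System.out.println("Revoked: " + partitions);
  }

  @Override
  public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // Called after the group coordinator assigns partitions to this consumer.
    System.out.println("Assigned: " + partitions);
  }
});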

The offset in a Kafka topic is the location of any given event in a particular partition. In Kafka, events are stored in topics, which are linear and append only. Topics are further divided into partitions. Offsets are monotonically increasing and are represented by a 64-bit integer.

For more information on Kafka fundamentals, see this Kafka 101 course.

An in-sync replica (ISR) is a follower replica of a Kafka topic partition that has enough of the data from the partition's leader (the primary copy) to be considered in sync.

The degree to which a follower can lag from the leader and still be deemed to be in sync is configured by the broker setting replica.lag.time.max.ms.
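
To see which replicas of a topic are currently in sync, you can describe the topic. Here's a hedged sketch using the Admin API (the topic name is illustrative); the kafka-topics --describe command shows the same information:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  TopicDescription description = admin.describeTopics(Collections.singleton("my-topic"))
      .all().get().get("my-topic");

  // Print the leader and in-sync replicas for each partition.
  description.partitions().forEach(p ->
      System.out.printf("partition %d leader=%s isr=%s%n",
          p.partition(), p.leader(), p.isr()));
}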

For more details see this deep-dive blog and the documentation.

Lag in Kafka can refer to two things:

  • How far behind a consumer is in reading the available messages. Depending on the consumer’s purpose, it may need to read and process messages at a greater rate. A consumer group provides for stateless horizontal scalability.

  • How far behind a follower partition is in replicating the messages from the leader partition. If a follower gets too far behind then it becomes an out-of-sync replica.

Install and Run

Confluent Platform includes Apache Kafka. You can download and unpack the community edition by running this command:

wget https://packages.confluent.io/archive/7.0/confluent-community-7.0.0.tar.gz
tar -xf confluent-community-7.0.0.tar.gz

Confluent provides multiple ways to install and run Kafka, including Docker, ZIP, TAR, Kubernetes, and Ansible.

A great alternative to having to install and run Kafka yourself is to use the fully managed cloud service provided by Confluent Cloud.

The recommended approach for running Kafka on Windows is to run it under WSL 2 (Windows Subsystem for Linux). This blog post provides step-by-step instructions showing you how to run Kafka on Windows.

Note that while this is fine for trying out Kafka, Windows isn’t a recommended platform for running Kafka with production workloads. If you are using Windows and want to run Kafka in production, then a great option is to use Confluent Cloud along with the provided Confluent CLI (which is supported on Microsoft Windows) to interact with it.

If you install Kafka using the .tar.gz package from the Kafka website, it will be in whichever folder you unpack the contents into. The same applies if you install Confluent Platform manually.

If you install Confluent Platform using the RPM or Debian packages, then you will by default find the data files in /var/lib and configuration files in /etc (e.g. /etc/kafka).

For more details on specific paths and installation methods see the documentation.

  1. Paste the Docker Compose file below into a file called docker-compose.yml.

    ---
    version: '3'
    services:
      zookeeper:
        image: confluentinc/cp-zookeeper:7.0.0
        container_name: zookeeper
        environment:
          ZOOKEEPER_CLIENT_PORT: 2181
          ZOOKEEPER_TICK_TIME: 2000
    
      broker:
        image: confluentinc/cp-kafka:7.0.0
        container_name: broker
        ports:
        # To learn about configuring Kafka for access across networks see
        # https://www.confluent.io/blog/kafka-client-cannot-connect-to-broker-on-aws-on-docker-etc/
          - "9092:9092"
        depends_on:
          - zookeeper
        environment:
          KAFKA_BROKER_ID: 1
          KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_INTERNAL:PLAINTEXT
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092,PLAINTEXT_INTERNAL://broker:29092
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
          KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
  2. Run docker-compose up -d.

For more details see the Docker quick start.

There are many other Docker images you can run alongside the broker including one for a Kafka Connect worker, a ksqlDB instance, a REST proxy, Schema Registry, etc. For a Confluent Platform demo that includes those services along with a greater set of features, including security, Role-Based Access Control, Kafka clients, and Confluent Replicator, see cp-demo.

Confluent for Kubernetes (CFK) runs on Kubernetes and provides a cloud-native control plane with a declarative Kubernetes-native API approach to configure, deploy, and manage Apache Kafka®, Connect workers, ksqlDB, Schema Registry, Confluent Control Center, and resources like topics and rolebindings, through Infrastructure as Code (IaC).

To install CFK, run the following:

helm repo add confluentinc https://packages.confluent.io/helm
helm repo update
helm upgrade --install confluent-operator confluentinc/confluent-for-kubernetes

For an overview of Kafka on Kubernetes, see this talk from Kafka Summit.

Once you have downloaded Confluent Platform, start ZooKeeper using the Confluent CLI.

confluent local services zookeeper start

Or you can start the whole platform all at once:

confluent local services start

Keep in mind that you can also launch the Apache Kafka broker in KRaft mode (which is experimental as of Confluent Platform 7.0.0), which means that it runs without ZooKeeper. See this page for more details.

Topics in Kafka

A Kafka topic describes how messages are organized and stored. Topics are defined by developers and often model entities and event types. You can store more than one event type in a topic if appropriate for the implementation.

Kafka topics can broadly be thought of in the same way as tables in a relational database, which are used to model and store data. Some examples of Kafka topics would be:

  • orders
  • website_clicks
  • network_events
  • customers

Topics can be partitioned, and partitions are spread across the available Kafka brokers.

To read data from a Kafka topic in your application, use the Consumer API provided by one of the client libraries (for example Java, C/C++, C#, Python, Go, Node.js, or Spring Boot).

You can also read data from a Kafka topic using a command-line interface (CLI) tool such as kcat (formerly known as kafkacat) or kafka-console-consumer.

Confluent also provides a web interface for browsing messages in a Kafka topic, available on-premises and on Confluent Cloud.

To list Kafka topics use the kafka-topics command-line tool:

./bin/kafka-topics --bootstrap-server localhost:9092 --list

Using Confluent you can also view a list of topics with your web browser.

Many Kafka users have settled on between 12 and 24 partitions per topic, but there really is no single answer that works for every situation.

There are a few key principles that will help you in making this decision, but ultimately, performance testing with various numbers of partitions is the safest route:

  • The topic partition is the unit of parallelism in Kafka
    • Writes to different partitions can be fully parallel
    • A partition can only be read by one consumer in a group, so a consumer group can only effectively grow to the number of partitions
  • Increasing the number of partitions can be a costly exercise, for two reasons:
    • Existing data is not redistributed to the new partitions unless you move it manually
    • Records with the same key may be routed to a different partition after the change, which breaks key-based ordering guarantees
  • Higher partition counts will require more resources on brokers and clients
    • Partitions are backed by index and data files, so more partitions means more open file handles
    • Producers will buffer records for each partition before sending them, so more partitions will require more memory for those buffers

You can read more in this blog post by Jun Rao (one of the original creators of Apache Kafka®).

You can create a Kafka topic with the kafka-topics.sh command-line tool:


./bin/kafka-topics.sh --create --partitions 1 --replication-factor 1 \
                      --topic my-topic --bootstrap-server localhost:9092

You can also use the Confluent CLI to create a topic:

confluent kafka topic create <topic> [flags]

Another option is the Confluent Cloud Console, where you can simply click the Create topic button on the Topics page.

While there is no set limit to the number of topics that can exist in a Kafka cluster, currently Kafka can handle hundreds of thousands of topics, depending on the number of partitions in each.

With Kafka's new KRaft mode, that number will be in the millions.

You can delete a Kafka topic with the kafka-topics.sh tool:

./bin/kafka-topics.sh --delete --topic my-topic \
                      --bootstrap-server localhost:9092

You can also use the Confluent CLI to delete a topic:

confluent kafka topic delete my-topic [flags]

Another option is the web-based Confluent Cloud Console, where you can click on the topic on the Topics page, then go to the Configuration tab and click Delete topic.

To count the number of messages in a Kafka topic, you should consume the messages from the beginning of the topic and increment a counter.
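
Another common approach, sketched below, is to sum the gap between each partition's beginning and end offsets; note that for compacted topics or topics with transaction markers this gives an upper bound rather than an exact count:

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

try (Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
  // Find all partitions of the topic.
  List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
      .map(info -> new TopicPartition(info.topic(), info.partition()))
      .collect(Collectors.toList());

  Map<TopicPartition, Long> beginning = consumer.beginningOffsets(partitions);
  Map<TopicPartition, Long> end = consumer.endOffsets(partitions);

  // Sum the per-partition offset ranges.
  long count = partitions.stream()
      .mapToLong(tp -> end.get(tp) - beginning.get(tp))
      .sum();
  System.out.println("Approximate message count: " + count);
}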

This Kafka Tutorial shows some specific examples using either a command-line tool or ksqlDB.

For further discussion see this blog post.

To delete the contents of a Kafka topic, do the following:

  1. Change the retention time on the topic:

    ./bin/kafka-configs --bootstrap-server localhost:9092 --alter \
                        --entity-type topics --entity-name orders \
                        --add-config retention.ms=0
  2. Wait for the broker log manager process to run.

    If you inspect the broker logs, you'll see something like this:

    INFO [Log partition=orders-0, dir=/tmp/kafka-logs] 
      Found deletable segments with base offsets [0] due to 
      Retention time 0ms breach (kafka.log.Log)
  3. Restore the retention time on the topic to what it was previously, or remove it as shown here:

    ./bin/kafka-configs --bootstrap-server localhost:9092 --alter \
                        --entity-type topics --entity-name orders \
                        --delete-config retention.ms
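
Alternatively, you can truncate a topic directly with the Admin API's deleteRecords call instead of temporarily changing retention. A hedged sketch (the partition and offset are illustrative):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // Delete everything before (but not including) offset 100 in partition 0.
  Map<TopicPartition, RecordsToDelete> toDelete = new HashMap<>();
  toDelete.put(new TopicPartition("orders", 0), RecordsToDelete.beforeOffset(100L));

  admin.deleteRecords(toDelete).all().get();
}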

A few things to be aware of when clearing a topic:

  • Are you trying to recreate a "message queue?" Then you might have the wrong model of a log—check out our free Kafka 101 course for a refresher.
  • Are you trying to clear out the topic as part of a test process? Are you aware that offsets cannot be reset even if messages are reaped from a topic? Also do you know that record deletion is nondeterministic? Maybe it’s possible to use ephemeral topics for your testing or short-term use cases. If you need the same topic name, that could be a code/configuration smell.

Kafka Clients

Here's an example in Python of a Kafka producer. For more languages and frameworks, including Java, .NET, Go, JavaScript, and Spring see our Getting Started guides.

from confluent_kafka import Producer

# Configure a Producer
config = {
  "bootstrap.servers": "localhost:8080"
}
producer = Producer(config)

# Create an error-handling callback.
def delivery_callback(err, msg):
  if err:
      print(f"ERROR: Message failed delivery: {err}")
  else:
      print(f"Produced event key = {msg.key()} value = {msg.value()}")

# Produce data (sample key and value for illustration).
mykey = "key1"
myvalue = "value1"
producer.produce("mytopic", key=mykey, value=myvalue, callback=delivery_callback)

# Cleanup
producer.flush()

Check out our client code samples and free training courses to learn more.

Here's an example of a consumer in Python. For more languages and frameworks, including Java, .NET, Go, JavaScript, and Spring see our Getting Started guides.

from confluent_kafka import Consumer

# Configure a Consumer
config = {
  "bootstrap.servers": "localhost:8080",
  "group.id": "example-consumer-group",
  "auto.offset.reset": "earliest"
}
consumer = Consumer(config)

consumer.subscribe(["mytopic"])

# Poll for new messages from Kafka and print them
try:
  while True:
    msg = consumer.poll(1.0)
    if msg is None:
      print("Waiting...")
    elif msg.error():
      print(f"ERROR: {msg.error()}")
    else:
      print(f"Consumed event key = {msg.key().decode('utf-8')} value = {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
  pass
finally:
  # Cleanup.
  consumer.close()

Check out our client code samples and free training courses to learn more.

A consumer group in Kafka is a single logical consumer implemented with multiple physical consumers for reasons of throughput and resilience.

When a single consumer cannot keep up with the throughput of messages that Kafka is providing, you can horizontally scale that consumer by creating additional instances of it.

The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are reassigned so that each member receives its proportional share of partitions. This is known as rebalancing the group. Kafka keeps track of the members of a consumer group and allocates data to them.

To utilize all of the consumers in a consumer group, there must be as many partitions as consumer group members. If there are fewer partitions than consumers in the group, then there will be idle consumers. If there are more partitions than consumers in the group, then consumers will read from more than one partition.

You can learn more about consumers in this free Apache Kafka 101 course.

Group ID is a configuration item in Kafka consumers. If you want multiple consumers to share a workload, you give them the same group.id. If you want a consumer to work independently, you give it a unique group.id. Group ID is just a string.

Kafka consumers are built to scale across multiple machines, by working in groups. Each consumer tells the Kafka broker which group it belongs to, and then outbound messages will be automatically load-balanced among members of that group.

For example, you might have a purchases topic. The system you’re writing is to trigger a notification when a customer buys something. Sending out large volumes of email can be slow, so you might have five machines all consuming with a group.id = "email", so they can share that load. At the same time, you might want to summarize those purchases to get sales figures, and that might only need one machine with its own group.id = "sales".

Group IDs should not be confused with client IDs. A group ID will affect the way records are consumed, but a client ID is just a label.

A client ID in Kafka is a label you define that names a particular consumer or producer. You can give your client a friendly name so that debugging is easier. For details see the consumer and producer documentation.

Client IDs should not be confused with group IDs. A group ID will affect the way records are consumed, but a client ID is just a label.

To connect to Kafka you need to find a client library for your language, or use the REST Proxy. There are client libraries for over 20 programming languages, so finding one is generally easy.

You can see some language-specific examples further down this page, or take a look at the documentation for your language's client, but the process is always:

Consumers (read-only connections)

  1. Create a Consumer, configuring it with connection details like the host:port of your Kafka bootstrap server.
  2. Tell that consumer which topic(s) you want to consume.
  3. Poll for new events and handle them as they come in.

Producers (write-only connections)

  1. Create a Producer, configuring it with connection details like the host:port of your Kafka bootstrap server.
  2. Send events from your code, passing the topic to write to, the event, and an error callback.

Kafka connections are separated into producers (write-only connections) and consumers (read-only connections).

Example Java producer

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

// Configure and connect.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);

// Create and send an event (sample key and value for illustration).
String myKey = "key1";
String myValue = "value1";
producer.send(
  new ProducerRecord<>("mytopic", myKey, myValue),
  (metadata, exception) -> {
    if (exception != null) {
      exception.printStackTrace();
    }
  });

// Cleanup
producer.flush();
producer.close();

Example Java consumer

import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

// Configure and connect.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("group.id", "example-consumer-group");
props.put("auto.offset.reset", "earliest");

Consumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("mytopic"));

try {
  while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
      String key = record.key();
      String value = record.value();
      System.out.println(String.format("Consumed event key = %s value = %s", key, value));
    }
  }
} finally {
  // Cleanup.
  consumer.close();
}

For more, see the full Getting Started with Apache Kafka and Java walkthrough.

Kafka Streams

Kafka Streams is a Java library for building applications and microservices. It provides stream processing capabilities native to Apache Kafka. For example, here is a word count topology built with the Streams DSL:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("TextLinesTopic");
KTable<String, Long> wordCounts = textLines
    .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));

Here are some of the things you can do with Kafka Streams:

  • Transformations
  • Filtering
  • Aggregations
  • Joining
  • Merging and splitting streams

Applications using Kafka Streams can be stateful, provide exactly-once semantics, and can be scaled horizontally in exactly the same way you would deploy and scale any other Java application.
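
To make that concrete, here's a hedged sketch of a small Streams application that filters one topic into another and is configured for exactly-once processing (topic names and the predicate are illustrative):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-filter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
// Exactly-once processing (requires Kafka Streams 3.0 or later).
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");

// Keep only the records whose value matches a simple (illustrative) predicate
// and write them to another topic.
orders.filter((key, value) -> value.contains("high"))
      .to("high-priority-orders");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();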

Learn more about Kafka Streams in this free course.

Because Kafka Streams is part of Apache Kafka, it has very good integration with Kafka itself. This means that things like exactly-once processing semantics are possible, and security is tightly integrated.

With Kafka Streams you use your existing development, testing, and deployment tools and processes. You don’t need to deploy and manage a separate stream processing cluster.

Learn more about Kafka Streams in this free course.

Stream processing, also known as event-stream processing (ESP), real-time data streaming, and complex event processing (CEP), is the continuous processing of real-time data—directly as it is produced or received.

Both Kafka Streams and ksqlDB allow you to build applications that leverage stream processing.

Learn more about Kafka Streams in this free course or get started with ksqlDB by taking its free course.

Kafka Streams is a distributed processing framework similar to Apache Flink or Spark Streaming. But it offers some distinct advantages over these other stream-processing libraries:

  • Kafka Streams is simply a Java app. You create your application, build a JAR file and start it. No dedicated processing cluster is needed!
  • Kafka Streams can dynamically scale when needed. For more processing power you just start a new application instance. To scale down, you stop one or more instances. In either case, Kafka Streams will dynamically handle resource allocation and continue working.

To split a stream using Kafka Streams you use the KStream#split method, which returns a BranchedKStream.

The BranchedKStream allows you to create different branches based on predicates. For example:

myStream = builder.stream(inputTopic);

myStream.split()
    .branch((key, appearance) -> "drama".equals(appearance.getGenre()),
        Branched.withConsumer(ks -> ks.to("drama-topic")))
    .branch((key, appearance) -> "fantasy".equals(appearance.getGenre()),
        Branched.withConsumer(ks -> ks.to("fantasy-topic")))
    .branch((key, appearance) -> true,
        Branched.withConsumer(ks -> ks.to("default-topic")));

More resources for learning how to split a stream are available in the Kafka Streams documentation and on Confluent Developer.

No, Kafka Streams applications do not run inside the Kafka brokers.

Kafka Streams applications are normal Java applications that happen to use the Kafka Streams library. You would run these applications on client machines at the perimeter of a Kafka cluster. In other words, Kafka Streams applications do not run inside the Kafka brokers (servers) or the Kafka cluster—they are client-side applications.

ksqlDB

ksqlDB is a way of interacting with a Kafka cluster using SQL. It allows you to write high-level stream operations (CREATE STREAM ...), queries (SELECT ... FROM ...), and aggregations (GROUP BY ...)—using a language that will be familiar to anyone with a background in relational databases.

Under the hood, it can be thought of as a declarative language sitting on top of Kafka Streams. Many tasks you might have wanted Kafka Streams for, such as joins and aggregations, can be written and deployed in minutes with ksqlDB.

(Note: ksqlDB was originally released under the name "KSQL." Older documentation may still refer to it as "KSQL," but it's the same thing.)

ksqlDB is licensed under the Confluent Community License, which is a source-available license, but not an open source license under the OSI definition.

You're free to download, modify and redistribute the source code for ksqlDB, save for a few excluded purposes that the license FAQ explains in detail.

ksqlDB (formerly "KSQL") is largely similar to SQL. An example statement might look like this:

SELECT TS, USER, LAT, LON
  FROM USER_LOCATION_STREAM
EMIT CHANGES;

ksqlDB strives to be compatible with the SQL standard wherever appropriate, and is an active member of the standards committee, which is working to extend SQL to cover event-streaming databases.

ksqlDB is available to install from Docker, Debian, RPM, or as a Tarball.

You can also get it standalone, as part of Confluent Platform, or on Confluent Cloud.

Under the hood, ksqlDB is powered by Kafka Streams, which is in turn built on top of Kafka's consumer/producer architecture. ksqlDB provides the high-level language and easy deployment of new streams/tables, while behind the scenes Kafka Streams provides the processing, persistence and scaling engine.

For a deep dive, see Rick Spurgeon's blog post Sharpening your Stream Processing Skills with Kafka Tutorials.

ksqlDB has a dedicated Java client (JavaDoc) that lets you interact with your ksqlDB server directly from your Java code. Here's an example:

client.streamQuery("SELECT * FROM MY_STREAM EMIT CHANGES;")
    .thenAccept(result -> {
      System.out.println("Query has started. Query ID: " + result.queryID());

      RowSubscriber subscriber = new RowSubscriber();
      result.subscribe(subscriber);
    }).exceptionally(e -> {
      System.out.println("Request failed: " + e);
      return null;
    });

The Java client allows you to create and manage streams, tables and persistent queries, insert new data, and run streaming and batch-style queries.
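
For example, here's a hedged sketch that uses the Java client to create a stream and insert a row (the host, port, and stream definition are illustrative):

import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;
import io.confluent.ksql.api.client.KsqlObject;

ClientOptions options = ClientOptions.create()
    .setHost("localhost")
    .setPort(8088);
Client client = Client.create(options);

// Create a stream over a Kafka topic.
client.executeStatement(
    "CREATE STREAM readings (sensor VARCHAR KEY, reading DOUBLE) "
    + "WITH (KAFKA_TOPIC='readings', VALUE_FORMAT='JSON', PARTITIONS=1);").get();

// Insert a row into the stream.
client.insertInto("READINGS",
    new KsqlObject().put("SENSOR", "s-1").put("READING", 22.5)).get();

client.close();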

You can also write user-defined functions in Java for ksqlDB.

ksqlDB has community-supported clients for .NET, Golang and Python. There is also the Confluent-supported Java client.

Each client lets you interact with the ksqlDB server directly from your code. To illustrate, here's a Golang example:

k := `SELECT
  TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd','Europe/London') AS WINDOW_START,
  DOG_SIZE,
  DOGS_CT
FROM DOGS_BY_SIZE
WHERE DOG_SIZE=?;`

stmnt, err := ksqldb.QueryBuilder(k, "middle")
if err != nil {
	log.Fatal(err)
}

fmt.Println(*stmnt)

The exact features vary by language, so check their documentation, but in general they allow you to create and manage streams, tables and persistent queries, insert new data, and run streaming and batch-style queries.

ksqlDB places a large subset of Kafka Streams functionality into an easier-to-use, easier-to-deploy package. So, "use ksqlDB when you can," is a fair rule of thumb.

A more nuanced answer should take into account your specific use cases, as well as the programming languages your team is comfortable with: Kafka Streams is only available in Java and Scala, whereas ksqlDB is open to anyone who can write a SQL statement.

This blog post goes into good detail about the tradeoffs.

It's fair to describe ksqlDB as an accessible, high-level way of using Kafka Streams. Streams is the engine under the hood, so you can expect ksqlDB to perform similarly.

That raises the question "How fast is Kafka Streams?" and that depends on a number of factors including your topics' partition sizes, the size of the persistent datasets in your tables and joins, and the number of ksqlDB server instances you spread the load over. For a deep dive on that topic, see Guozhang Wang's Kafka Summit talk, Performance Analysis and Optimizations for Kafka Streams Applications.

ksqlDB was originally released under the name KSQL. They're the same thing, but older documentation may still refer to it as KSQL.

KSQL is also still used to refer to the actual language used to program ksqlDB.

Kafka Connect

Kafka Connect is a tool that provides integration for Kafka with other systems, both sending data to and receiving data from them. It is part of Apache Kafka. Kafka Connect is configuration-driven; you don't need to write any code to use it.

Kafka Connect manages crucial components of scalable and resilient integration including:

  • Offset tracking
  • Restarts
  • Schema handling
  • Scale out

With Kafka Connect, you can use hundreds of existing connectors or you can write your own. You can use Kafka Connect with managed connectors in Confluent Cloud or run it yourself. Kafka Connect is deployed as its own process (known as a worker), separate from the Kafka brokers.

Learn more about Kafka Connect in this free course.

Kafka Connect is used for integrating other external systems with Kafka. This includes:

  • Database CDC (snapshotting an entire database table into Kafka, then sending every subsequent change to that table)
  • Streaming data from a message queue such as ActiveMQ or RabbitMQ into Kafka
  • Pushing data from a Kafka topic to a cloud data warehouse such as Snowflake or BigQuery
  • Streaming data to NoSQL stores like MongoDB or Redis from Kafka

Learn more about Kafka Connect in a free course.

The component that runs connectors in Kafka Connect is known as a worker.

Kafka Connect workers can be deployed on bare metal, Docker, Kubernetes, etc. Here's how you'd run it directly from the default Kafka installation:

./bin/connect-distributed ./etc/kafka/connect-distributed.properties

Many connectors are also available as a fully managed service in Confluent Cloud.

You can run Kafka Connect yourself or use it as a fully managed service in Confluent Cloud.

If you are running Kafka Connect yourself, there are two steps to creating a connector in Kafka Connect:

  1. Run your Kafka Connect worker. Kafka Connect workers can be deployed on bare metal, Docker, Kubernetes, etc. Here's how you'd run it directly from the default Kafka installation:

    ./bin/connect-distributed ./etc/kafka/connect-distributed.properties
  2. Use the REST API to create an instance of a connector:

    curl -X PUT -H  "Content-Type:application/json" http://localhost:8083/connectors/sink-elastic-01/config \
        -d '{
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics"         : "orders",
        "connection.url" : "http://elasticsearch:9200",
        "type.name"      : "_doc",
        "key.ignore"     : "false",
        "schema.ignore"  : "true"
        }'

To use Kafka Connect on Confluent Cloud you can use the web interface to select and configure the connector that you want to use. There is also a CLI and API for managed connectors on Confluent Cloud.

The specific configuration elements will vary for each connector.

Kafka can easily integrate with a number of external databases through Kafka Connect. Depending on the data source, Kafka Connect can be configured to stream either incremental database changes or entire databases, row by row, into Kafka.

Learn more about it in this module of the free Kafka Connect pipelines course.

Change Data Capture (CDC) can easily be done using a Kafka Connect source connector. It works with Kafka Connect by monitoring a database, recording changes, and streaming those changes into a Kafka topic for downstream systems to react to. Depending on your source database, there are a number of Kafka Connectors available, including but not limited to MySQL (Debezium), Oracle, and MongoDB (Debezium).

Some of these connectors are built in conjunction with Debezium, an open-source CDC tool.

Learn more in this free training course.

If you're using Confluent Cloud you can take advantage of the managed connectors provided.

The process of installing Kafka Connect is relatively flexible so long as you have access to a set of Kafka Brokers. These brokers can be self-managed, or brokers on a cloud service such as Confluent Cloud.

Workers – the components that run connectors in Kafka Connect – can be installed on any machines that have access to Kafka brokers.

Kafka Connect can be installed:

  • on bare metal machines, in either standalone (one Kafka Connect instance) or distributed (multiple Kafka Connect instances forming a cluster) mode;
  • in containers using Docker, Kubernetes, etc.

After installing the Kafka Connect worker you will need to install Kafka Connect plugins such as connectors and transformers.

XML data is just formatted text, so it's quite at home in the value field of a Kafka message. If you want to move XML files into Kafka, there are a number of connectors available to you.

See the Quick Start for more information, as well as the deep-dive blog Ingesting XML data into Kafka.

Syslog data can be ingested into Kafka easily using the Syslog Source Connector.

For more information on how to get started with your syslog data in Kafka, check out this blog post!

CSV files can be parsed and read into Kafka using Kafka Connect, and there are several connectors that can do the job.

See the Quick Start for more information.

Security

Apache Kafka does not support end-to-end encryption natively, but you can achieve this by using a combination of TLS for network connections and disk encryption on the brokers.

If you are looking for a more seamless end-to-end encryption solution, keep an eye on KIP-317, as it aims to provide that functionality.

By default, Apache Kafka data is not encrypted at rest. Encryption can be provided at the OS or disk level using third-party tools.

In Confluent Cloud, data is encrypted at rest. More details can be found here.

Apache Kafka supports various SASL mechanisms, such as GSSAPI (Kerberos), OAUTHBEARER, SCRAM, LDAP, and PLAIN. The specific details for configuring SASL depend on the mechanism you are using. Detailed instructions for each type can be found here.
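
On the client side, SASL is configured with a few properties. A hedged sketch for SASL/PLAIN over TLS (the bootstrap address, username, and password are placeholders):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker.example.com:9092");
// Authenticate with SASL/PLAIN over an encrypted TLS connection.
props.put("security.protocol", "SASL_SSL");
props.put("sasl.mechanism", "PLAIN");
props.put("sasl.jaas.config",
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    + "username=\"<username>\" password=\"<password>\";");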

To configure ACLs in Apache Kafka, you must first set the authorizer.class.name property in server.properties: authorizer.class.name=kafka.security.authorizer.AclAuthorizer. This will enable the out-of-the-box authorizer.

Then you can add and remove ACLs using the kafka-acls.sh script. Here's an example of adding an ACL:

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
--add --allow-principal User:Tim --allow-host 198.51.100.0 \
--operation Read --operation Write --topic my-topic
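
The same ACL can also be created programmatically with the Admin API. A hedged sketch mirroring the command above:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

import java.util.Arrays;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  ResourcePattern topic = new ResourcePattern(ResourceType.TOPIC, "my-topic", PatternType.LITERAL);

  // Allow User:Tim to read and write my-topic from the given host.
  AclBinding readAcl = new AclBinding(topic,
      new AccessControlEntry("User:Tim", "198.51.100.0", AclOperation.READ, AclPermissionType.ALLOW));
  AclBinding writeAcl = new AclBinding(topic,
      new AccessControlEntry("User:Tim", "198.51.100.0", AclOperation.WRITE, AclPermissionType.ALLOW));

  admin.createAcls(Arrays.asList(readAcl, writeAcl)).all().get();
}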

More details and different use cases can be found here.

Apache Kafka does not support Role-Based Access Control (RBAC) by default.

Confluent adds RBAC support to Kafka, allowing you to define group policies for accessing services (reading/writing to topics, accessing Schema Registry, etc.) and environments (dev/staging/prod, etc.), across all of your clusters. You can learn more about it in the blog post Introducing Cluster Authorization Using RBAC, Audit Logs, and BYOK and in the reference documentation Authorization using Role-Based Access Control.

Kafka Operations

The most common way to monitor Kafka is by enabling JMX. JMX can be enabled by setting the JMX_PORT environment variable, for example:

JMX_PORT=9999 bin/kafka-server-start.sh config/server.properties

Once JMX is enabled, standard Java tooling such as jconsole can be used to observe Kafka status.
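
You can also read broker metrics programmatically over JMX. A minimal sketch, assuming a broker started with JMX_PORT=9999 as above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

JMXServiceURL url =
    new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
  MBeanServerConnection connection = connector.getMBeanServerConnection();

  // Read the one-minute rate of incoming messages across all topics.
  Object rate = connection.getAttribute(
      new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"),
      "OneMinuteRate");
  System.out.println("Messages in per second (1 min rate): " + rate);
}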

The documentation provides more details related to monitoring Kafka, including available metrics, and Confluent provides Confluent Control Center for an out-of-the-box Kafka cluster monitoring system.


If you want to inspect the health of a broker that is already running, and you have access to the server, you can check that the process is running:

jps | grep Kafka

And you can also check that it is listening for client connections (port 9092 by default):

nc -vz localhost 9092

The default ports used for Kafka and for services in the Kafka ecosystem are as follows:

Service               Default Port
Kafka Clients         9092
Kafka Control Plane   9093
ZooKeeper             2181
Kafka Connect         8083
Schema Registry       8081
REST Proxy            8082
ksqlDB                8088

By default Kafka listens for client connections on port 9092. The listeners configuration is used to configure different or additional client ports. For more details on configuring Kafka listeners for access across networks see this blog about advertised.listeners.

If you have terminal access to the broker machine, you can pass the --version flag to many of the Kafka commands to see the version. For example:

bin/kafka-topics.sh --version
3.0.0 (Commit:8cb0a5e9d3441962)

If your Kafka broker has remote JMX enabled, you can obtain the version with a JMX query, for example:

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=app-info \
  --attributes version --one-time true
Trying to connect to JMX url: service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi.
"time","kafka.server:type=app-info:version"
1638974783597,3.0.0

If you need to do software upgrades, broker configuration updates, or cluster maintenance, then you will need to restart all of the brokers in your Kafka cluster. To do this, you can do a rolling restart. Restarting the brokers one at a time provides high availability since it avoids downtime for your end users.

See the rolling restart documentation for a detailed workflow, including considerations and tips for other cluster maintenance tasks.

Use the kafka-server-stop.sh script located in the installation path's bin directory:

bin/kafka-server-stop.sh

This will work if you've installed Kafka using the Confluent Platform or Apache Kafka tarball installations.

For more details and other installation options such as RPM and Debian see the documentation and Confluent Developer.

The directory where Kafka stores data is set by the configuration log.dir on the broker.

By default this is /tmp/kafka-logs. If you are using the RPM or Debian installation of Confluent Platform, the default data directory is /var/lib/kafka.

Apache ZooKeeper™ is a service for coordinating configuration, naming, and other synchronization tasks for distributed systems.

Currently, ZooKeeper provides the authoritative store of metadata holding the system’s most important facts: broker information, partition locations, replica leadership, and so on.

In the future, ZooKeeper will not be required for Kafka, once KRaft mode is production ready.

Generally, production environments can start with a small cluster of three nodes and scale as necessary. Specifically, ZooKeeper should be deployed on 2n + 1 nodes, where n is any number greater than 0. An odd number of servers is required so that ZooKeeper can achieve a majority when electing a leader.

Did you know, you may not even need ZooKeeper?

  • In the future, ZooKeeper will not be required for Kafka operation, see this KRaft mode explanation for details.
  • Confluent Cloud provides a fully managed Kafka service so you don't have to be concerned with either ZooKeeper or KRaft.

As of version 3.0, Kafka still needs Apache ZooKeeper when deployed in production.

Kafka version 2.8 onwards includes a preview of Kafka Raft metadata mode, known as KRaft. With KRaft, there is no need for ZooKeeper, since Kafka itself is responsible for metadata management using a new event-driven consensus mechanism.

Learn more about KRaft here.

As of Kafka 3.0, an early release of Kafka's KRaft mode is available to preview, but it is not ready for production workloads. Development is underway to fully support all of the features currently provided by ZooKeeper.

You can learn more about KRaft mode here.

Monitoring Kafka

When monitoring Kafka, you need to be able to answer questions such as:

  • Are my applications receiving all data?
  • Are my business applications showing the latest data?
  • Why are my applications running slowly?
  • Do I need to scale up?
  • Can my data get lost?

There are many monitoring options for your Kafka cluster and related services. If you are using Confluent, you can use Confluent Health+, which includes a cloud-based dashboard, has many built-in triggers and alerts, has the ability to send notifications to Slack, PagerDuty, generic webhooks, etc., and integrates with other monitoring tools.

To use Health+, you'll need to enable Confluent Telemetry Reporter, which is already part of Confluent Platform, or you can install it separately:

yum install confluent-telemetry

This documentation page shows the collected metadata that powers Health+.

There are also various open source tools that can be combined to build powerful monitoring solutions, such as Prometheus and Grafana, or Beats, Elasticsearch, and Kibana, as well as various other tools discussed here.

You can enable remote JMX-based monitoring tools such as Prometheus to connect to your Kafka services, including brokers, Kafka Connect workers, etc., as well as to your clients, e.g. producers and consumers. Be sure to start the services with the JMX port open.

For example, to start a broker with JMX port 9999 open, run the following from your prompt:

export JMX_PORT=9999
./bin/kafka-server-start.sh

For additional documentation on JMX, see the documentation.

Consumer lag is an important performance indicator. It tells you the offset difference between the producer’s last produced message and the consumer group’s last commit. A large consumer lag, or a quickly growing lag, indicates that the consumer is not able to keep up with the volume of messages on a topic.
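
You can also compute that difference yourself with the Admin API, by comparing the group's committed offsets with each partition's latest offset. A hedged sketch (the group ID matches the consumer examples on this page):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
  // The offsets the group has committed so far.
  Map<TopicPartition, OffsetAndMetadata> committed =
      admin.listConsumerGroupOffsets("example-consumer-group")
           .partitionsToOffsetAndMetadata().get();

  // The latest offset of each of those partitions.
  Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
      .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
  Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
      admin.listOffsets(latestSpec).all().get();

  // Lag is the end offset minus the committed offset, per partition.
  committed.forEach((tp, offset) ->
      System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset()));
}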

The key metric to monitor for consumer lag is the MBean object: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<client_id>

To see consumer lag in action, see the scenario in this example.

There are a wide variety of tools that can monitor your Kafka cluster and applications. If you're already using Confluent Cloud or Confluent Platform, a good place to start is Confluent Health+, which offers a cloud-based dashboard and numerous built-in triggers, alerts, and integrations.

For Apache Kafka deployments, you can consider JMX-based monitoring tools, or you can build your own integration with open source tools such as Prometheus and Grafana, or commercial services such as Datadog.

Confluent

Confluent Platform is a complete, self-managed, enterprise-grade distribution of Apache Kafka®.

It enables you to connect, process, and react to your data in real time using the foundational platform for data in motion, which means you can continuously stream data from across your organization to power rich customer experiences and data-driven operations.

See the product page for Confluent Platform or its documentation page.

There is no such thing as "Confluent Kafka." Apache Kafka® is a project owned by the Apache Software Foundation. Confluent is one of the companies that contribute to its development.

Confluent provides a managed Kafka service called Confluent Cloud as well as on-premises software called Confluent Platform, which includes Kafka.

Components in Confluent Platform use a mix of Apache 2.0, Confluent Community License, and an enterprise license.

A developer license allows full use of Confluent Platform features free of charge for an indefinite duration. However, the license is limited to a single broker configuration per cluster. The developer license gives developers the freedom to try the Confluent Platform commercial features available in a non-production setting.

A trial (evaluation) license allows a free trial of commercial features in a production setting, and expires after 30 days.

For full details of the licenses and which components use which license, please refer to the documentation.

New signups to Confluent Cloud receive $400 to spend during their first 60 days.

Many components in Confluent Platform are either open source (Apache 2.0), or source available (Confluent Community License). These include:

  • Apache Kafka®
    • Kafka Connect
    • Kafka Streams
  • ksqlDB
  • Confluent Schema Registry
  • Confluent REST Proxy
  • Some of the connectors for Kafka Connect

Other components of Confluent require an enterprise license. They can be used under a 30-day trial, or indefinitely under the developer license, when used with a single broker cluster in a non-production setting.

For full details of the licenses and terms please see the documentation.

Schema Registry

Apache Avro is a serialization framework that relies on language-agnostic schemas defined in JSON, following the Avro specification. Avro supports clients written in Java, Python, C, and C#.

Since Avro uses schemas and supports many languages, it's a perfect fit for Schema Registry and Apache Kafka. For more information, read the introduction to Avro and documentation that covers using Avro with Schema Registry.

The Confluent Schema Registry provides a centralized serving layer for your schemas, along with a RESTful interface for storing and retrieving schemas written in Avro®, JSON Schema, or Protobuf.

Schema Registry lives outside of and separately from your Kafka brokers. When using it, your producers and consumers still talk to Kafka to publish and read data (messages) to and from topics. But concurrently, they also talk to Schema Registry to send and retrieve schemas that describe the data models for the messages.
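
For example, here's a hedged sketch of the producer configuration typically used with Avro and Schema Registry; the serializer registers and fetches schemas automatically (the Schema Registry URL is illustrative):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Serialize values as Avro, registering the schema with Schema Registry as needed.
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081");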

Working with schemas guarantees the formats of the objects you produce into and consume from an Apache Kafka broker. But making sure that every developer is using the same schema version can be challenging. Schema Registry gives you a centralized serving layer for your schemas that makes it much easier for everyone to stay in sync.

For a more in-depth explanation on the benefits of using Schema Registry, refer to two blogs by Gwen Shapira: Schemas, Contracts, and Compatibility and Yes, Virginia, You Really Do Need a Schema Registry.

The domain objects in your applications form implicit contracts with all of the developers that work on or interact with them. Schemas make these contracts explicit, reducing the chances of introducing errors when making changes, and thus helping with collaboration.

For a more in-depth explanation read How I Learned to Stop Worrying and Love the Schema.

One good way to check if Schema Registry is running is to use the REST API to run a request. For example, you could list all subjects in Schema Registry:

curl -X GET http://<SR HOST>:8081/subjects

If Schema Registry is up and running, you'll receive a valid response similar to this:

["Kafka-value","Kafka-key","my-cool-topic-value"]

Otherwise you'll get an error.

See Schema Registry API examples and a Schema Registry cheatsheet.

Architecture and Terminology

An event-driven architecture is an architecture based on producing, consuming, and reacting to events, either within a single application or as part of an intersystem communication model. Events are communicated via event streams, and interested consumers can subscribe to the event streams and process the events for their own business purposes.

Event-driven architecture enables loose coupling of producers and consumers via event streams, and is often used in conjunction with microservices. The event streams provide a mechanism of asynchronous communication across the organization, so that each participating service can be independently created, scaled, and maintained. Event-driven architectures are resistant to the impact of intermittent service failures, as events can simply be processed when the service comes back up. This is in contrast to REST API / HTTP communication, where a request will be lost if the server fails to reply.

An event stream is a durable and replayable sequence of well-defined domain events. Consumers independently consume and process the events according to their business logic requirements.

A topic in Apache Kafka is an example of an event stream.

Streaming data enables you to create applications and services that react to events as they happen, in real time. Your business can respond to changing conditions as they occur, altering priorities and making accommodations as necessary. The same streams of operational events can also be used to generate analytics and real-time insights into current operations.

Check out the free courses to learn more about the benefits of thinking in events as well as building streaming data pipelines.

Event sourcing is the capture of all changes to the state of an object, frequently as a series of events stored in an event stream. These events, replayed in the sequence in which they occurred, can be used to reconstruct both the intermediate and final states of the object.

An event broker hosts event streams so that other applications can consume from and publish to the streams via a publish/subscribe protocol. Kafka's event brokers are usually set up to be distributed, durable, and resilient to failures, to enable big data scale event-driven communication in real time.

Stream processing is an architectural pattern where an application consumes event streams and processes them, optionally emitting its own resultant events to a new set of event streams. The application may be stateless, or it may also build internal state based on the consumed events. Stream processing is usually implemented with a dedicated stream processing technology, such as Kafka Streams or ksqlDB.

Topics (also known as event streams) are durable and partitioned, and they can be read by multiple consumers as many times as necessary. They are often used to communicate state and to provide a replayable source of truth for consumers.

Queues are usually unpartitioned and are frequently used as an input buffer for work that needs to be done. Usually, each message in a queue is dequeued, processed, and deleted by a single consumer.

Distributed computing is an architecture in which components of a single system are located on different networked computers. These components communicate and coordinate their actions across the network, either using direct API calls or by sending messages to each other.

A microservice is a standalone and independently deployable application, hosted inside of a container or virtual machine. A microservice is purpose-built to serve a well-defined and focused set of business functions, and communicates with other microservices through either event streams or direct request-response APIs. Microservices typically leverage a common compute resource platform to streamline deployments, monitoring, logging, and dynamic scaling.

Read more about microservices and Kafka in this blog series.

Command Query Responsibility Segregation (CQRS) is an application architecture that separates commands (modifications to data) from queries (accessing data). This pattern is often used alongside event sourcing in event-driven architectures.

You can learn more about CQRS as part of the free Event Sourcing and Event Storage with Apache Kafka® training course.

A REST API is an Application Programming Interface (API) that adheres to the constraints of the Representational State Transfer (REST) architectural style. REST API is often used as an umbrella term to describe a client and server communicating via HTTP(S).

You can use the REST Proxy to send and receive messages from Apache Kafka.

A data lake is a centralized repository for storing a broad assortment of data sourced from across an organization, for the primary purpose of cross-domain analytical computation. Data may be structured, semi-structured, or unstructured, and is usually used in combination with big data batch processing tools.

Data may be loaded into a data lake in batch, but is commonly done by streaming data into it from Kafka.

Data mesh is an approach to solving data communication problems in an organization by treating data with the same amount of rigor as any other product. It is founded on four pillars: data as a product, domain ownership, federated governance, and self-service infrastructure. This strategy is a formal and well-supported approach for providing reliable, trustworthy, and effective access to data across an organization.

Real-time data analysis, also known as streaming analytics, refers to the processing of streams of data in real time to extract important business results. This processing is usually implemented with a dedicated stream processing technology, such as Kafka Streams or ksqlDB.

An enterprise service bus (ESB) is a software platform that routes and distributes data among connected applications, without requiring sending applications to know the identity or destination of receiving applications. ESBs differ from Kafka in several key ways. ESBs typically have specialized routing logic, whereas Kafka leaves routing to the application management side. ESBs also do not generally support durable and replayable events, a key feature for services using Kafka.

This blog discusses in more detail the similarities and differences between Kafka and ESBs.

Learn more with these free training courses

Apache Kafka 101

Learn how Kafka works, how to use it, and how to get started.

Spring Framework and Apache Kafka®

This hands-on course will show you how to build event-driven applications with Spring Boot and Kafka Streams.

Building Data Pipelines with Apache Kafka® and Confluent

Build a scalable, streaming data pipeline in under 20 minutes using Kafka and Confluent.

Confluent Cloud is a fully managed Apache Kafka service available on all three major clouds. Try it for free today.
