The most frequently asked questions about Apache Kafka, answered by Kafka experts.
Kafka runs as a cluster of one or more servers, known as brokers.
Brokers store event data on disk and provide redundancy through replication. Kafka scales horizontally, which gives resilience along with performance benefits, and allows clients to process data in parallel.
Clients interact with the cluster using APIs defined by Kafka's binary protocol, communicated over TCP. Clients produce or consume event data, typically using language-specific client libraries.
Client libraries are available in many languages and frameworks, including Java, C/C++, C#, Python, Go, Node.js, and Spring Boot. There is also a REST API for interacting with Kafka.
Kafka runs in a cluster of nodes, called brokers. Brokers are designed to be simple: they do not process data themselves, and they view event data only as opaque arrays of bytes. A broker's main job is to store, replicate, and serve data to clients.
Brokers replicate data to other brokers in the cluster for resilience and availability.
A producer application connects to a set of initial brokers in the cluster, and requests metadata information about the cluster. Using this information, the producer determines which broker to send events to, based on a partitioning strategy.
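As a minimal sketch of this flow, the Java producer below connects to an assumed broker at localhost:9092 and sends a keyed event to a hypothetical "orders" topic. With a non-null key, the default partitioner chooses the partition by hashing the key:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Initial brokers used to discover the rest of the cluster (assumed address)
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("order-123") determines the target partition under the default strategy
            producer.send(new ProducerRecord<>("orders", "order-123", "{\"amount\": 9.99}"));
        } // close() flushes any buffered records before exiting
    }
}
```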
Consumer applications often work together in coordination as a consumer group. One broker in the cluster takes on the additional responsibility of coordinating the applications that are part of a consumer group.
For a full introduction to Kafka design, see this free Kafka 101 course.
Kafka brokers store data on local disk. Broker behavior is controlled by configuration parameters, and the directory that Kafka stores data in is configured by the log.dir value (or log.dirs, which accepts a comma-separated list of directories and takes precedence if set).
Moving "warm" data to other storage types yet keeping it accessible is possible using Tiered Storage, implemented in both Confluent Cloud and Confluent Platform. There are also community efforts underway as part of KIP-405.
You can read more about how Kafka stores data here.
In Kafka, events are produced and consumed in batches. The maximum size for a batch of events is controlled by the message.max.bytes configuration on the broker. This can be overridden on a per-topic basis by setting the max.message.bytes configuration for the topic.
Somewhat confusingly, these two parameters are indeed different on the broker (message.max.bytes) and topic (max.message.bytes)!
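As a sketch of the per-topic override, the Java AdminClient call below raises max.message.bytes for a hypothetical "orders" topic to 2 MiB; the broker address is an assumption for the example:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicMaxMessageBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp raiseLimit = new AlterConfigOp(
                    new ConfigEntry("max.message.bytes", "2097152"), // 2 MiB topic-level override
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(raiseLimit))).all().get();
        }
    }
}
```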
For more details see the configuration documentation. Be sure to check out the free Kafka 101 course on Confluent Developer.
A consumer group in Kafka is a single logical consumer implemented with multiple physical consumers for reasons of throughput and resilience.
When a single consumer cannot keep up with the throughput of messages that Kafka is providing, you can horizontally scale that consumer by creating additional instances of it.
The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are reassigned so that each member receives its proportional share of partitions. This is known as rebalancing the group. Kafka keeps track of the members of a consumer group and allocates data to them.
To utilize all of the consumers in a consumer group, there must be at least as many partitions as consumer group members. If there are fewer partitions than consumers in the group, some consumers will be idle. If there are more partitions than consumers in the group, some consumers will read from more than one partition.
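A minimal Java sketch of a group member: every instance started with the same group.id joins the same consumer group and receives its share of the topic's partitions. The broker address, group id, and topic name here are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "orders-processor");        // all instances sharing this id form one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```

Running a second copy of this program triggers a rebalance, after which each instance reads from roughly half of the topic's partitions.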
You can learn more about consumers in this free Apache Kafka 101 course.
Kafka consumers may be subscribed to a set of topics as part of a consumer group. When consumers join or leave the group, a broker, acting as the coordinator, assigns partitions to the consumers in the group as a way of evenly spreading the partition assignments across all members. This is known as rebalancing the group. You can read more about it in this blog.
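If you want to observe rebalances from a client, you can register a ConsumerRebalanceListener when subscribing. The sketch below just logs each revocation and assignment; the broker address, group id, and topic name are assumptions:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceWatcher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "rebalance-watcher");       // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before partitions are taken away; commit offsets or flush state here
                    System.out.println("Revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called once the coordinator has assigned this member its share
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500)); // rebalance callbacks fire inside poll()
            }
        }
    }
}
```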
The offset of an event is its position within a particular partition of a Kafka topic. In Kafka, events are stored in topics, which are linear, append-only logs; topics are further divided into partitions. Within a partition, offsets increase monotonically and are represented by a 64-bit integer.
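As a small illustration, a consumer can be pointed at an explicit offset within a partition using seek(); the topic name, partition number, and offset here are arbitrary assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("orders", 0); // hypothetical topic, partition 0
            consumer.assign(List.of(tp)); // manual assignment, no consumer group needed
            consumer.seek(tp, 42L);       // re-read the partition starting at offset 42
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```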
For more information on Kafka fundamentals, see this Kafka 101 course.
An in-sync replica (ISR) is a replica of a Kafka topic partition (known as a follower) that has enough of the data from the primary copy of the partition (known as the leader) to be considered in sync.
The degree to which a follower can lag from the leader and still be deemed to be in sync is configured by the broker setting replica.lag.time.max.ms.
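You can inspect a partition's leader and current ISR with the Java AdminClient; in this sketch the broker address and topic name are assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                                         .all().get().get("orders");
            // For each partition, print the leader and the replicas currently in sync with it
            desc.partitions().forEach(p ->
                System.out.printf("partition=%d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr()));
        }
    }
}
```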
For more details see this deep-dive blog and the documentation.
Lag in Kafka can refer to two things:
1. How far behind a consumer is in reading the available messages. Depending on the consumer's purpose, it may need to read and process messages at a greater rate; a consumer group provides stateless horizontal scalability (a lag-computation sketch follows this list).
2. How far behind a follower partition is in replicating the messages from the leader partition. If a follower falls too far behind, it becomes an out-of-sync replica.
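Referring back to item 1, consumer lag for a partition can be computed as the difference between the partition's end offset and the group's committed offset. A minimal sketch with the Java AdminClient, where the group id and broker address are assumptions:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for a hypothetical consumer group
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("orders-processor")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offset for each of those partitions
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus the group's committed offset
            committed.forEach((tp, om) ->
                System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - om.offset()));
        }
    }
}
```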