
Tim Berglund

VP Developer Relations


Gilles Philippart

Software Practice Lead

Kafka Consumers

In Apache Kafka®, consumers are the client applications responsible for reading data from Kafka topics. While producers write data into topics, consumers retrieve it, process it, and often send it onward for additional processing or storage. Any application that pulls data from Kafka—whether for analytics, database updates, or real-time processing—is a consumer.

To connect to a Kafka cluster, consumers use the KafkaConsumer API. Configuration is similar to producers, requiring properties like:

  • bootstrap.servers: A list of broker addresses to connect to. This doesn’t need to be exhaustive—just enough for the consumer to discover the rest of the cluster.
  • group.id: A unique identifier for the consumer group (more on this in the next section).
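
For orientation, here's a rough sketch of what that configuration might look like when built in code. The broker addresses and group name are placeholder values, and the deserializer entries (required so the consumer knows how to turn stored bytes back into Strings) are shown even though the list above doesn't call them out:

import java.util.Properties;

Properties config = new Properties();
// just enough brokers for the client to discover the rest of the cluster (placeholder addresses)
config.put("bootstrap.servers", "broker1:9092,broker2:9092");
// consumers that share this id divide the topic's partitions among themselves (placeholder name)
config.put("group.id", "thermostat-readers");
// required: how to turn the stored bytes back into Java Strings
config.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
config.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");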

Once configured, the consumer subscribes to one or more topics. You can even use a regular expression to subscribe to multiple topics dynamically. After subscribing, the consumer enters an infinite loop, continuously polling for new messages. This is normal in streaming applications since data is constantly flowing—there is no "end of data."
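
For example, a pattern-based subscription looks like this; this is a small sketch that assumes the consumer object from the example further down and a made-up thermostat_ topic-name prefix:

import java.util.regex.Pattern;

// subscribe to every topic whose name matches the pattern, including topics created later
consumer.subscribe(Pattern.compile("thermostat_.*"));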

When polling, the consumer fetches ConsumerRecords, which contain:

  • Key: An optional identifier for the message, typically used to decide which partition it lands in (keys can repeat and may be null).
  • Value: The data payload.
  • Partition: The partition the message came from.
  • Timestamp: When the event was recorded.
  • Headers: Optional metadata.

Consumers iterate over these records to process each message, often logging, transforming, or sending it to other systems.

Here's an example in Java:


import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties config = loadProperties("kafka.properties"); // helper that reads kafka.properties into a Properties object
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config);
consumer.subscribe(List.of("thermostat_readings"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(10));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Message=" + record.key() + ":" + record.value());
        // further process the record...
    }
}
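
Beyond key and value, each ConsumerRecord carries the metadata listed above. Here's a small sketch of what could sit inside the loop body, assuming you want to log that metadata and that any header values are plain text:

// extra metadata available on every record
System.out.printf("partition=%d offset=%d timestamp=%d%n",
        record.partition(), record.offset(), record.timestamp());
// headers are optional key/byte[] pairs; here we assume the values are UTF-8 text
record.headers().forEach(h -> System.out.println(h.key() + "=" + new String(h.value())));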

Consumer Offsets and Recovery

Kafka tracks the offset of each consumed message. This is called offset tracking and ensures that if a consumer goes offline, it can resume from where it left off. These offsets are committed back to Kafka itself, stored in an internal topic. If a consumer crashes and is restarted, it reads its last committed offset and continues processing without missing data.
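
By default the Java consumer commits offsets automatically on a timer. If you want to commit only after your own processing has succeeded, one common pattern, sketched here with the processing step left as a hypothetical process method, is to disable auto-commit and call commitSync yourself:

// in the configuration: take over commit responsibility from the client
config.put("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical application-specific processing
    }
    // commit only after the whole batch is processed; the offsets land in Kafka's internal topic
    consumer.commitSync();
}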

Consumer Groups and Parallelism

To scale processing, Kafka supports consumer groups. All consumers in the same group share the work of reading from a topic's partitions. Kafka assigns each partition to one consumer in the group—no two consumers in the same group read from the same partition. This enables parallelism:

  • If a topic has three partitions and three consumers in the group, each consumer processes messages from one partition independently.
  • If more consumers are added, Kafka automatically rebalances the partitions across them.
  • If a consumer fails, its partitions are reassigned to the remaining consumers to maintain processing.
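
Scaling out is mostly an operational step: you start another instance of the same application with the same group.id and Kafka does the rest. If you want to watch the rebalancing happen, the subscribe call also accepts a listener; here's a sketch, reusing the topic from the earlier example:

import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(List.of("thermostat_readings"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // called before partitions are taken away from this consumer during a rebalance
        System.out.println("Revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // called once this consumer's new share of partitions is known
        System.out.println("Assigned: " + partitions);
    }
});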

Importantly, Kafka’s log-based design means consuming data does not delete it. Multiple consumers—even in different consumer groups—can read the same messages independently. This allows for separate applications to process the same event streams without conflict.
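
To make that concrete, two consumers configured identically except for group.id will each receive every message on the topic. The group names below are made up, and loadProperties is the same helper used in the example above:

Properties analyticsConfig = loadProperties("kafka.properties");
analyticsConfig.put("group.id", "analytics"); // one application...

Properties auditConfig = loadProperties("kafka.properties");
auditConfig.put("group.id", "audit");         // ...and a completely independent one

// both subscribe to the same topic and each group sees the full stream,
// because reading a record never removes it from the log
KafkaConsumer<String, String> analytics = new KafkaConsumer<>(analyticsConfig);
KafkaConsumer<String, String> audit = new KafkaConsumer<>(auditConfig);
analytics.subscribe(List.of("thermostat_readings"));
audit.subscribe(List.of("thermostat_readings"));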

In summary, consumers are how data is retrieved from Kafka for processing. Through consumer groups, you can scale horizontally, process in parallel, and guarantee fault tolerance—all while preserving message integrity and order within each partition.

Do you have questions or comments? Join us in the #confluent-developer community Slack channel to engage in discussions with the creators of this content.


Consumers

This is the introduction to Apache Kafka about consumers. Now, consumers are the client components outside of the cluster responsible for reading data from Kafka topics and ideally doing something with it. So using the consumer API is very similar in principle to using the producer that we covered in the last module. And again, we'll be doing Java in these examples. In Java, you use a class called KafkaConsumer to connect to the cluster. You give it that same map of key-value pairs, the properties file that's got the configuration. We show here just the bootstrap servers.

In real life, there could be 5 or 10 of these for this, that, and the other thing, some security settings and things. At a minimum though, you at least need to point the consumer to the cluster so it knows what brokers to try to access to be able to read the rest of the metadata about the cluster. As we explained in the section on producers, you don't need to give it the full list of all the brokers. That would probably be inconvenient. It just needs enough that one of those will be up and available. It'll be able to talk to that one and learn what it needs about the whole cluster. And with that, we connect to the Kafka cluster with the KafkaConsumer object.

Once we've got that, we need to give it a list of topics to subscribe to. Well, what if you only wanna subscribe to one topic? What if you don't want a list? That's okay, you just use a list of one thing. That subscribe method needs to take a list and so you'll give it a list of one thing. And for the stout-hearted, that can also be a regular expression that subscribes to all topics that match the RegEx. That's possible to do if you want to do that. I'm not saying you should, I'm just saying you can.

After we've subscribed to one or more topics, we enter an infinite loop. Maybe you're not used to that, and you think that's kind of a bad thing. Here, it's perfectly fine. If you think about it, with event-driven code, this is just what you do forever: you read messages from the topic. There is no last message. The idea of a last thing is kind of batch-oriented thinking. This is streaming data, so there's always more of it. So we call this poll method, which tells the library to go and ask the cluster whether there are any new messages for it to consume.

Now, deep under the hood of the consumer library, that method is aware of which partitions are implicated. So which topics are we subscribed to? What partitions does that topic have? Where are the leaders? What brokers have the leader partition for each of the partitions in the topic? So this really is asking specific computers in the Kafka cluster, specific processes out there, whether they've got any new messages. When they do, we get back a collection called ConsumerRecords, and that is a generic type. In Java, you see the angle brackets and String, String. That means that we're saying the type of the key is a string and the type of the value is a string. That's just like what we produced. You want those types to match what you're producing.

Now, string isn't a terrible type for a key. Often you have primitive types like string or int or long or something like that for the key. The value, usually that's some kind of domain object. You've modeled that in the type system of your language. There is in data terms, a schema there, and that schema matters. We're covering that up for now. We're just trying to get the ideas into our heads and really look at what the code looks like in its simplest form. We're gonna add some schema to that in a little bit.

We get that collection of consumer records and we can iterate over that. It might just be one record. We could get many records back if this is a high-traffic topic, there could be many of them waiting for us. So we iterate over that and you see here what we're most interested in is the key and the value, the message itself. And in this genius little bit of computation, we are logging them to the console. It's okay if you wanna use this code, go ahead, it's free. You could just copy it and use it.

Now there's more in there than just key and value. And you can see that in the docs. You can know things like what partition did this come from? What's the timestamp? All kinds of other little bits of metadata. The headers, if you wanna know the headers, but key and value are usually in first place. Those are the things that you really need to know.

But you know, there's a bigger story in consumers than just the network plumbing. They're a lot more interesting. Let's talk through some of it. As a reminder, Kafka is not a queue, it's a log. A topic is fundamentally a log, which means that the act of consuming doesn't make messages go away. We can have these two independent consumers reading the same message. When Consumer A reads it, Consumer B can still come along and read it. Consumer C, D, E, F, G, H, I, J, L, M, N, O, P, any of those, however many consumers you have, they can all read those same messages. So very, very important.

Some of you might be wondering at this point, what happens if the consumer goes away? Like, okay, everybody can read independently, but what if a consumer dies and then you restore it? How does it know where it was? Well, on a periodic basis, a consumer basically reports back to the cluster the most recent offset it's consumed. This is called consumer offset tracking or consumer offset commits, because the act of the consumer telling the cluster is called a commit.

Now, those offsets are stored in a special purpose internal topic. So Kafka kind of uses Kafka to build this Kafka functionality. It's pretty cool. But that consumer there, as you can see, will read 1, will read 2, and then, oh, 3 didn't work. So it doesn't commit the offset 3 there back to the cluster. And maybe you'll restore it, it'll try again, whatever the recovery is. So when it does properly process number 3, it commits that offset.

This is a little more complicated in the actual details. A consumer doesn't necessarily commit every single offset it consumes; that could be pretty expensive. Commits are usually batched up, and there's some threshold, like a certain number of records or a certain amount of time, after which the consumer commits those offsets back to the cluster. It's all documented and available for your optimization, if that's a thing you need to dive into.
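
(The thresholds being described here are ordinary consumer configuration properties. With the default auto-commit behaviour, for instance, the interval can be tuned like this; the values shown are only illustrative:)

// auto-commit is on by default; committed offsets flow back to the cluster on this interval
config.put("enable.auto.commit", "true");
config.put("auto.commit.interval.ms", "5000"); // 5000 ms is the default, shown here for illustration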

Now, consumers can scale. They form what are called consumer groups. So right here, I've got Consumer A and this three-partition topic. So when Consumer A is reading messages from that topic, they could come from any of the three partitions and it can only really be processing one at a time. Consumers are fundamentally single-threaded things. And of course, hey, I could have Consumer B and what does Consumer B do? Well, Consumer A is reading messages, Consumer B is reading messages and A and B are operating independently, but they're both single-threaded, just one message from one partition at a time.
