
Tim Berglund

VP Developer Relations


Gilles Philippart

Software Practice Lead

Kafka Topics

In Apache Kafka®, topics are the core abstraction for storing and processing data. Unlike a traditional database, where data is stored in tables and rows are updated in place as new events arrive, Kafka uses logs to record every event in sequence. Instead of replacing old values, each new event is simply appended to the log, preserving the entire history. For example, if you’re monitoring a smart thermostat, every temperature reading is added as a new event in the log, allowing you to track changes over time.

Each event in a topic is called a message, and it consists of three main parts (see the producer sketch after this list):

  • A key, which identifies the source of the event (like a thermostat ID or user ID).
  • A value, which contains the actual data (like the temperature reading or the click action).
  • A timestamp, which marks the exact time the event was produced.
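To make that concrete, here is a minimal Java producer sketch that sends one such message. It assumes a broker reachable at localhost:9092; the topic name and the JSON payload are only illustrative. The key is the thermostat ID, and the timestamp is assigned automatically because we don't set one.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ThermostatProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key = the thermostat ID, value = the reading; the timestamp is
                // filled in automatically because we don't provide one here.
                producer.send(new ProducerRecord<>(
                        "thermostat_readings", "42", "{\"sensor_id\": 42, \"temperature\": 22}"));
            }
        }
    }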

Messages in a topic are immutable—once written, they cannot be changed. This guarantees a reliable and consistent history of events. If you want to transform or filter data, you don’t modify the original topic; instead, you create a new topic that reflects those changes. This keeps Kafka’s logs clean and traceable.

It’s important to understand that topics are not queues. In a queue, when you read a message, it’s gone—nobody else can read it. But in a Kafka topic, messages remain available for multiple consumers to read independently. This allows for replayability and fault tolerance, as data can be processed multiple times without being lost.
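As a rough sketch of that independence, the consumer below reads the topic without removing anything from it; the broker address, topic name, and group id are placeholders.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReadingsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "alerting-service");   // each group gets its own view of the log
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("thermostat_readings"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("key=%s value=%s offset=%d%n", r.key(), r.value(), r.offset());
                    }
                }
            }
        }
    }

Reading does not consume anything: start a second copy of this program with a different group.id and it receives the full stream as well.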

Kafka also supports log compaction. If you only care about the latest state of each key (e.g., the most recent temperature reading from each thermostat), you can enable compaction to remove old versions, keeping your storage lean without losing important information.

Finally, you can configure retention policies to manage storage efficiently, choosing how long Kafka should keep data—whether it’s forever, for a set number of days, or until it’s compacted.
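Both compaction and retention are per-topic configuration. Here is a sketch of how they might be set programmatically with Kafka's AdminClient; the topic names, partition and replication counts, and the 30-day retention are assumptions for the example.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Compacted topic: keep only the latest reading per sensor key.
                NewTopic latest = new NewTopic("thermostat_latest", 3, (short) 3)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                // Time-based retention: keep the full history for 30 days.
                NewTopic history = new NewTopic("thermostat_readings", 3, (short) 3)
                        .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                                        String.valueOf(30L * 24 * 60 * 60 * 1000)));
                admin.createTopics(List.of(latest, history)).all().get();
            }
        }
    }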

In summary, Kafka topics are powerful, immutable logs that allow you to store, process, and replay streams of events reliably, while keeping full historical context intact.

Do you have questions or comments? Join us in the #confluent-developer community Slack channel to engage in discussions with the creators of this content.


Topics

In our introduction to Kafka, let's look at topics. In a traditional database, you usually store data in tables. For example, if you're managing a network of smart thermostats in homes, a favorite example of mine, you'll probably have a table called thermostat readings to collect the data from those devices. This table has columns for each key piece of data: the identifier for the sensor, the location in the house, the temperature, measured here in Celsius (friends in the United States, this is kind of a comfy indoor temperature), and a column for the timestamp when the reading was taken. When you receive new readings, what do you do? You insert them; you put them on the end.

Now, if a sensor we've seen before reports back, say the kitchen warms up a little bit from 22 to 24 °C, starting to get a little warm, you update the temperature in the field that corresponds to that sensor. We all kind of know how this works. The problem here is that we've lost context. We don't remember what the temperature of the kitchen used to be. This could be a missed opportunity to gather insights, like how quickly the kitchen heats up or what time of day it heats up. And sure, we could come up with database schemas that model this, but classically, tables in relational databases don't do so well with log-type use cases. The typical way of representing things in a table is that the thing has a row, and when the thing in the world changes, you change the row.

Kafka uses a fundamentally different abstraction. Instead of relying on tables, it uses logs. A log is a sequence of data items. We call them messages; sometimes we call them events. When you add to the log, you add to the end, and, importantly, you don't change things you've already put in the log. That's how a log works. In Kafka, these logs are called topics. Topics are where we store messages. And I'll say it again, because this is a defining characteristic of messages in Kafka: they are immutable. Once they're written, you can't change them. You can forget them, but you can't change them once they're in a topic. It's an event; it's happened. You don't get to rewrite history.

Back to our home thermostat application. If we use Kafka instead of a database and put the data in a Kafka topic, now we have a topic called thermostat readings. Each message in this topic represents a sensor reading. I'm showing them here in JSON. They can be in lots of different formats, but Avro isn't very fun to look at on a slide, so let's use JSON; it's a lot easier to see. As the sensors produce new readings, new messages are appended to the end of the topic. And just like in our earlier example, we're gonna get another reading from the kitchen. There we go: we can see it used to be 22, and now it's 24. We've got the whole history of events as this sequence of immutable messages in the topic.

Having just one topic is, of course, just like having one table in a database. I mean, who would do that? You can have thousands of topics in a Kafka cluster, with data in different formats for different purposes, so you can keep your data organized and your system well managed. Now, in that previous example I showed you JSON, which is handy for a slide, but internally Kafka doesn't care at all. It's completely schema-less. Messages are just bytes, so you can put anything in there. On the outside, as we'll see in a later module, you don't want to deal with an array of bytes. You've got lots of schema management tools, and you can use formats like Avro, Protocol Buffers, and JSON Schema.
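One way to picture "the broker only sees bytes" is a producer configured with the byte-array serializers. Everything below (broker address, topic, payload) is an assumption for the sketch; swapping in an Avro or Protobuf serializer would change only the two serializer settings, not the topic itself.

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BytesProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Kafka itself stores and forwards raw bytes; serialization is a client-side choice.
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                byte[] key = "42".getBytes(StandardCharsets.UTF_8);
                byte[] value = "{\"sensor_id\": 42, \"temperature\": 24}".getBytes(StandardCharsets.UTF_8);
                producer.send(new ProducerRecord<>("thermostat_readings", key, value));
            }
        }
    }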

Also remember, these messages are immutable. So what if you've got a topic with things in it that you want to change? Well, what you do, and this is typical of how you deal with immutable data, even in libraries and languages that specialize in immutable data, is make a copy of it. So there's that first topic, the one with the thermostat readings. What if we want to filter it down to only the really hot readings? Well, we've got a hot locations topic, and there's a SQL query that reads the events from the first topic and writes them, through a filter, into the second topic. We'll talk later, in the module on stream processing, about how that SQL works and what kind of tooling is involved, but you need this basic idea in your mind: with immutable data, you can absolutely transform it. You just make new topics to reflect the transformed data.
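The SQL from the slide isn't reproduced here, but the same filter-into-a-new-topic idea can be sketched with the Kafka Streams Java API. The topic names follow the example; the 30 °C threshold and the regex-based value parsing are placeholders, not anything prescribed by the course.

    import java.util.Properties;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class HotLocationsFilter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hot-locations-filter");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> readings = builder.stream("thermostat_readings");
            // The original topic is never modified; hot readings are copied into a new topic.
            readings.filter((sensorId, json) -> extractTemperature(json) >= 30.0)
                    .to("hot_locations");

            new KafkaStreams(builder.build(), props).start();
        }

        // Placeholder parser for the sketch; a real application would use a JSON library or Avro.
        static double extractTemperature(String json) {
            Matcher m = Pattern.compile("\"temperature\"\\s*:\\s*([0-9.]+)").matcher(json);
            return m.find() ? Double.parseDouble(m.group(1)) : Double.NaN;
        }
    }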

Now, a common misunderstanding, or at least misstatement: Kafka does not have queues. A topic is not a queue, it's a log. The difference is that when I read from a queue, I've taken that item out of the queue. Nobody else can read it anymore. It's gone. There's a certain ephemerality to data in a queue. In a log, when I read a message, it's still there. I can read it again. Someone else can read it. It's an ordered sequence of immutable records that I can read as often as I like. We're gonna go into that in some more detail in a future module. There are broad implications to that difference; it's not a trivial point. So every time I hear somebody say "Kafka queue," a little piece of me dies. And I like to think I'm a person who's really full of life, but you know, I can only have that little piece die so often. So please don't say "Kafka queue." These are logs. The distinction matters, and it matters a lot.
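Because reading removes nothing, a consumer can rewind and read the log again. Here is a minimal replay sketch, assuming the same local broker and topic as above; the group id is arbitrary.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayReadings {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "replay-demo");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("thermostat_readings"));
                consumer.poll(Duration.ofMillis(500));           // join the group and get partitions assigned
                consumer.seekToBeginning(consumer.assignment()); // rewind: the messages are all still there
                ConsumerRecords<String, String> history = consumer.poll(Duration.ofSeconds(1));
                System.out.println("Replayed " + history.count() + " messages");
            }
        }
    }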

Let's talk about the structure of the messages we put into a topic. The most important part, really, is the value field. This is the sine qua non; you have to have this. This is the actual "what happened" part of the event. An event is kind of a when and a what, and this is recording the what. Again, I'm showing JSON here because it is ever so much more fun to look at, but it can be Avro, it can be Protobuf, it can be some custom serialization of your own, it can be an integer, it can be a raw string. Really, you're just not limited here. You've got the value.

Then you have the key. Typically the key is an identifier that the payload relates to. Again, in our smart thermostat example, the unique ID of the thermostat, a user ID, an account ID, something like that. It's not mandatory to have a key, but it's typical to use one. There are a lot of great things in Kafka data modeling that are unlocked if you use a key, and you have to choose the key carefully because it's gonna help you distribute data efficiently across the nodes in your cluster. We'll talk about how that works in a little bit, but here the sensor ID is 42, so the key is 42.

A message also carries a timestamp. That represents the instant the producer created the message or when the Kafka broker received it. There are some pieces we don't have on the map yet: when you're producing a message, you're using some kind of client library, and you can actually tell the library, "this is the time of the event." If you don't do that, a time will basically be assigned for you when the message arrives at the cluster. But this really tracks the when of what happened.

So in the thermostat reading message, that read_at field is the natural candidate for us to use for the timestamp. There are also headers. This is just a map of key-value pairs; they are untyped strings, and you can use them to add little bits of further context about the data. This is not some covert little sidebar for extra values; headers really should be used as lightweight metadata. Finally, a message knows what topic it belongs to, and it also has an offset. The offset starts at zero with the first message produced to a topic and increases by one for every message. Yes, it really is just that simple.
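Pulling the anatomy together, here is a hedged sketch of a single record with an explicit key, value, event-time timestamp, and a header, plus the topic and offset the broker reports back. The header name, the epoch-millis value, and the topic name are assumptions for the example.

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class MessageAnatomy {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                long readAt = 1700000000000L;  // hypothetical epoch millis taken from the reading's read_at field
                ProducerRecord<String, String> record = new ProducerRecord<>(
                        "thermostat_readings",                        // topic
                        null,                                         // partition: let Kafka decide from the key
                        readAt,                                       // explicit event-time timestamp
                        "42",                                         // key: the sensor ID
                        "{\"sensor_id\": 42, \"temperature\": 24}");  // value: the "what happened"
                record.headers().add("source", "home-gateway".getBytes(StandardCharsets.UTF_8));

                RecordMetadata meta = producer.send(record).get();
                // The broker reports back where the message landed in the log.
                System.out.printf("topic=%s offset=%d%n", meta.topic(), meta.offset());
            }
        }
    }

All of these fields are visible again on the consumer side through ConsumerRecord, which is how downstream applications recover the full context of each event.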
