Many use cases allow for Events to be removed from an Event Stream in order to preserve space and prevent the reading of stale data.
How can we remove events from an Event Stream based on a criteria, such as event age or Event Stream size?
The solution for limited retention will depend on the Event Streaming Platform. Most platforms will allow for deletion of events using an API or with an automated process configured within the platform itself. Because Event Streams are modeled as immutable event logs in the Event Store, events will be removed from the beginning of an Event Stream moving forward (i.e., oldest events are being removed first). Event Processing Applications that are reading from the stream will not be given the deleted events.
Apache Kafka® implements a Limited Retention Event Stream by default. With Kafka, Event Streams are modeled as Topics. Kafka provides two types of retention policy, which can be configured on a per-topic basis or as a default for new topics.
With time-based retention, events will be removed from the topic after the event timestamp indicates an event is older than the configured log retention time. On Kafka this is configured with the
log.retention.hours setting, which can be set as a default to apply to all topics or on a per-topic basis. Additionally, Kafka respects a
log.retention.ms settings to define shorter retention periods.
The following example sets the retention period of a topic to one year:
For more guidance and configuring time-based data retention, see the Kafka Broker Configurations documentation for default settings, or the Modifying Topics section for modifying existing topics.
With size-based retention, events will begin to be removed from the topic once the total size of the topic violates the configured maximum size. Kafka supports a
log.retention.bytes configure. For example, to configure the maximum size of a topic to 100GB you could set the configuration as follows:
For more guidance and configuring size-based data retention, see the Kafka Broker Configurations documentation for default settings, or the Modifying Topics section for modifying existing topics.
For either method of configuring retention, Kafka does not immediately remove events one by one when they violate the configured retention settings. To understand how they are removed, we first need to explain that Kafka topics are further broken down into topic partitions (see Partitioned Placement). Partitions themselves are further divided into files called segments. Segments represent a sequence of the events in a particular partition, and these segments are what is being removed once a violation of the retention policy has occurred. We can further fine-tune the removal algorithm with additional settings such as
log.retention.check.interval.ms and segment configuration, such as