Enhance your career, get your certificate as a Data Streaming Engineer | Get your Certificate

Tutorial

How to filter duplicate events per time window from a Kafka topic with ksqlDB

Consider a topic with events that represent clicks on a website. Each event contains an IP address, a URL, and a timestamp. In this tutorial, we'll use ksqlDB to deduplicate these click events.

Setup

Let's start with a stream representing user click events:

CREATE STREAM clicks (ip_address VARCHAR, url VARCHAR)
    WITH (KAFKA_TOPIC='clicks',
          PARTITIONS=1,
          FORMAT='JSON');

Deduplicate events per time window

The first step toward deduplication is to create a table to collect the click events by time window:

CREATE TABLE detected_clicks AS
    SELECT
        ip_address AS key1,
        url AS key2,
        AS_VALUE(ip_address) AS ip_address,
        COUNT(ip_address) as ip_count,
        AS_VALUE(url) AS url,
        FORMAT_timestamp(FROM_UNIXTIME(EARLIEST_BY_OFFSET(ROWTIME)), 'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp
    FROM clicks WINDOW TUMBLING (SIZE 2 MINUTES, RETENTION 1000 DAYS)
    GROUP BY ip_address, url
    EMIT CHANGES;

As we’re grouping by IP address and URL, these columns will become part of the primary key of the table. Primary key columns are stored in the Kafka message’s key. As we’ll need them in the value later, we use AS_VALUE to copy the columns into the value and set their name. To avoid the value column names clashing with the key columns, we add aliases to rename the key columns.

As it stands, the key of the detected_clicks table contains the IP address and URL columns, and as the table is windowed, the window start and end timestamps. Create another stream that will only contain ip_address, ip_count, url, and timestamp from the topic backing the detected_clicks table:

CREATE STREAM raw_values_clicks (ip_address STRING, ip_count BIGINT, url STRING, timestamp STRING)
    WITH (KAFKA_TOPIC='DETECTED_CLICKS',
          PARTITIONS=1,
          FORMAT='JSON');

Now we can filter out duplicates by only retrieving records with a ip_count of 1:

SELECT
    ip_address,
    url,
    timestamp
FROM raw_values_clicks
WHERE ip_count = 1
EMIT CHANGES;

Running the example

Prerequisites

Docker running via Docker Desktop or Docker Engine
Docker Compose. Ensure that the command docker compose version succeeds.

Run the commands

Clone the confluentinc/tutorials GitHub repository (if you haven't already) and navigate to the tutorials directory:

git clone git@github.com:confluentinc/tutorials.git
cd tutorials

Start ksqlDB and Kafka:

docker compose -f ./docker/docker-compose-ksqldb.yml up -d

Next, open the ksqlDB CLI:

docker exec -it ksqldb-cli ksql http://ksqldb-server:8088

Run the following SQL statements to create the clicks stream backed by Kafka running in Docker and populate it with test data.

CREATE STREAM clicks (ip_address VARCHAR, url VARCHAR)
    WITH (KAFKA_TOPIC='clicks',
          PARTITIONS=1,
          FORMAT='JSON');

INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.1', 'https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html');
INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.12', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.13', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');

INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.1', 'https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html');
INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.12', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
INSERT INTO clicks (ip_address, url) VALUES ('10.0.0.13', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');

Next, create a table to collect the click events by time window. Note that we first tell ksqlDB to consume from the beginning of the stream.

SET 'auto.offset.reset'='earliest';

CREATE TABLE detected_clicks AS
    SELECT
        ip_address AS key1,
        url AS key2,
        AS_VALUE(ip_address) AS ip_address,
        COUNT(ip_address) as ip_count,
        AS_VALUE(url) AS url,
        FORMAT_timestamp(FROM_UNIXTIME(EARLIEST_BY_OFFSET(ROWTIME)), 'yyyy-MM-dd HH:mm:ss.SSS') AS timestamp
    FROM clicks WINDOW TUMBLING (SIZE 2 MINUTES, RETENTION 1000 DAYS)
    GROUP BY ip_address, url
    EMIT CHANGES;

Now create another stream that will only contain ip_address, ip_count, url, and timestamp from the topic backing the detected_clicks table:

CREATE STREAM raw_values_clicks (ip_address STRING, ip_count BIGINT, url STRING, timestamp STRING)
    WITH (KAFKA_TOPIC='DETECTED_CLICKS',
          PARTITIONS=1,
          FORMAT='JSON');

Finally, filter out duplicates by only retrieving records with a ip_count of 1:

SELECT
    ip_address,
    url,
    timestamp
FROM raw_values_clicks
WHERE ip_count = 1
EMIT CHANGES;

The query output should look like this:

+-----------+----------------------------------------------------------------------------------------+-------------------------+
|IP_ADDRESS |URL                                                                                     |TIMESTAMP                |
+-----------+----------------------------------------------------------------------------------------+-------------------------+
|10.0.0.1   |https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html|2024-09-24 17:39:48.650  |
|10.0.0.12  |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen                         |2024-09-24 17:39:48.688  |
|10.0.0.13  |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen                         |2024-09-24 17:39:48.724  |
+-----------+----------------------------------------------------------------------------------------+-------------------------+

When you are finished, exit the ksqlDB CLI by entering CTRL-D and clean up the containers used for this tutorial by running:

docker compose -f ./docker/docker-compose-ksqldb.yml down

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.

Apache Iceberg ™

Kafka® 101

Apache Flink® SQL

Apache Flink® Table API: Processing Data Streams in Java

Designing Event-Driven Microservices

Apache Flink® 101

Building Flink® Apps in Java

Kafka® 101

Kafka® Connect 101

Kafka Streams 101

Schema Registry 101

ksqlDB 101

Articles

Patterns

FAQs

NEWBlog

Streamables

Learn More

Language Guides

Tutorials

Demos

Meetups

Community Slack

Community Catalysts

Community Use Cases

Confluent Developer Newsletter

Data Streaming Awards

NEWCurrent 2026

Past Current and Kafka Summit events

How to filter duplicate events per time window from a Kafka topic with ksqlDB

How to filter duplicate events per time window from a Kafka topic with ksqlDB

Setup

Deduplicate events per time window

Running the example

Prerequisites

Run the commands