
How to compute the minimum or maximum value of a field with Kafka Streams

An aggregation in Kafka Streams is a stateful operation used to perform a "clustering" or "grouping" of values with the same key, and it may return a different type than the input value. In this example the input value is a MovieTicketSales object, but the result is a YearlyMovieFigures object used to keep track of the minimum and maximum total ticket sales by release year. You can also use windowing with aggregations to get discrete results per segment of time (a windowed sketch appears at the end of this tutorial).

       builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), movieSalesSerde))
              // Re-key the stream by release year; this changes the key type to Integer
              .groupBy((k, v) -> v.releaseYear(),
                      Grouped.with(Serdes.Integer(), movieSalesSerde))
              // Track the running min and max of total sales for each release year
              .aggregate(() -> new YearlyMovieFigures(0, Integer.MAX_VALUE, Integer.MIN_VALUE),
                      ((key, value, aggregate) ->
                              new YearlyMovieFigures(key,
                                      Math.min(value.totalSales(), aggregate.minTotalSales()),
                                      Math.max(value.totalSales(), aggregate.maxTotalSales()))),
                      Materialized.with(Serdes.Integer(), yearlySalesSerde))
              .toStream()
              .peek((key, value) -> LOG.info("Aggregation min-max results key[{}] value[{}]", key, value))
              .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), yearlySalesSerde));
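
The snippet assumes MovieTicketSales and YearlyMovieFigures are defined elsewhere in the application. Judging by the accessors used above, they could be modeled as Java records along these lines (the title field is purely illustrative; only releaseYear and totalSales appear in the snippet):

       // Hypothetical shapes inferred from the accessors used in the topology
       public record MovieTicketSales(String title, int releaseYear, int totalSales) {}
       public record YearlyMovieFigures(int releaseYear, int minTotalSales, int maxTotalSales) {}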

Let's review the key points in this example.

   .groupBy((k, v) -> v.releaseYear(),

Aggregations must group records by key. Since the records in the source topic don't have keys, the code uses a groupBy operation that re-keys the stream on the releaseYear field of the MovieTicketSales value object.
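
For comparison, here's a minimal sketch of an equivalent formulation that re-keys explicitly with selectKey before grouping (same serde assumptions as above); both versions trigger the repartitioning described below:

       builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), movieSalesSerde))
              // selectKey sets the new key; groupByKey then groups on it
              .selectKey((k, v) -> v.releaseYear())
              .groupByKey(Grouped.with(Serdes.Integer(), movieSalesSerde))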

        .groupBy((k, v) -> v.releaseYear(), Grouped.with(Serdes.Integer(), movieSalesSerde))

Since you've changed the key, under the covers Kafka Streams performs a repartition immediately before it performs the grouping.
Repartitioning is simply producing records to an internal topic and consuming them back into the application; producing the records ensures the updated keys land on the correct partitions. Additionally, since the key type has changed, you need to provide updated Serde objects via the Grouped configuration object so Kafka Streams can (de)serialize the records during the repartitioning.
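
By default, Kafka Streams generates a name for that internal repartition topic. If you'd like a stable, readable name instead, Grouped also accepts one; a minimal sketch, where sales-by-release-year is an illustrative name that Kafka Streams incorporates into the repartition topic name:

        .groupBy((k, v) -> v.releaseYear(),
                Grouped.with("sales-by-release-year", Serdes.Integer(), movieSalesSerde))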

.aggregate(() -> new YearlyMovieFigures(0, Integer.MAX_VALUE, Integer.MIN_VALUE),
                      ((key, value, aggregate) ->
                              new YearlyMovieFigures(key,
                                      Math.min(value.totalSales(), aggregate.minTotalSales()),
                                      Math.max(value.totalSales(), aggregate.maxTotalSales()))),
                      Materialized.with(Serdes.Integer(), yearlySalesSerde))

This aggregation keeps a running minimum and maximum of the total ticket sales per release year. The initializer starts the minimum at Integer.MAX_VALUE and the maximum at Integer.MIN_VALUE so that the first value seen always replaces both. The aggregate operator takes 3 parameters (there are overloads that accept 2 and 4 parameters):

  1. An initializer for the default value; in this case, a new instance of the YearlyMovieFigures object, a Java POJO containing the current min and max sales.
  2. An Aggregator instance which performs the aggregation action. Here the code uses a Java lambda expression instead of a concrete object instance.
  3. A Materialized object describing how the underlying StateStore is materialized (see the sketch after this list).
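
Materialized.with sets only the serdes and leaves the state store with a generated name. If you want to reference the store later, for example with interactive queries, you can name it explicitly. A minimal sketch, where min-max-sales-store is an illustrative name (KeyValueStore and Bytes come from the Kafka Streams state and common utils packages):

       Materialized.<Integer, YearlyMovieFigures, KeyValueStore<Bytes, byte[]>>as("min-max-sales-store")
               .withKeySerde(Serdes.Integer())
               .withValueSerde(yearlySalesSerde)

The tutorial's Materialized.with(keySerde, valueSerde) is the more compact option when you don't need to query the store by name.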

       .toStream()
       .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), yearlySalesSerde));

Aggregations in Kafka Streams return a KTable instance, so it's converted to a KStream with the toStream operator. Then the results are produced to an output topic via the to DSL operator.
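
Finally, as mentioned at the top, you can combine windowing with this aggregation to get discrete min-max results per segment of time. Here's a minimal sketch of a tumbling-window variant, reusing the same serdes and value objects; the one-hour window size is purely illustrative, and the sketch assumes imports for TimeWindows, KeyValue, and java.time.Duration:

       builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), movieSalesSerde))
              .groupBy((k, v) -> v.releaseYear(),
                      Grouped.with(Serdes.Integer(), movieSalesSerde))
              // Aggregate each release year independently within one-hour tumbling windows
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
              .aggregate(() -> new YearlyMovieFigures(0, Integer.MAX_VALUE, Integer.MIN_VALUE),
                      (key, value, aggregate) ->
                              new YearlyMovieFigures(key,
                                      Math.min(value.totalSales(), aggregate.minTotalSales()),
                                      Math.max(value.totalSales(), aggregate.maxTotalSales())),
                      Materialized.with(Serdes.Integer(), yearlySalesSerde))
              .toStream()
              // The key is now Windowed<Integer>; unwrap it before producing downstream
              .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value))
              .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), yearlySalesSerde));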