Partitioned Parallelism

If service goals mandate high throughput, it is useful to be able to distribute event storage, as well as event production and consumption, for parallel processing. Distributing and concurrently processing events enables an application to scale.

Problem

How can I allocate events across Event Streams and Tables so that they can be concurrently processed by distributed Event Processors?

Solution

partitioned-parallelism

Use a partitioned event stream, and then assign the events to different partitions of the stream. Essentially, a partition is a unit of parallelism for storing, reading, writing, and processing events. Partitioning enables concurrency and scalability in two main ways:

Event partitioning also impacts application semantics: placing events into a given partition guarantees that the ordering of events is preserved per partition (but typically not across different partitions of the same stream). This ordering guarantee is crucial for many use cases; very often, the sequencing of events is important (for example, when processing retail orders, an order must be paid before it can be shipped).

Implementation

With Apache Kafka®, streams (called topics) are created either by an administrator or by a streaming application such as the streaming database ksqlDB. The number of partitions is specified at the time the topic is created. For example:

ccloud kafka topic create myTopic --partitions 30

Events are placed into a specific partition according to the partitioning algorithm of the Event Source, such as an Event Processing Application. All events assigned to a given partition have strong ordering guarantees.

The common partitioning schemes are:

  1. Partitioning based on the event key (such as the customer ID for a stream of customer payments), where events with the same key are stored in the same partition
  2. Round-robin partitioning, which provides an even distribution of events per partition
  3. Custom partitioning algorithms, tailored to specific use cases

In a Kafka-based technology, such as a Kafka Streams application or ksqlDB, the processors can scale by working on a set of partitions concurrently and in a distributed manner. If an event stream's key content changes because of how the query is processing the rows -- for example, to execute a JOIN operation in ksqlDB between two streams of events -- the underlying keys are recalculated, and the events are sent to a new partition in the new topic to perform the computation. (This internal operation is often called distributed data shuffling.)

CREATE STREAM stream_name
  WITH ([...,]
        PARTITIONS=number_of_partitions)
  AS SELECT select_expr [, ...]
  FROM from_stream
  PARTITION BY new_key_expr [, ...]
  EMIT CHANGES;

Considerations

In general, a higher number of stream partitions results in higher throughput. To maximize throughput, you need enough partitions to utilize all distributed instances of an Event Processor (for example, servers in a ksqlDB cluster). Be sure to choose the partition count carefully based on the throughput of Event Sources (such as Kafka producers, including connectors), Event Processors (such as ksqlDB or Kafka Streams applications), and Event Sinks (such as Kafka consumers, including connectors). Also be sure to benchmark performance in the environment. Plan the design of data patterns and key assignments so that events are distributed as evenly as possible across the stream partitions. This will prevent certain stream partitions from being overloaded relative to other stream partitions. See the blog post Streams and Tables in Apache Kafka: Elasticity, Fault Tolerance, and Other Advanced Concepts to learn more about partitions and dealing with partition skew.

References