Partitioned Parallelism

If service goals mandate high throughput, it is useful to be able to distribute event storage, as well as event production and consumption, for parallel processing. Distributing and concurrently processing events enables an application to scale.

Problem

How can I allocate events across Event Streams and Tables so that they can be concurrently processed by distributed Event Processors?

Solution

Use a partitioned event stream, and then assign the events to different partitions of the stream. Essentially, a partition is a unit of parallelism for storing, reading, writing, and processing events. Partitioning enables concurrency and scalability in two main ways:

Platform scalability: different Event Brokers can concurrently store and serve Events to Event Processing Applications
Application scalability: different Event Processing Applications can process Events concurrently
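As a rough illustration of application scalability, the sketch below spreads the partitions of one stream across a group of processor instances in round-robin fashion. The function name assign_partitions and the instance names are hypothetical; Kafka consumer groups use pluggable assignment strategies with their own rules, so treat this as a simplified model rather than the actual protocol.

```python
def assign_partitions(num_partitions, instances):
    """Spread partition ids 0..num_partitions-1 over instances round-robin."""
    assignment = {instance: [] for instance in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three (hypothetical) application instances.
assignment = assign_partitions(6, ["app-1", "app-2", "app-3"])
for instance, partitions in assignment.items():
    print(instance, partitions)
```

Each instance owns a disjoint set of partitions, so the instances can process their events concurrently without coordinating on individual events.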

Event partitioning also impacts application semantics: placing events into a given partition guarantees that the ordering of events is preserved per partition (but typically not across different partitions of the same stream). This ordering guarantee is crucial for many use cases; very often, the sequencing of events is important (for example, when processing retail orders, an order must be paid before it can be shipped).
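The per-partition ordering guarantee can be sketched with a small in-memory model. This is a simplified illustration, not a Kafka client: partitions are plain lists, the order IDs are made up, and the CRC32 hash is a stand-in chosen for the example.

```python
import zlib

NUM_PARTITIONS = 3


def partition_for(key):
    # Stand-in hash (CRC32), assumed for illustration only.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Events for two orders arrive interleaved.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "paid"),
    ("order-1", "shipped"),
]

# Each partition is an append-only list, so order within a partition is kept.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, status in events:
    partitions[partition_for(key)].append((key, status))


def history(key):
    """Read back the statuses of one order from its partition."""
    return [status for k, status in partitions[partition_for(key)] if k == key]


print(history("order-1"))  # ['created', 'paid', 'shipped']
```

Because all events for an order share a key, they land in the same partition and are read back in the order they were written. Nothing in this model relates the positions of events that sit in different partitions.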

Implementation

With Apache Kafka®, streams (called topics) are created either by an administrator or by a streaming application. The number of partitions is specified at the time the topic is created. For example:

confluent kafka topic create myTopic --partitions 30

Events are placed into a specific partition according to the partitioning algorithm of the Event Source, such as an Event Processing Application. All events assigned to a given partition have strong ordering guarantees.

The common partitioning schemes are:

Partitioning based on the event key (such as the customer ID for a stream of customer payments), where events with the same key are stored in the same partition
Round-robin partitioning, which provides an even distribution of events per partition
Custom partitioning algorithms, tailored to specific use cases
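The three schemes above can be sketched as plain functions. All names here are hypothetical, and the CRC32 hash is a stand-in: Kafka's Java client hashes the serialized key with murmur2, so real partition numbers will differ. The custom scheme (reserving one partition for keys with an "eu-" prefix) is an invented example of a use-case-specific rule.

```python
import itertools
import zlib


def key_partitioner(key, num_partitions):
    """Key-based: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def make_round_robin_partitioner(num_partitions):
    """Round-robin: return a function that cycles through the partitions."""
    counter = itertools.cycle(range(num_partitions))
    return lambda: next(counter)


def region_partitioner(key, num_partitions):
    """Custom (hypothetical): partition 0 is reserved for 'eu-' keys.

    All other keys are hashed over partitions 1..num_partitions-1,
    so this needs at least two partitions.
    """
    if key.startswith("eu-"):
        return 0
    return 1 + zlib.crc32(key.encode("utf-8")) % (num_partitions - 1)


next_partition = make_round_robin_partitioner(4)
round_robin = [next_partition() for _ in range(8)]
print(round_robin)  # [0, 1, 2, 3, 0, 1, 2, 3]
print(key_partitioner("customer-42", 4) == key_partitioner("customer-42", 4))  # True
```

Key-based partitioning trades even distribution for per-key ordering, while round-robin does the opposite: it balances load but scatters events with the same key across partitions.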

In a Kafka-based technology, such as a Kafka Streams application or Apache Flink® using one of its Kafka connectors, the processors can scale by working on a set of partitions concurrently and in a distributed manner. If processing changes the key of an event stream (for example, to join two streams of events in Kafka Streams), the keys are recalculated and each event is written to the partition of a new topic that matches its new key, so that the computation can be performed there. (This internal operation is often called distributed data shuffling.)
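The shuffle described above can be illustrated with an in-memory model. This is only a sketch of the idea, not how Kafka Streams implements it: topics are lists of lists, the hash is a CRC32 stand-in, and the topic names, order IDs, and customer names are invented.

```python
import zlib

NUM_PARTITIONS = 3


def partition_for(key):
    # Stand-in hash (CRC32), assumed for illustration only.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def write(topic, key, value):
    """Append an event to the partition chosen by its key."""
    topic[partition_for(key)].append((key, value))


# Source topic: orders keyed by order ID.
orders = [[] for _ in range(NUM_PARTITIONS)]
for order_id, customer in [("o1", "alice"), ("o2", "bob"), ("o3", "alice"), ("o4", "bob")]:
    write(orders, order_id, {"order": order_id, "customer": customer})

# Shuffle: re-key every event by customer and write it to a new topic,
# so that all events for one customer end up in the same partition.
orders_by_customer = [[] for _ in range(NUM_PARTITIONS)]
for partition in orders:
    for _, value in partition:
        write(orders_by_customer, value["customer"], value)

alice_partition = orders_by_customer[partition_for("alice")]
alice_orders = sorted(v["order"] for k, v in alice_partition if k == "alice")
print(alice_orders)  # ['o1', 'o3']
```

After the shuffle, a processor that owns the partition for a given customer sees all of that customer's orders, which is what a per-customer join or aggregation needs.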

Considerations

In general, more stream partitions allow higher throughput, because more Event Brokers and Event Processor instances can work in parallel. To maximize throughput, you need enough partitions to utilize all distributed instances of an Event Processor (for example, Kafka Streams application instances). Choose the partition count carefully based on the throughput of Event Sources (such as Kafka producers, including connectors), Event Processors (such as Kafka Streams or Flink applications), and Event Sinks (such as Kafka consumers, including connectors), and benchmark performance in your target environment.

Plan the design of data patterns and key assignments so that events are distributed as evenly as possible across the stream partitions. This prevents some partitions from being overloaded relative to others. See the blog post Streams and Tables in Apache Kafka: Elasticity, Fault Tolerance, and Other Advanced Concepts to learn more about partitions and dealing with partition skew.
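As a starting point for sizing, a common rule of thumb is to measure the throughput a single partition sustains on the producing side (p) and on the consuming side (c), and then provision at least max(t/p, t/c) partitions for a target throughput t. The sketch below applies that rule and also computes a simple skew indicator. The function names and all numbers are hypothetical; actual per-partition throughput has to come from your own benchmarks.

```python
import math
from collections import Counter


def min_partitions(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Lower bound on partitions: max(t/p, t/c), rounded up."""
    return math.ceil(max(
        target_mb_s / producer_mb_s_per_partition,
        target_mb_s / consumer_mb_s_per_partition,
    ))


def skew_ratio(partition_ids, num_partitions):
    """Events in the busiest partition divided by the average per partition.

    1.0 means a perfectly even spread. Assumes partition_ids is non-empty.
    """
    counts = Counter(partition_ids)
    busiest = max(counts.get(p, 0) for p in range(num_partitions))
    average = len(partition_ids) / num_partitions
    return busiest / average


# Hypothetical measurements: 200 MB/s target, 20 MB/s per partition when
# producing, 10 MB/s per partition when consuming.
print(min_partitions(200, 20, 10))  # 20

# Hypothetical sample of partition ids observed for six events.
print(skew_ratio([0, 0, 0, 0, 1, 2], 3))  # 2.0
```

A skew ratio well above 1.0 suggests that a few keys dominate the stream, in which case adding partitions alone may not help and the key design is worth revisiting.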

References

The blog post How to choose the number of topics/partitions in a Kafka cluster? provides helpful guidance for selecting partition counts for your topics.
For a processing parallelism approach that subdivides the unit of work from a partition down to an event or event key, see the Confluent Parallel Consumer for Kafka.
