If service goals mandate high throughput, it is useful to be able to distribute event storage, as well as event production and consumption, for parallel processing. Distributing and concurrently processing events enables an application to scale.
How can I allocate events across Event Streams and Tables so that they can be concurrently processed by distributed Event Processors?
Use a partitioned event stream, and then assign the events to different partitions of the stream. Essentially, a partition is a unit of parallelism for storing, reading, writing, and processing events. Partitioning enables concurrency and scalability in two main ways:

1. Stream partitions allow the events of a stream to be stored and managed across multiple machines, so storage is not limited by the capacity of a single machine.
2. Stream partitions allow events to be produced, processed, and consumed in parallel: each distributed instance of an Event Processor can work concurrently on its own subset of partitions, as sketched in the consumer example below.
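As a minimal sketch of the consumption side, the following Java snippet uses the standard Kafka consumer client (the topic name, group ID, and broker address are illustrative assumptions). Running multiple copies of this program with the same `group.id` causes Kafka to divide the stream's partitions among the instances, so they consume in parallel:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ParallelConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    // All instances sharing this group.id split the topic's partitions among themselves.
    props.put("group.id", "myTopic-processors");
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("myTopic"));
      while (true) {
        // Each instance receives events only from the partitions assigned to it.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("partition=%d offset=%d key=%s%n",
              record.partition(), record.offset(), record.key());
        }
      }
    }
  }
}
```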
Event partitioning also impacts application semantics: placing events into a given partition guarantees that the ordering of events is preserved per partition (but typically not across different partitions of the same stream). This per-partition ordering guarantee is crucial for many use cases, because the sequencing of events often matters: when processing retail orders, for example, an order must be paid before it can be shipped. The sketch below shows one way to achieve this by keying events.
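To make this concrete, here is a hedged sketch in Java (topic name, key, and event payloads are hypothetical). Producing all events for the same order with the order ID as the event key sends them to the same partition, preserving their relative order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      String orderId = "order-123";
      // Same key => same partition => "paid" is stored, and later read, before "shipped".
      producer.send(new ProducerRecord<>("orders", orderId, "paid"));
      producer.send(new ProducerRecord<>("orders", orderId, "shipped"));
      producer.flush();
    }
  }
}
```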
With Apache Kafka®, streams (called topics) are created either by an administrator or by a streaming application. The number of partitions is specified at the time the topic is created. For example:
confluent kafka topic create myTopic --partitions 30
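Topics can also be created programmatically. As an illustrative sketch using Kafka's Java Admin client (the broker address and replication factor are assumptions; choose values appropriate to your cluster), the same topic could be created like this:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");

    try (Admin admin = Admin.create(props)) {
      // 30 partitions, replication factor of 3 (assumed).
      NewTopic topic = new NewTopic("myTopic", 30, (short) 3);
      admin.createTopics(List.of(topic)).all().get(); // block until the topic exists
    }
  }
}
```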
Events are placed into a specific partition according to the partitioning algorithm of the Event Source, such as an Event Processing Application. All events assigned to a given partition have strong ordering guarantees.
The common partitioning schemes are:

1. Partitioning by event key, where all events with the same key (for example, the same customer ID) are hashed to the same partition.
2. Round-robin partitioning, which distributes events evenly across partitions regardless of key, at the cost of per-key ordering.
3. Custom partitioning, where the Event Source implements its own algorithm for assigning events to partitions.

The sketch after this list illustrates the custom case.
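As a hedged sketch of the custom case, a Kafka producer can supply its own `Partitioner` implementation, registered via the producer's `partitioner.class` configuration. The routing rule here, sending a hypothetical "priority" key to partition 0, is purely illustrative:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class PriorityPartitioner implements Partitioner {

  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    if (keyBytes == null) {
      return 0; // no key: simple fallback for this sketch
    }
    // Illustrative rule: route "priority" events to partition 0 ...
    if ("priority".equals(key)) {
      return 0;
    }
    // ... and hash everything else, mirroring the default key-based scheme.
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
  }

  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void close() {}
}
```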
In a Kafka-based technology, such as a Kafka Streams application or Apache Flink® using one of its Kafka connectors, the processors can scale by working on a set of partitions concurrently and in a distributed manner. If an event stream's keys change as a result of processing -- for example, when a Kafka Streams application re-keys a stream in order to join it with another stream -- the keys are recalculated, and the events are sent to a new partition in a new internal topic where the computation is performed. (This internal operation is often called distributed data shuffling.) The sketch below illustrates such a re-keying.
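As a hedged Kafka Streams sketch (topic names, the key-extraction logic, and the value joiner are all hypothetical), `selectKey` changes the key of one stream, which causes Kafka Streams to shuffle its events through an internal repartition topic so that matching keys are co-partitioned before the join executes:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class RekeyAndJoin {
  public static void main(String[] args) {
    // Assumes default key/value serdes are set in the application's config.
    StreamsBuilder builder = new StreamsBuilder();

    KStream<String, String> payments = builder.stream("payments");
    KStream<String, String> shipments = builder.stream("shipments");

    // Re-key payments by an order ID extracted from the value (illustrative).
    // Kafka Streams marks the stream for repartitioning: events are shuffled
    // through an internal topic before the join runs.
    KStream<String, String> paymentsByOrder =
        payments.selectKey((key, value) -> value.split(",")[0]);

    paymentsByOrder.join(
        shipments,
        (payment, shipment) -> payment + " / " + shipment,
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
      .to("orders-joined");

    // builder.build() would then be passed to a KafkaStreams instance to run.
  }
}
```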
In general, a higher number of stream partitions results in higher throughput. To maximize throughput, you need enough partitions to utilize all distributed instances of an Event Processor (for example, Kafka Streams application instances). Be sure to choose the partition count carefully based on the throughput of Event Sources (such as Kafka producers, including connectors), Event Processors (such as Kafka Streams or Flink applications), and Event Sinks (such as Kafka consumers, including connectors), and benchmark performance in your environment. Plan the design of data patterns and key assignments so that events are distributed as evenly as possible across the stream partitions; this prevents some partitions from being overloaded relative to others (partition skew). One way to sanity-check a planned key scheme is sketched below. See the blog post Streams and Tables in Apache Kafka: Elasticity, Fault Tolerance, and Other Advanced Concepts to learn more about partitions and dealing with partition skew.
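As a final hedged sketch (the sample keys and partition count are made up), the default partitioner's key-to-partition mapping can be simulated offline to check whether a planned key scheme distributes events evenly across partitions:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.kafka.common.utils.Utils;

public class PartitionSkewCheck {
  public static void main(String[] args) {
    int numPartitions = 30; // must match the topic's actual partition count
    // In practice, feed in a representative sample of real keys.
    List<String> sampleKeys = List.of("customer-1", "customer-2", "customer-3",
        "customer-4", "customer-5", "customer-6");

    int[] counts = new int[numPartitions];
    for (String key : sampleKeys) {
      // Same hashing as Kafka's default partitioner for keyed events.
      byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
      int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
      counts[partition]++;
    }

    for (int p = 0; p < numPartitions; p++) {
      System.out.printf("partition %2d: %d events%n", p, counts[p]);
    }
  }
}
```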