Pipeline

A single Event Stream or Table can be used by multiple Event Processing Applications, and its Events may go through multiple processing stages along the way (e.g., filters, transformations, joins, aggregations) to implement more complex use cases.

Problem

How can a single processing objective for a set of Event Streams and/or Tables be achieved through a series of independent processing stages?

Solution

Figure: A pipeline (topology) of Event Processors.

We can compose Event Streams and Tables in an Event Streaming Platform via an Event Processing Application to create a pipeline—also called a topology—of Event Processors, which continuously process the events flowing through them. Here, the output of one processor is the input for one or more downstream processors. Pipelines, notably when created for use cases such as Streaming ETL, may include Event Source Connectors and Event Sink Connectors, which continuously import and export data as streams from/to external services and systems, respectively. Connectors are particularly useful for turning data at rest in such systems into data in motion.
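
In Flink SQL specifically, an external system is attached to a pipeline by declaring connector properties in a table's WITH clause. The following is a minimal sketch of a source, assuming a PostgreSQL database at a placeholder address, that uses Flink's JDBC connector to expose a database table (data at rest) as an input to a pipeline:

CREATE TABLE customers_at_rest (
  customer_id INTEGER,
  customer_name STRING,
  address STRING
) WITH (
  'connector' = 'jdbc',  -- Flink's JDBC connector for relational databases
  'url' = 'jdbc:postgresql://db.example.com:5432/crm',  -- placeholder connection URL
  'table-name' = 'customers'  -- source table inside the external database
);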

Taking a step back, we can see that pipelines in an Event Streaming Platform help companies build a "central nervous system" for data in motion.

Implementation

As an example, we can use Apache Flink® SQL to run a stream of events through a series of processing stages, thus creating a Pipeline that continuously processes data in motion.

First, we create an orders table that captures the stream of incoming order events:

CREATE TABLE orders ( 
  customer_id INTEGER NOT NULL,
  order_id STRING NOT NULL,
  item_id STRING,
  price DOUBLE
);

We'll also create a (continuously updated) customers table that will contain the latest profile information about each customer, such as their current home address.

CREATE TABLE customers (
  customer_id INTEGER NOT NULL,
  customer_name STRING,
  address STRING
);
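
To exercise the pipeline, we can feed in a few rows by hand. The values below are purely illustrative; in production, these tables would typically be populated continuously by producing applications or source connectors:

-- Two customer profiles (illustrative sample data).
INSERT INTO customers VALUES
  (1, 'Alice Adams', '123 Main St'),
  (2, 'Bob Brown', '456 Oak Ave');

-- Two orders: order-1 has two line items, order-2 has one.
INSERT INTO orders VALUES
  (1, 'order-1', 'item-a', 9.99),
  (1, 'order-1', 'item-b', 5.00),
  (2, 'order-2', 'item-c', 12.50);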

Next, we create a new orders_enriched table by joining the orders stream with our customers table:

CREATE TABLE orders_enriched AS
  SELECT o.customer_id AS cust_id, o.order_id, o.price, c.customer_name, c.address
  FROM orders o
  LEFT JOIN customers c 
  ON o.customer_id = c.customer_id;
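
With the sample rows above, we can continuously query this intermediate stage; each order line comes back decorated with the matching customer's name and address:

-- Emits enriched order lines, updating as new orders arrive.
SELECT * FROM orders_enriched;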

Next, we create a table in which we add the order total to each order by aggregating the prices of the individual items in the order:

CREATE TABLE orders_with_totals AS
  SELECT cust_id, order_id, sum(price) AS total 
  FROM orders_enriched
  GROUP BY cust_id, order_id;
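
Finally, we can continuously query the end of the pipeline. With the illustrative rows above, order-1 totals 14.99 (9.99 + 5.00) and order-2 totals 12.50:

-- The result updates as further order events flow through the pipeline.
SELECT * FROM orders_with_totals;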

Considerations

  • The same event stream or table can participate in multiple pipelines. Because streams and tables are stored durably, applications have a lot of flexibility in how and when they process the respective data, and they can do so independently of each other.
  • The various processing stages in a pipeline create their own derived streams/tables (such as the orders_enriched table in the Flink SQL example above), which in turn can be used as input for other pipelines and applications, as sketched below. This allows for further and more complex composition and reuse of events throughout an organization.
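
For example, a second, independent pipeline could consume the derived orders_with_totals table to maintain a running revenue figure per customer. This is a minimal sketch of that idea; the revenue_per_customer table is hypothetical, not part of the example above:

-- A separate pipeline stage reusing the derived orders_with_totals table.
CREATE TABLE revenue_per_customer AS
  SELECT cust_id, SUM(total) AS revenue
  FROM orders_with_totals
  GROUP BY cust_id;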

References

This pattern was influenced by Pipes and Filters in Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf. However, it is much more powerful and flexible because it uses Event Streams as the pipes.
