
Scaling and Fault Tolerance

Michael Drogalis

Principal Product Manager (Presenter)


So far in this course, you’ve considered a single server and how its work is recovered when a new node takes its place. The next scenario to think about is a cluster with multiple nodes: how does the cluster scale compute while remaining fault tolerant?

Stateless Scaling

When you have a single server in a stateless scenario, it is assigned all of the partitions of the streams or tables it’s reading from. In the image below, a single server is reading from eight different partitions and working hard to process all of the data that is available.
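To make that concrete, here is a minimal sketch of a stateless persistent query, written in ksqlDB SQL. The stream, topic, and column names are hypothetical, not taken from this course; the point is that every row is processed independently, so the work can be divided across as many servers as there are partitions.

    -- Hypothetical source stream over an eight-partition topic (names are illustrative).
    CREATE STREAM readings (
      sensor_id VARCHAR KEY,
      reading DOUBLE
    ) WITH (
      KAFKA_TOPIC = 'readings',
      PARTITIONS = 8,
      VALUE_FORMAT = 'JSON'
    );

    -- A stateless persistent query: each row is filtered on its own,
    -- so its partitions can be spread across up to eight servers.
    CREATE STREAM high_readings AS
      SELECT sensor_id, reading
      FROM readings
      WHERE reading > 41
      EMIT CHANGES;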

Just as ksqlDB inherits the guarantees of the Apache Kafka® consumer protocol, when you add servers to a cluster it also inherits the benefits of the Kafka consumer group protocol. If you add a second server to the cluster above, the work for all of the partitions is scaled out across the two nodes, (a) and (b).

The work rebalances for you automatically, and effectively, ksqlDB processes the data twice as fast as it did with one server.

The same holds true when you add a much larger number of servers, for example eight: data is still processed with the maximum amount of parallelism, because there are eight partitions. Notice how much faster the data is processed in the eight-server image below compared to the single-server image above.

As in the prior scenarios, ksqlDB takes care of assigning the right processing work to the right servers at the right time, and it happens automatically.

Stateful Scaling

Stateful scaling in ksqlDB works similarly to stateless scaling, except that state now needs to be sharded across the servers. You learned about data locality in How Stateful Operations Work, and stateful scaling is a prime example of a server needing local access to the data that it is going to process. You can shard state by key:

There are four nodes, each of which runs a consumer (a-d), and the data is partitioned by sensors 1-8. ksqlDB reshuffles the data to make sure that rows with the same key always go to the same partition, and thus to the same server in the cluster. As in the materialized view scenario, the backing data for each piece of state is written to a changelog, so if a node in the cluster is lost, a new node can step in and read the changelog to recover. For example, if server b went down, its partitions might be assigned to server a and server c, each of which would read from the changelog to repopulate its share of the state.
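A minimal sketch of such a stateful query, reusing the hypothetical readings stream from the earlier sketch: the GROUP BY clause is what causes ksqlDB to repartition the data by key, so every row for a given sensor lands on the same server, and the table’s state is backed by a changelog topic that a replacement server can replay after a failure.

    -- Hypothetical materialized view, sharded by key (sensor_id).
    -- GROUP BY repartitions the rows so that all data for a given key
    -- is processed, and its state stored, on the same server; the state
    -- is also written to a changelog topic for recovery.
    CREATE TABLE avg_readings AS
      SELECT sensor_id,
             AVG(reading) AS avg_reading
      FROM readings
      GROUP BY sensor_id
      EMIT CHANGES;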

ksqlDB effectively combines the client protocol, the consumer group protocol, and the changelog design to scale state in a fault-tolerant way.


Scaling and Fault Tolerance

Hi, I'm Michael Drogalis with Confluent. In this module, we're going to be looking at how scaling and fault tolerance work with ksqlDB. What we've looked at so far is really just having one server and how it's able to recover when a new node takes its place. But what happens when you have more than one server in your cluster? How does compute scale? How does fault tolerance work? Let's look at all of that now.

We'll start by examining how scaling works in ksqlDB in the most basic sense. I've omitted the query here; it can be any stateless query that you like. When you have one server, it will be assigned all the partitions for the streams or tables that it's reading from. In this case, it's reading from eight different partitions. This server is doing all of the work, processing all of the data that's available.

In earlier modules we learned how ksqlDB inherits all of the properties of the Kafka consumer protocol. That means that when you add servers, it also inherits the properties of the consumer group protocol. Here, I add another server to the cluster, depicted with labels A and B. They're running the same persistent query, but now it's scaled out: the work for all the partitions is shared across the two nodes. The work rebalances for you automatically, so some partitions are assigned to server A and some to server B, and the assignment will be evenly distributed. Now ksqlDB is processing the data twice as fast.

The same will hold true if you add eight servers: ksqlDB processes the data with the maximum amount of parallelism. Notice how much faster this animation is playing than the others. It's not sped up; it's actually running with 8X parallelism, so it can process the data faster. ksqlDB will rebalance and reassign the right processing work to the right servers at the right time. When you scale up, this happens dynamically, safely, and automatically.

What we just saw was how scaling works in a stateless context. But what happens when you're doing a stateful operation, like materializing a table? ksqlDB does the same thing, except that the state needs to be sharded across each of the servers. Remember how we talked about data locality? This is a prime example of how a server needs to have the data it's going to process locally available. In this case, the state is sharded by key. So if I have servers A, B, C, and D, and sensors one through eight, ksqlDB will reshuffle the data to make sure that rows with the same key always go to the same partition, and therefore to the same server in the cluster.

Here again, the backing data for each piece of state is written out to a changelog. So if a node in the cluster is lost, a new node can step in and read the changelog to recover. Imagine we lost server B: its partitions might be reassigned to servers A and C, and the changelogs would be used to repopulate those tables. ksqlDB combines its usage of the client protocol, the consumer group protocol, and the changelog design to scale state in a fault-tolerant way. That's how scalability and fault tolerance work in ksqlDB.