course: Apache Flink® 101

Failure Recovery (Docker)

10 min
David Anderson

Principal Software Practice Lead

Experiencing Failure Recovery (Docker)

Overview

In this exercise you will explore checkpointing and observe how Flink uses checkpoints to recover from failures. This exercise uses the same Docker-based setup as some of the other exercises.

Note: there is no version of this exercise for Confluent Cloud. That's because Confluent Cloud for Apache Flink is a managed service that abstracts away this level of detail.

Setup

Begin by enabling checkpointing for the jobs created by the SQL Client. You can do this by setting the checkpointing interval, which is specified in milliseconds, so the setting below produces a checkpoint every second:

set 'execution.checkpointing.interval' = '1000';
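
The checkpointing mode defaults to exactly-once, which is what this exercise relies on. If you want to be explicit about it (this is optional, since it only restates the default), you can set it as well:

set 'execution.checkpointing.mode' = 'EXACTLY_ONCE';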

You should also switch the result mode to changelog, which displays every change to the result as it is produced and makes it easier to follow along:

set 'sql-client.execution.result-mode' = 'changelog';

This table, which uses the faker connector to generate one row per second, will work nicely for this exercise:

CREATE TABLE `one_per_second` (
  `ts` TIMESTAMP(3)
)
WITH (
  'connector' = 'faker',
  'rows-per-second' = '1',
  'fields.ts.expression' =  '#{date.past ''1'',''SECONDS''}'
);
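
If you want to sanity-check the table before counting anything, you can watch the generated rows go by (press Q to leave the result view when you're done):

SELECT * FROM one_per_second;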

Start a query that counts the rows as they arrive:

SELECT COUNT(*) FROM one_per_second;
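
With the changelog result mode, you will see the running count being retracted and updated once per second. The exact layout depends on your Flink version, but the output should look roughly like this:

 op                      EXPR$0
 +I                           1
 -U                           1
 +U                           2
 -U                           2
 +U                           3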

Failure and recovery

Now open a second terminal window in the directory where you have the learn-apache-flink-101-exercises repo, and simulate a failure by killing Flink's task manager:

docker compose kill taskmanager
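
If you want to confirm that the task manager container is really gone, docker compose ps will show you the status of the containers.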

At this point the query running in the SQL Client will stall. That's fine; just leave it in that stuck state for now. If you examine the job in the Flink UI (at http://localhost:8081) you'll see that the job manager is trying to restart the job, but lacks the resources to do so.

If this were a production Flink deployment and a task manager failed, the cluster management framework (e.g., Kubernetes) would automatically start a replacement task manager, and the job would recover without any manual intervention. In this Docker setup, however, you'll need to start a new task manager yourself when you are ready for the query to resume. Before doing so, arrange your windows so you can watch what happens in the SQL Client once the job is able to resume the stalled query.

Now start a new task manager to replace the one you killed earlier:

docker compose up --no-deps -d taskmanager

You will then see that the SELECT COUNT(*) FROM one_per_second query resumes, just as though nothing had gone wrong.

In this scenario, where the results are displayed in the SQL Client, enabling checkpointing was enough to get Flink SQL to run this query with exactly-once semantics.

If you want to achieve the same consistency guarantee with a Kafka sink, more configuration is required. The documentation on consistency guarantees for the Kafka sink has more information, and there's a complete example in the Flink Cookbook.
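
To give a sense of what that extra configuration involves, here is a rough sketch of a Kafka sink table with exactly-once delivery enabled. The topic name, schema, bootstrap servers, format, and transactional-id prefix below are placeholders for illustration, and are not part of this exercise:

CREATE TABLE `events_sink` (
  `ts` TIMESTAMP(3)
)
WITH (
  'connector' = 'kafka',
  'topic' = 'events',                                -- placeholder topic
  'properties.bootstrap.servers' = 'broker:9092',    -- placeholder broker address
  'format' = 'json',
  'sink.delivery-guarantee' = 'exactly-once',
  'sink.transactional-id-prefix' = 'flink-101-demo', -- required for exactly-once
  'properties.transaction.timeout.ms' = '900000'
);

With exactly-once delivery, the sink writes to Kafka using transactions that are committed when checkpoints complete, so downstream consumers also need to read with isolation.level set to read_committed in order to see only committed results.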

Resources

Do you have questions or comments? Join us in the #confluent-developer community Slack channel to engage in discussions with the creators of this content.
