course: Apache Flink® 101

Experiencing Failure Recovery (Exercise)

David Anderson

Software Practice Lead

Overview

In this exercise you will explore checkpointing and observe how Flink uses checkpoints to recover from failures.

Setup

Begin by enabling checkpointing for the jobs created by the SQL Client. You can do this by setting the checkpointing interval in the SQL Client. A value given without a unit is interpreted as milliseconds, so the following takes a checkpoint every second:

set 'execution.checkpointing.interval' = '1000';

It will be easier to see what's going on if you alter the pageviews table so that events arrive more slowly:

ALTER TABLE `pageviews` SET ('rows-per-second' = '1');

You should also change the way that results are displayed, which will make it easier to follow along:

set 'sql-client.execution.result-mode' = 'changelog';

Start a query

select count(*) from pageviews;

Failure and recovery

Now open a second terminal window in the directory where you have the learn-apache-flink-101-exercises repo, and simulate a failure by killing Flink's task manager:

docker compose kill taskmanager

At this point the query that is running inside the SQL console will stall. That's fine; just leave it in that stuck state for now. If you now examine the job in the Flink UI (at http://localhost:8081), you'll see that the job manager is trying to restart the job, but lacks the resources to do so.
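If you prefer the command line to the web UI, the same information is available from Flink's REST API, which is served on the same port as the UI. The sketch below is one possible way to pull the job state out of the /jobs/overview response. The helper name extract_states is made up for this example, and the pattern assumes each job reports its status as an upper-case string in a "state" field.

```shell
# Hypothetical helper (not part of Flink or the exercise repo):
# reads a Flink REST JSON response on stdin and prints each "state" value,
# one per line.
extract_states() {
  grep -o '"state" *: *"[A-Z_]*"' | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Usage, assuming the cluster from this exercise is reachable on localhost:8081:
#   curl -s http://localhost:8081/jobs/overview | extract_states
```

Running the usage line a few times while the task manager is down lets you watch the job's state without switching to the browser.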

If this were a production Flink deployment and a task manager had failed, the cluster management framework (e.g., Kubernetes) would automatically start a new task manager, and the job would recover without any manual intervention. In this exercise, however, you'll need to start a new task manager manually when you are ready for the query to resume. Before doing so, arrange your windows so you can see what happens in the SQL console once the job is able to resume the stalled query.

Now start a new task manager to replace the one you killed earlier:

docker compose up --no-deps -d taskmanager

You will then see that the select count(*) from pageviews query resumes, just as though nothing had gone wrong.
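To confirm from the terminal that the replacement task manager has registered with the job manager, you can poll the REST API's /overview endpoint. This is a rough sketch rather than part of the exercise: the function names taskmanager_count and wait_for_taskmanager are invented here, the 30-try limit is arbitrary, and the parsing assumes the response contains a numeric "taskmanagers" field.

```shell
# Hypothetical helper: print the "taskmanagers" count from a /overview
# response read on stdin.
taskmanager_count() {
  grep -o '"taskmanagers" *: *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Poll once per second until at least one task manager has registered.
# Gives up after 30 tries. The base URL defaults to the one used in this exercise.
wait_for_taskmanager() {
  url="${1:-http://localhost:8081}"
  tries=0
  while [ "$tries" -lt 30 ]; do
    count=$(curl -s "$url/overview" | taskmanager_count)
    if [ "${count:-0}" -ge 1 ]; then
      echo "task managers registered: $count"
      return 0
    fi
    tries=$((tries + 1))
    sleep 1
  done
  echo "no task manager registered after 30 tries" >&2
  return 1
}
```

Calling wait_for_taskmanager right after the docker compose up command returns as soon as the new task manager shows up, at which point the stalled query should be able to resume.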

In this scenario, where the results are being displayed in the SQL Client, it was enough to enable checkpointing to get Flink SQL to run this query with exactly-once semantics.

If you want to achieve the same consistency guarantee with a Kafka sink, more configuration is required. See the documentation on consistency guarantees for the Kafka sink for more information.
