Course: Apache Iceberg™

Hands-On Exercise: Tagging and Branching


Gilles Philippart

Software Practice Lead

Exercise 5: Tagging and Branching

Overview

So far, you've worked with snapshots by their IDs or timestamps. That works, but try explaining to your CFO that "the quarterly report uses snapshot 7851662191363948742." Not exactly user-friendly.

Tags and branches bring Git-like semantics to Iceberg tables:

  • Tags are named references to specific snapshots. Think of them as bookmarks: "this is the data we used for the Q4 report" or "this is the state before the big migration."

  • Branches are independent lines of development. You can write to a branch without affecting the main table, validate the changes, and then either publish them or throw them away. This is the foundation of Write-Audit-Publish (WAP) workflows.

In this exercise, you'll tag important snapshots for compliance, create a staging branch, load data into it, validate the data, and publish it to production.

Learning Objectives

By the end of this exercise, you will be able to:

  • Create and manage tags on snapshots
  • Query data at a specific tag
  • Create branches for isolated writes
  • Implement a Write-Audit-Publish workflow
  • Fast-forward the main branch to incorporate validated changes

Prerequisites

  • Completed Exercise 4 (or have the environment running)
  • About 25 minutes

Step 1: Set Up

Start the environment if needed:

docker compose up -d

Run Spark SQL:

docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

Create a table representing a financial ledger—something where data quality matters:

CREATE NAMESPACE IF NOT EXISTS demo.finance;
USE demo.finance;

DROP TABLE IF EXISTS transactions;

CREATE TABLE transactions (
    transaction_id BIGINT,
    account_id BIGINT,
    transaction_date DATE,
    amount DECIMAL(15, 2),
    description STRING,
    verified BOOLEAN
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

Load some initial verified transactions:

INSERT INTO transactions VALUES
    (1001, 100, CAST('2025-01-02' AS DATE), 5000.00, 'Initial deposit', true),
    (1002, 100, CAST('2025-01-05' AS DATE), -150.00, 'Office supplies', true),
    (1003, 101, CAST('2025-01-03' AS DATE), 12000.00, 'Initial deposit', true),
    (1004, 101, CAST('2025-01-10' AS DATE), -3200.00, 'Equipment purchase', true),
    (1005, 102, CAST('2025-01-04' AS DATE), 8500.00, 'Initial deposit', true);

Verify the data:

SELECT * FROM transactions ORDER BY transaction_id;

Step 2: Create a Tag for Month-End Close

It's January 31st. The finance team needs to close the books. Let's tag the current state so we can always reproduce January's numbers:

ALTER TABLE transactions CREATE TAG `jan-2025-close`;

Verify the tag was created:

SELECT * FROM demo.finance.transactions.refs;

You should see:

      name       |  type  | snapshot_id | max_reference_age_in_ms | ...
-----------------+--------+-------------+-------------------------+
 jan-2025-close  | TAG    | ...         | NULL                    |
 main            | BRANCH | ...         | NULL                    |

Notice that main is also listed—it's the default branch that every Iceberg table has.
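If your Spark and Iceberg versions support it, branches and tags can also be referenced directly in the table identifier with a branch_<name> or tag_<name> suffix, as an alternative to VERSION AS OF. A quick sketch (the backticks are needed because this tag name contains hyphens):

```sql
-- Read the main branch via its identifier (equivalent to VERSION AS OF 'main')
SELECT COUNT(*) FROM demo.finance.transactions.branch_main;

-- Read a tag the same way
SELECT COUNT(*) FROM demo.finance.transactions.`tag_jan-2025-close`;
```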


Step 3: Continue Normal Operations

February arrives. New transactions come in:

INSERT INTO transactions VALUES
    (1006, 100, CAST('2025-02-01' AS DATE), -89.99, 'Software subscription', true),
    (1007, 102, CAST('2025-02-03' AS DATE), -1200.00, 'Contractor payment', true),
    (1008, 101, CAST('2025-02-05' AS DATE), 4500.00, 'Client payment received', true);

The current table now has 8 transactions:

SELECT COUNT(*) FROM transactions;

Step 4: Query the Tagged Snapshot

The auditors arrive. They want to see exactly what the books looked like at January close. No problem:

SELECT *
FROM transactions VERSION AS OF 'jan-2025-close'
ORDER BY transaction_id;

Only the original 5 transactions. The February data doesn't exist at that tag.

Let's verify the account balances match what was reported:

SELECT
    account_id,
    SUM(amount) AS balance_at_jan_close
FROM transactions VERSION AS OF 'jan-2025-close'
GROUP BY account_id
ORDER BY account_id;

You should see:

 account_id | balance_at_jan_close
------------+----------------------
        100 |              4850.00
        101 |              8800.00
        102 |              8500.00

These numbers are immutable. No matter what happens to the table going forward, this tag will always show the same data.


Step 5: Create a Staging Branch

Now let's tackle a more complex scenario. Your ETL pipeline has a new batch of transactions to load, but you want to validate them before they hit production.

Create a staging branch:

ALTER TABLE transactions CREATE BRANCH staging;

Check the refs:

SELECT name, type, snapshot_id
FROM demo.finance.transactions.refs;

You now have three refs: main, staging, and jan-2025-close. Notice that staging starts at the same snapshot as main.


Step 6: Write to the Staging Branch

To write to a branch using the Write-Audit-Publish pattern, you must first enable WAP on the table:

ALTER TABLE demo.finance.transactions
    SET TBLPROPERTIES ('write.wap.enabled' = 'true');

Then set the branch you want to use as the write target:

SET spark.wap.branch = staging;

Now any writes go to the staging branch, not main. Load the new batch:

INSERT INTO transactions VALUES
    (1009, 100, CAST('2025-02-10' AS DATE), -500.00, 'Consulting fee', false),
    (1010, 103, CAST('2025-02-10' AS DATE), 15000.00, 'New client deposit', false),
    (1011, 101, CAST('2025-02-11' AS DATE), -750.00, 'Travel expenses', false),
    (1012, 100, CAST('2025-02-12' AS DATE), 2000.00, 'Refund received', false);

Now check both branches:

-- Production (main branch) - need to query explicitly
SELECT COUNT(*) AS main_count
FROM transactions VERSION AS OF 'main';

-- Staging branch (current write target, also queryable by name)
SELECT COUNT(*) AS staging_count
FROM transactions VERSION AS OF 'staging';

Production has 8 transactions. Staging has 12. The branches have diverged.
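You can see the divergence in the table's snapshot metadata: each branch head is its own snapshot with its own parent chain. A quick way to inspect this, using the snapshots metadata table:

```sql
SELECT committed_at, snapshot_id, parent_id, operation
FROM demo.finance.transactions.snapshots
ORDER BY committed_at;
```

The snapshot created by the staged INSERT has the previous main-branch snapshot as its parent, but main itself still points at the older snapshot.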


Step 7: Validate the Staged Data

Before publishing, let's run some data quality checks on staging:

-- Check for any negative balances (business rule violation)
SELECT
    account_id,
    SUM(amount) AS balance
FROM transactions VERSION AS OF 'staging'
GROUP BY account_id
HAVING SUM(amount) < 0;
The query returns no rows.

No negative balances—good.

-- Check for unverified transactions (need review)
SELECT *
FROM transactions VERSION AS OF 'staging'
WHERE verified = false;

The new transactions are unverified. In a real workflow, you might:

  1. Run automated validation rules
  2. Send alerts if anomalies are detected
  3. Have a human review and approve

You can check that production is still untouched:

SELECT COUNT(*)
FROM transactions VERSION AS OF 'main'
WHERE verified = false;

Zero unverified in production.

Let's mark the transactions in staging as verified (remember, we're still writing to staging):

UPDATE transactions
SET verified = true
WHERE verified = false;

Verify the update (still on staging):

SELECT transaction_id, verified
FROM transactions VERSION AS OF 'staging'
WHERE transaction_id >= 1009;

All verified.

Step 8: Publish the Staged Changes

The data is validated. Time to publish to production.

First, let's see where each branch points:

SELECT name, type, snapshot_id
FROM demo.finance.transactions.refs;

Staging is ahead of main. To publish, we fast-forward main to staging's snapshot:

CALL system.fast_forward('demo.finance.transactions', 'main', 'staging');

This moves the main branch pointer to match staging. No data is copied—it's just a metadata update.
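You can confirm this by checking the refs again. After the fast-forward, main and staging should point at the same snapshot:

```sql
SELECT name, type, snapshot_id
FROM demo.finance.transactions.refs
WHERE name IN ('main', 'staging');
```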

Now reset the write target back to main and turn off WAP mode. The write target is controlled by the spark.wap.branch session setting, so reset that first, then unset the table property:

RESET spark.wap.branch;

ALTER TABLE demo.finance.transactions
    UNSET TBLPROPERTIES ('write.wap.enabled');

Verify production now has all the data:

SELECT * FROM transactions ORDER BY transaction_id;

All 12 transactions, all verified.
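If you want an audit trail of how main got here, the history metadata table lists every snapshot that has ever been current on the branch, including the staged commits that are now ancestors of main:

```sql
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM demo.finance.transactions.history
ORDER BY made_current_at;
```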


Step 9: Clean Up the Branch

The staging branch served its purpose. You can drop it:

ALTER TABLE transactions DROP BRANCH staging;

Check refs:

SELECT name, type FROM demo.finance.transactions.refs;

Just main and jan-2025-close remain.


Step 10: Create Another Tag

Let's tag February's close:

ALTER TABLE transactions CREATE TAG `feb-2025-close`;

Now you can compare any two points in time:


SELECT period, account_id, balance FROM (
-- January balances
SELECT 'January' AS period, account_id, SUM(amount) AS balance
FROM transactions VERSION AS OF 'jan-2025-close'
GROUP BY account_id
UNION ALL
-- February balances
SELECT 'February' AS period, account_id, SUM(amount) AS balance
FROM transactions VERSION AS OF 'feb-2025-close'
GROUP BY account_id
) results
ORDER BY period, account_id;

You should see:

  period  | account_id | balance
----------+------------+----------
 February |        100 |  6260.01
 February |        101 | 12550.00
 February |        102 |  7300.00
 February |        103 | 15000.00
 January  |        100 |  4850.00
 January  |        101 |  8800.00
 January  |        102 |  8500.00

This is incredibly powerful for financial reporting, audit trails, and trend analysis.


Step 11: Tag Management

Tags can have retention policies. Create a tag that auto-expires:

ALTER TABLE transactions CREATE TAG `temp-debug-tag` RETAIN 7 DAYS;

After 7 days, this tag will be eligible for cleanup during snapshot expiration.
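The cleanup itself happens when you run snapshot expiration (covered in the next exercise). As a sketch of what that call looks like—don't run it now, since expiring snapshots would break the time travel used in this exercise, and the timestamp here is purely illustrative:

```sql
-- Illustrative only: expire snapshots older than the given timestamp.
-- Tags past their RETAIN window become eligible for removal, along with
-- snapshots no longer reachable from any remaining ref.
CALL system.expire_snapshots(
    table => 'demo.finance.transactions',
    older_than => TIMESTAMP '2025-03-01 00:00:00'
);
```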

List all tags with their properties:

SELECT
    name,
    type,
    snapshot_id,
    max_reference_age_in_ms / 86400000 AS retention_days
FROM demo.finance.transactions.refs
WHERE type = 'TAG';

The result is:

      name      | type |     snapshot_id     | retention_days
----------------+------+---------------------+----------------
 feb-2025-close | TAG  | 1372680281834730592 | NULL
 jan-2025-close | TAG  | 3726597251635442711 | NULL
 temp-debug-tag | TAG  | 1372680281834730592 | 7.0

Drop tags you no longer need to keep your table's refs tidy:

ALTER TABLE transactions DROP TAG `temp-debug-tag`;

Step 12: Clean Up

Exit Spark SQL:

exit;

If you're done:

docker compose down -v

What You Learned

  • Tags are named, immutable references to snapshots—perfect for compliance and reproducibility
  • Branches enable isolated writes without affecting production
  • Write-Audit-Publish (WAP) workflow: stage → validate → fast-forward
  • fast_forward publishes a branch by moving the main pointer
  • Tags and branches are metadata-only operations—instant and cheap

Practical Use Cases

Tags:

  • Regulatory compliance ("this is Q4 2024 as reported to the SEC")
  • Machine learning reproducibility ("model v2.3 was trained on this snapshot")
  • Release markers ("before-migration", "after-migration")

Branches:

  • ETL validation before publish
  • A/B testing different data transformations
  • Experimentation without risk to production
  • Blue/green data deployments

Next Up

In Exercise 6, we'll look at table maintenance—compacting small files, expiring old snapshots, and keeping your Iceberg tables healthy over time.

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.
