
Software Practice Lead
So far, you've worked with snapshots by their IDs or timestamps. That works, but try explaining to your CFO that "the quarterly report uses snapshot 7851662191363948742." Not exactly user-friendly.
Tags and branches bring Git-like semantics to Iceberg tables:
Tags are named references to specific snapshots. Think of them as bookmarks: "this is the data we used for the Q4 report" or "this is the state before the big migration."
Branches are independent lines of development. You can write to a branch without affecting the main table, validate the changes, and then either publish them or throw them away. This is the foundation of Write-Audit-Publish (WAP) workflows.
In this exercise, you'll tag important snapshots for compliance, create a staging branch, load data into it, validate the data, and publish it to production.
By the end of this exercise, you will be able to:
Start the environment if needed:
docker compose up -dRun Spark SQL:
docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"Create a table representing a financial ledger—something where data quality matters:
CREATE NAMESPACE IF NOT EXISTS demo.finance;
USE demo.finance;
DROP TABLE IF EXISTS transactions;
CREATE TABLE transactions (
transaction_id BIGINT,
account_id BIGINT,
transaction_date DATE,
amount DECIMAL(15, 2),
description STRING,
verified BOOLEAN
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');Load some initial verified transactions:
INSERT INTO transactions VALUES
(1001, 100, CAST('2025-01-02' AS DATE), 5000.00, 'Initial deposit', true),
(1002, 100, CAST('2025-01-05' AS DATE), -150.00, 'Office supplies', true),
(1003, 101, CAST('2025-01-03' AS DATE), 12000.00, 'Initial deposit', true),
(1004, 101, CAST('2025-01-10' AS DATE), -3200.00, 'Equipment purchase', true),
(1005, 102, CAST('2025-01-04' AS DATE), 8500.00, 'Initial deposit', true);Verify the data:
SELECT * FROM transactions ORDER BY transaction_id;It's January 31st. The finance team needs to close the books. Let's tag the current state so we can always reproduce January's numbers:
ALTER TABLE transactions CREATE TAG `jan-2025-close`;Verify the tag was created:
SELECT * FROM demo.finance.transactions.refs;You should see:
name | type | snapshot_id | max_reference_age_in_ms | ...
-----------------+--------+-------------+-------------------------+
jan-2025-close | TAG | ... | NULL |
main | BRANCH | ... | NULL |Notice that main is also listed—it's the default branch that every Iceberg table has.
February arrives. New transactions come in:
INSERT INTO transactions VALUES
(1006, 100, CAST('2025-02-01' AS DATE), -89.99, 'Software subscription', true),
(1007, 102, CAST('2025-02-03' AS DATE), -1200.00, 'Contractor payment', true),
(1008, 101, CAST('2025-02-05' AS DATE), 4500.00, 'Client payment received', true);The current table now has 8 transactions:
SELECT COUNT(*) FROM transactions;The auditors arrive. They want to see exactly what the books looked like at January close. No problem:
SELECT *
FROM transactions VERSION AS OF 'jan-2025-close'
ORDER BY transaction_id;Only the original 5 transactions. The February data doesn't exist at that tag.
Let's verify the account balances match what was reported:
SELECT
account_id,
SUM(amount) AS balance_at_jan_close
FROM transactions VERSION AS OF 'jan-2025-close'
GROUP BY account_id
ORDER BY account_id;You should see:
account_id | balance_at_jan_close
------------+----------------------
100 | 4850.00
101 | 8800.00
102 | 8500.00These numbers are immutable. No matter what happens to the table going forward, this tag will always show the same data.
Now let's tackle a more complex scenario. Your ETL pipeline has a new batch of transactions to load, but you want to validate them before they hit production.
Create a staging branch:
ALTER TABLE transactions CREATE BRANCH staging;Check the refs:
SELECT name, type, snapshot_id
FROM demo.finance.transactions.refs;You now have three refs: main, staging, and jan-2025-close. Notice that staging starts at the same snapshot as main.
To write to a branch using Write Audit Publish, you must first enable it
ALTER TABLE demo.finance.transactions
SET TBLPROPERTIES ('write.wap.enabled' = 'true');Then you must set which branch you want to use as the write target:
SET spark.wap.branch = staging;Now any writes go to the staging branch, not main. Load the new batch:
INSERT INTO transactions VALUES
(1009, 100, CAST('2025-02-10' AS DATE), -500.00, 'Consulting fee', false),
(1010, 103, CAST('2025-02-10' AS DATE), 15000.00, 'New client deposit', false),
(1011, 101, CAST('2025-02-11' AS DATE), -750.00, 'Travel expenses', false),
(1012, 100, CAST('2025-02-12' AS DATE), 2000.00, 'Refund received', false);Now check both branches:
-- Production (main branch) - need to query explicitly
SELECT COUNT(*) AS main_count
FROM transactions VERSION AS OF 'main';
-- Staging branch (current write target, also queryable by name)
SELECT COUNT(*) AS staging_count
FROM transactions VERSION AS OF 'staging';Production has 8 transactions. Staging has 12. The branches have diverged.
Before publishing, let's run some data quality checks on staging:
-- Check for any negative balances (business rule violation)
SELECT
account_id,
SUM(amount) AS balance
FROM transactions VERSION AS OF 'staging'
GROUP BY account_id
HAVING SUM(amount) < 0;account_id balance
Time taken: 0.222 secondsNo negative balances—good.
-- Check for unverified transactions (need review)
SELECT *
FROM transactions VERSION AS OF 'staging'
WHERE verified = false;The new transactions are unverified. In a real workflow, you might:
You can check that Production is still untouched:
SELECT COUNT(*)
FROM transactions VERSION AS OF 'main'
WHERE verified = false;Zero unverified in production.
Let's mark the transactions in staging as verified (remember, we're still writing to staging):
UPDATE transactions
SET verified = true
WHERE verified = false;Verify the update (still on staging):
SELECT transaction_id, verified
FROM transactions VERSION AS OF 'staging'
WHERE transaction_id >= 1009;The data is validated. Time to publish to production.
First, let's see where each branch points:
SELECT name, type, snapshot_id
FROM demo.finance.transactions.refs;Staging is ahead of main. To publish, we fast-forward main to staging's snapshot:
CALL system.fast_forward('demo.finance.transactions', 'main', 'staging');This moves the main branch pointer to match staging. No data is copied—it's just a metadata update.
Now reset the write target back to main:
ALTER TABLE demo.finance.transactions
UNSET TBLPROPERTIES ('write.wap.enabled');Verify production now has all the data:
SELECT * FROM transactions ORDER BY transaction_id;All 12 transactions, all verified.
The staging branch served its purpose. You can drop it:
ALTER TABLE transactions DROP BRANCH staging;Check refs:
SELECT name, type FROM demo.finance.transactions.refs;Just main and jan-2025-close remain.
Let's tag February's close:
ALTER TABLE transactions CREATE TAG `feb-2025-close`;Now you can compare any two points in time:
SELECT period, account_id, balance FROM (
-- January balances
SELECT 'January' AS period, account_id, SUM(amount) AS balance
FROM transactions VERSION AS OF 'jan-2025-close'
GROUP BY account_id
UNION ALL
-- February balances
SELECT 'February' AS period, account_id, SUM(amount) AS balance
FROM transactions VERSION AS OF 'feb-2025-close'
GROUP BY account_id
) results
ORDER BY period, account_id;You should see:
period account_id balance
February 100 6260.01
February 101 12550.00
February 102 7300.00
February 103 15000.00
January 100 4850.00
January 101 8800.00
January 102 8500.00This is incredibly powerful for financial reporting, audit trails, and trend analysis.
Tags can have retention policies. Create a tag that auto-expires:
ALTER TABLE transactions CREATE TAG `temp-debug-tag` RETAIN 7 DAYS;After 7 days, this tag will be eligible for cleanup during snapshot expiration.
List all tags with their properties:
SELECT
name,
type,
snapshot_id,
max_reference_age_in_ms / 86400000 AS retention_days
FROM demo.finance.transactions.refs
WHERE type = 'TAG';The result is:
name type snapshot_id retention_days
feb-2025-close TAG 1372680281834730592 NULL
jan-2025-close TAG 3726597251635442711 NULL
temp-debug-tag TAG 1372680281834730592 7.0Make sure to frequently drop tags you no longer need to keep your table refs tidy:
ALTER TABLE transactions DROP TAG `temp-debug-tag`;Exit Spark SQL:
exit;If you're done:
docker compose down -vTags:
Branches:
In Exercise 6, we'll look at table maintenance—compacting small files, expiring old snapshots, and keeping your Iceberg tables healthy over time.
We will only share developer content and updates, including notifications when new content is added. We will never send you sales emails. 🙂 By subscribing, you understand we will process your personal information in accordance with our Privacy Statement.