course: Apache Iceberg™️

Hands-On Exercise: Table Maintenance


Gilles Philippart

Software Practice Lead

Exercise 6: Table Maintenance

Overview

You've created tables, evolved schemas, partitioned data, time-traveled through snapshots, and used tags and branches. Everything works beautifully in development. But what happens after six months in production?

Here's the reality: streaming jobs that write every minute create thousands of tiny files. Each UPDATE and DELETE operation creates new snapshots. Old data files accumulate even after they're no longer referenced. Left unchecked, your table becomes a mess of small files that kill query performance and waste storage.

Iceberg provides several maintenance operations to keep tables healthy:

  • Compaction: Merge small files into larger ones
  • Snapshot expiration: Remove old snapshots you no longer need
  • Orphan file cleanup: Delete data files that aren't referenced by any snapshot

In this exercise, you'll deliberately create a fragmented table, observe the performance impact, and then clean it up.

Learning Objectives

By the end of this exercise, you will be able to:

  • Identify file fragmentation using metadata tables
  • Understand delete files and how Iceberg handles DELETEs
  • Compact small files and merge delete files using optimize
  • Expire old snapshots with expire_snapshots
  • Remove orphan files with remove_orphan_files
  • Understand when and why to run each maintenance operation

Prerequisites

  • Completed Exercise 5 (or have the environment running)
  • About 25 minutes

Step 1: Set Up

Start the environment if needed:

docker compose up -d

Run Spark SQL with extra configuration so it can access data directly from the S3-compatible object storage (MinIO):

docker compose exec -it spark-iceberg spark-sql \
    --packages org.apache.hadoop:hadoop-aws:3.3.4 \
    --conf "spark.hadoop.hive.cli.print.header=true" \
    --conf "spark.hadoop.fs.s3a.endpoint=http://minio:9000" \
    --conf "spark.hadoop.fs.s3a.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
    --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
    --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=false"

Create a fresh table:

CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;
USE demo.ecommerce;

DROP TABLE IF EXISTS events;

CREATE TABLE events (
    event_id BIGINT,
    user_id BIGINT,
    event_type STRING,
    event_time TIMESTAMP,
    properties STRING
)
USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
    'format-version'='2',
    'write.format.default'='parquet',
    'write.update.mode'='copy-on-write',
    'write.delete.mode'='merge-on-read'
);

Step 2: Create File Fragmentation

In production, fragmentation often comes from streaming ingestion—small batches arriving frequently. Let's simulate that:

-- Batch 1
INSERT INTO events VALUES
    (1, 101, 'page_view', CAST('2025-01-15 10:00:00' AS TIMESTAMP), '{"page": "/home"}'),
    (2, 102, 'page_view', CAST('2025-01-15 10:01:00' AS TIMESTAMP), '{"page": "/products"}');

-- Batch 2
INSERT INTO events VALUES
    (3, 101, 'click', CAST('2025-01-15 10:02:00' AS TIMESTAMP), '{"button": "add_to_cart"}'),
    (4, 103, 'page_view', CAST('2025-01-15 10:03:00' AS TIMESTAMP), '{"page": "/checkout"}');

-- Batch 3
INSERT INTO events VALUES
    (5, 102, 'purchase', CAST('2025-01-15 10:05:00' AS TIMESTAMP), '{"amount": 99.99}'),
    (6, 101, 'page_view', CAST('2025-01-15 10:06:00' AS TIMESTAMP), '{"page": "/confirmation"}');

-- Batch 4
INSERT INTO events VALUES
    (7, 104, 'page_view', CAST('2025-01-15 10:10:00' AS TIMESTAMP), '{"page": "/home"}'),
    (8, 104, 'click', CAST('2025-01-15 10:11:00' AS TIMESTAMP), '{"button": "signup"}');

-- Batch 5
INSERT INTO events VALUES
    (9, 104, 'signup', CAST('2025-01-15 10:12:00' AS TIMESTAMP), '{"method": "email"}'),
    (10, 105, 'page_view', CAST('2025-01-15 10:15:00' AS TIMESTAMP), '{"page": "/products"}');

That's 10 rows spread across 5 INSERT statements. Each INSERT creates a new data file.


Step 3: Inspect the Fragmentation

Let's see the damage:

SELECT
    file_path,
    record_count,
    file_size_in_bytes
FROM demo.ecommerce.events.files;

You should see 5 files, each containing only 2 records. In a real system with millions of rows, you might have thousands of files with a few hundred records each. That's a problem because:

  1. Query overhead: Each file requires metadata lookups and I/O operations
  2. Storage inefficiency: Small files don't compress well and have high overhead
  3. Planning time: The query planner has to process every file's metadata
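The cost of fragmentation can be sketched with some back-of-envelope arithmetic. The numbers below are illustrative assumptions (a fixed per-file planning/open cost and a sequential scan rate), not measurements, but they show why per-file overhead dominates when files are tiny:

```python
# Illustrative model (hypothetical numbers): each file carries a fixed
# metadata-lookup/open cost on top of the raw scan time.
PER_FILE_OVERHEAD_MS = 5    # assumed planning + open cost per file
SCAN_RATE_MB_PER_S = 200    # assumed sequential read throughput

def scan_cost_ms(num_files: int, total_mb: float) -> float:
    """Rough query cost: fixed per-file overhead plus raw scan time."""
    return num_files * PER_FILE_OVERHEAD_MS + total_mb / SCAN_RATE_MB_PER_S * 1000

# The same 1 GB of data, fragmented vs compacted into ~128 MB files:
fragmented = scan_cost_ms(num_files=10_000, total_mb=1024)
compacted = scan_cost_ms(num_files=8, total_mb=1024)

print(f"fragmented: {fragmented:.0f} ms, compacted: {compacted:.0f} ms")
```

With these assumptions, the fragmented table spends roughly ten times longer on per-file overhead than on actually reading data.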

Check the snapshots too:

SELECT
    snapshot_id,
    committed_at,
    operation,
    summary['added-data-files'] AS files_added,
    summary['total-records'] AS total_records
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You should see something like this:

snapshot_id	committed_at	operation	files_added	total_records
8784372252740323981	2026-01-08 16:57:39.894	append	1	2
6087226536736430910	2026-01-08 16:57:40.06	append	1	4
6213127806288664064	2026-01-08 16:57:40.219	append	1	6
3242568885727246813	2026-01-08 16:57:40.419	append	1	8
2436576844827260458	2026-01-08 16:57:40.603	append	1	10

Five snapshots, one per INSERT.


Step 4: Create Snapshot History with Updates

Now let's run some UPDATEs. With copy-on-write mode, each UPDATE reads the file containing the target row(s), writes a new file with the changes, and creates a new snapshot pointing to the new file:

-- Fix a typo in event type
UPDATE events
SET event_type = 'page_view'
WHERE event_id = 3;

-- Update properties
UPDATE events
SET properties = '{"page": "/home", "referrer": "google"}'
WHERE event_id = 1;

-- Mark an event as processed
UPDATE events
SET properties = '{"amount": 99.99, "processed": true}'
WHERE event_id = 5;

Check the snapshot history:

SELECT
    snapshot_id,
    committed_at,
    operation,
    summary['total-data-files'] AS data_files
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You'll see 3 new snapshots (one per UPDATE), bringing the total to 8 snapshots. Each UPDATE operation replaced a data file with an updated version. The current snapshot still shows ~5 data files, but the old versions of those files remain on disk, referenced only by previous snapshots (to support time travel).
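Copy-on-write can be modeled as immutable sets of file names, one per snapshot. This is a toy model (not Iceberg's real data structures) just to show why old file versions stick around:

```python
# Toy model of copy-on-write: each snapshot is an immutable set of data-file
# names. An UPDATE rewrites the affected file and commits a new snapshot.
snapshots = []

def commit(files: frozenset) -> None:
    snapshots.append(files)

# Five INSERTs -> five files, five snapshots
files = frozenset()
for i in range(1, 6):
    files = files | {f"data-{i}.parquet"}
    commit(files)

# A copy-on-write UPDATE touching data-3.parquet: the file is replaced, not edited
files = (files - {"data-3.parquet"}) | {"data-3-v2.parquet"}
commit(files)

# The old file is gone from the current snapshot, but the previous snapshot
# still references it, so it cannot be physically deleted yet (time travel).
print("data-3.parquet" in snapshots[-1])   # False
print("data-3.parquet" in snapshots[-2])   # True
```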


Step 5: Delete Files and Positional Deletes

Now let's see what happens when you delete rows. Iceberg uses positional delete files to track deleted rows without rewriting data files:

-- Delete some events
DELETE FROM events
WHERE event_id IN (7, 8);

-- Delete another event
DELETE FROM events
WHERE event_id = 10;

Now let's inspect the files and see delete files appear:

SELECT
    content,
    file_path,
    record_count,
    file_size_in_bytes
FROM demo.ecommerce.events.files
ORDER BY content, file_path;

Look at the content column:

  • 0 = Data file (contains actual rows)
  • 1 = Positional delete file (marks rows as deleted)

You should see new files with content = 1. These are delete files that record which positions (row numbers) in data files have been deleted.

Key insight: With merge-on-read mode, Iceberg doesn't immediately rewrite data files when you delete rows. Instead, it creates small delete files that query engines must read and apply. This is efficient for small deletes but accumulates overhead over time.
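The read-side merge can be sketched in a few lines. This is a simplified model (real positional delete files store file paths and row positions in Parquet, and readers apply them during the scan), with hypothetical file names:

```python
# Sketch of merge-on-read: a positional delete file records
# (data_file, row_position) pairs; the reader filters those rows out on the fly.
data_rows = ["page_view", "click", "purchase", "page_view"]  # rows by position

# Hypothetical delete file content: positions 1 and 3 of data-1.parquet
position_deletes = {("data-1.parquet", 1), ("data-1.parquet", 3)}

def read_with_deletes(file_name, rows, deletes):
    """Yield only rows whose (file, position) is not marked deleted."""
    return [row for pos, row in enumerate(rows)
            if (file_name, pos) not in deletes]

live_rows = read_with_deletes("data-1.parquet", data_rows, position_deletes)
print(live_rows)  # ['page_view', 'purchase']
```

Writes stay cheap (just append a small delete file), but every query pays the filtering cost until compaction applies the deletes for good.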

Check how many delete files we have:

SELECT
    content,
    CASE content
        WHEN 0 THEN 'DATA'
        WHEN 1 THEN 'POSITION_DELETES'
        WHEN 2 THEN 'EQUALITY_DELETES'
    END AS file_type,
    COUNT(*) AS file_count,
    SUM(record_count) AS total_records
FROM demo.ecommerce.events.files
GROUP BY content
ORDER BY content;

You should get:

content	file_type	file_count	total_records
0	DATA	5	10
1	POSITION_DELETES	2	3

Step 6: Compact the Files

Time to clean up. We need to compact the data files AND apply the delete files. The key is using the right options:

CALL system.rewrite_data_files(
    table => 'demo.ecommerce.events',
    options => map(
        'rewrite-all', 'true'
    )
);

The 'rewrite-all' => 'true' option forces rewriting of all data files regardless of size.

This is the result you should see:

rewritten_data_files_count      added_data_files_count  rewritten_bytes_count   failed_data_files_count
5       1       8922    0

The rewrite_data_files procedure applies the positional deletes, physically removing the 3 deleted rows. This is the process colloquially known as "compaction".
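Conceptually, the rewrite reads every small data file, drops the rows marked in the positional delete files, and writes the survivors into one new file. A toy sketch (hypothetical file names and rows, not the real rewrite_data_files internals):

```python
# Toy compaction: merge small files and apply positional deletes in one pass.
small_files = {
    "data-1.parquet": [(1, "page_view"), (2, "page_view")],
    "data-2.parquet": [(3, "click"), (4, "page_view")],
}
# Hypothetical positional delete: row position 0 of data-2.parquet
position_deletes = {("data-2.parquet", 0)}

compacted = []
for name, rows in sorted(small_files.items()):
    for pos, row in enumerate(rows):
        if (name, pos) not in position_deletes:
            compacted.append(row)  # survivors go into the single new file

print(len(compacted))  # 3 rows remain; the delete is now physically applied
```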

Let's verify what we have after compaction:

Check the results:

SELECT
    content,
    file_path,
    record_count,
    file_size_in_bytes
FROM demo.ecommerce.events.files;

You should see:

  • 1 data file with 7 records (the compacted file with deletes applied)
  • 2 delete files still showing (but they're now dangling)

Wait, delete files are still there? Yes! But notice the data file has only 7 rows: the deletes were physically applied during compaction. The delete files you see are now dangling, referencing old data files that were replaced. Don't worry, we'll clean them up in later steps.

Verify the data:

SELECT * FROM events ORDER BY event_id;

You should see 7 rows (10 original - 3 deleted). The orphaned delete files don't affect queries—they're just metadata remnants that will be cleaned up soon.


Step 7: Understand Snapshot Accumulation

Compaction helped with files, but look at the snapshots:

SELECT
    snapshot_id,
    committed_at,
    operation
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You now have many snapshots: 5 INSERTs (append operations), 3 UPDATEs (overwrite operations), 2 DELETEs (overwrite operations), and 1 compaction (replace operation). In production, after months of operations, you might have thousands.

Remember: each snapshot prevents the data files it references from being deleted (to support time travel), but it takes up space in the metadata and slows down metadata operations.
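Why does expiring snapshots free up files? A data file can only be deleted once no retained snapshot references it. A small sketch with hypothetical snapshots and file names:

```python
# A file is deletable only if it is unreachable from every retained snapshot.
snapshots = {
    "s1": {"data-1.parquet"},
    "s2": {"data-1.parquet", "data-2.parquet"},
    "s3": {"data-1-v2.parquet", "data-2.parquet"},  # current snapshot
}

retained = {"s3"}  # e.g. expire everything but the latest
reachable = set().union(*(snapshots[s] for s in retained))
all_files = set().union(*snapshots.values())

deletable = all_files - reachable
print(sorted(deletable))  # ['data-1.parquet']: only expired snapshots used it
```

This is essentially the bookkeeping that expire_snapshots performs for you, which is why its output reports counts of deleted data files, manifests, and manifest lists.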


Step 8: Expire Old Snapshots

If you don't need to time-travel back to every single state, you can expire old snapshots. First, check your latest snapshot timestamp:

SELECT MAX(committed_at) FROM demo.ecommerce.events.snapshots;

Then expire all old snapshots, keeping only the most recent one:

CALL system.expire_snapshots(
    table => 'demo.ecommerce.events',
    older_than => TIMESTAMP '2026-01-09 12:00:00',  -- replace with the max timestamp from above
    retain_last => 1  -- Always keep at least this many snapshots.
);

The older_than parameter tells Iceberg to expire snapshots committed before this timestamp, while retain_last ensures you always keep at least 1 snapshot (even if it's older than the timestamp). In production, you'd calculate the timestamp dynamically (e.g., current date minus 7 days) to keep a week of history.
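If you drive Spark from a script, the rolling cutoff is easy to compute. A sketch assuming PySpark, where the `stmt` string would be passed to `spark.sql(...)`:

```python
# Build an expire_snapshots call with a dynamic 7-day retention cutoff.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # keep one week of history
cutoff = datetime.now(timezone.utc) - RETENTION

stmt = (
    "CALL system.expire_snapshots("
    "table => 'demo.ecommerce.events', "
    f"older_than => TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}', "
    "retain_last => 1)"
)
print(stmt)
```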

This procedure should return:

deleted_data_files_count        deleted_position_delete_files_count     deleted_equality_delete_files_count     deleted_manifest_files_count      deleted_manifest_lists_count    deleted_statistics_files_count
8       0       0       11      10      0

Check what's left:

SELECT
    snapshot_id,
    committed_at,
    operation
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

Only the most recent snapshot remains. The old ones are gone.

Warning: Once snapshots are expired, you can't time-travel to them anymore. The retention count parameter protects recent snapshots.


Step 9: Remove Orphan Files

Snapshot expiration removes the metadata for old snapshots, but the actual data files they referenced are still on disk. These are now orphans—files that no snapshot references.

Let's check for orphans by comparing files in storage vs. files in metadata:

-- Files currently referenced by the table
SELECT COUNT(*) AS referenced_files FROM demo.ecommerce.events.files;
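The idea behind orphan detection is a set difference: anything sitting in storage that no snapshot's metadata references is an orphan. A toy sketch with hypothetical paths:

```python
# Orphan detection as a set difference between storage and metadata.
files_in_storage = {
    "data/00000-1.parquet",
    "data/00000-2.parquet",
    "data/00000-2-old.parquet",   # left behind by a rewrite
}
files_in_metadata = {              # union of files referenced by all snapshots
    "data/00000-1.parquet",
    "data/00000-2.parquet",
}

orphans = files_in_storage - files_in_metadata
print(sorted(orphans))  # ['data/00000-2-old.parquet']
```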

The orphan files are still in MinIO, taking up space. You can check it in a different terminal by running:

docker compose exec minio mc ls /data/warehouse/ecommerce/events/data/event_time_day=2025-01-15

On my machine I got:

[2026-01-15 14:44:56 UTC] 4.0KiB 00000-33-f6241d16-c455-46c9-a326-536296bdbb70-00001-deletes.parquet/
[2026-01-15 14:44:56 UTC] 4.0KiB 00000-35-94c7b692-f74f-4dcf-a891-dcfacc87d1fd-00001-deletes.parquet/
[2026-01-15 14:46:41 UTC] 4.0KiB 00000-41-3afb190b-f9c2-433b-b8df-d335c3f5665c-0-00001.parquet/

To clean them up, we would normally use:

CALL system.remove_orphan_files(
    table => 'demo.ecommerce.events',
    older_than => TIMESTAMP 'xxxx-xx-xx 00:00:00'  -- must be at least 24 hours ago
);

Now, Iceberg is cautious by default. The cleanup procedure enforces a minimum 24-hour waiting period before it actually deletes files. This is a safety measure to make sure we don't accidentally toss out data that a long-running query might still be reading.
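The safety rule can be sketched as a simple modification-time filter: only files older than the cutoff are deletion candidates, so a file that an in-flight job wrote moments ago is never touched (hypothetical paths and timestamps below):

```python
# Only orphans older than the 24-hour cutoff are safe to delete.
import time

now = time.time()
cutoff = now - 24 * 3600  # anything newer than 24h is off-limits

# Hypothetical orphan candidates with their last-modified times
candidates = {
    "data/old-orphan.parquet": now - 2 * 24 * 3600,  # two days old
    "data/in-flight.parquet": now - 60,              # written a minute ago
}

safe_to_delete = [f for f, mtime in candidates.items() if mtime < cutoff]
print(safe_to_delete)  # ['data/old-orphan.parquet']
```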

If you really want to see files removed and don't want to wait until tomorrow, there is a workaround using the Action API directly, but that’s a bit outside the scope of this exercise (it involves launching a spark-shell instead of spark-sql with the parameters at the beginning of the exercise).

For now, let’s focus on those two dangling delete files still hanging around in our metadata. Even though the data they pointed to is gone after our compaction and snapshot expiration, the metadata doesn't know that yet.

To tidy things up, we need to rewrite the position delete files:

CALL system.rewrite_position_delete_files(
    table => 'demo.ecommerce.events',
    options => map('rewrite-all', 'true')
);

The 'rewrite-all' => 'true' option forces the procedure to process all delete files, including dangling ones. It should return:

rewritten_delete_files_count    added_delete_files_count    rewritten_bytes_count    added_bytes_count
2                               0                           3217                     0

Because the 2 dangling delete files no longer reference any live data, they were consolidated into nothing, effectively removing them from the metadata.


Step 10: Verify the Cleanup

Now let's verify everything is clean:

-- Should have 7 records
SELECT COUNT(*) AS total_records FROM demo.ecommerce.events;

-- Should have 1 data file (no more delete files!)
SELECT
    content,
    CASE content
        WHEN 0 THEN 'DATA'
        WHEN 1 THEN 'POSITION_DELETES'
        WHEN 2 THEN 'EQUALITY_DELETES'
    END AS file_type,
    COUNT(*) AS file_count
FROM demo.ecommerce.events.files
GROUP BY content
ORDER BY content;

-- Should have 2 snapshots of type "replace"
SELECT COUNT(*) AS snapshot_count FROM demo.ecommerce.events.snapshots;

Perfect! You've completed the maintenance workflow successfully:

  1. Compacted 5 small data files into 1 and applied the deletes (7 rows remain)
  2. Expired 10 old snapshots, keeping only the most recent (which also deleted 8 data files that were no longer referenced)
  3. Cleaned up 2 dangling delete files from the metadata

(Full orphan file removal is deferred by the 24-hour safety window we saw in Step 9.)

The table is now compact and clean!


Step 11: Clean Up

Exit Spark SQL:

exit;

If you're done:

docker compose down -v

What You Learned

You've seen a complete maintenance sequence:

  1. Compact data files (apply deletes)
  2. Expire snapshots
  3. Remove orphan files
  4. Rewrite position delete files

Next Up

Congratulations! You've completed the Iceberg fundamentals series.

For further exploration, consider how to move data from a streaming platform such as Kafka + Flink into Iceberg tables. The simplest and most performant way to do this is Confluent Tableflow.

If you want to keep experimenting with open source tools, I've written a blog series describing how to do it with Apache Kafka, Apache Flink, and Apache Polaris: https://medium.com/@gilles.philippart/build-a-data-lakehouse-with-apache-iceberg-polaris-trino-minio-349c534ecd98

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.
