
You've created tables, evolved schemas, partitioned data, time-traveled through snapshots, and used tags and branches. Everything works beautifully in development. But what happens after six months in production?
Here's the reality: streaming jobs that write every minute create thousands of tiny files. Each UPDATE and DELETE operation creates new snapshots. Old data files accumulate even after they're no longer referenced. Left unchecked, your table becomes a mess of small files that kill query performance and waste storage.
Iceberg provides several maintenance operations to keep tables healthy: compacting small data files (rewrite_data_files), expiring old snapshots (expire_snapshots), removing orphan files (remove_orphan_files), and rewriting position delete files (rewrite_position_delete_files).
In this exercise, you'll deliberately create a fragmented table, observe the performance impact, and then clean it up.
By the end of this exercise, you will be able to:
- Deliberately fragment a table with small writes, updates, and deletes, and inspect the damage through the files and snapshots metadata tables
- Compact data files and apply positional deletes with rewrite_data_files
- Expire old snapshots with expire_snapshots
- Identify orphan files and understand remove_orphan_files
- Clean up dangling delete files with rewrite_position_delete_files
Start the environment if needed:
docker compose up -d

Run Spark SQL with extra parameters to access data directly from the S3-compatible storage filesystem (MinIO):
docker compose exec -it spark-iceberg spark-sql \
--packages org.apache.hadoop:hadoop-aws:3.3.4 \
--conf "spark.hadoop.hive.cli.print.header=true" \
--conf "spark.hadoop.fs.s3a.endpoint=http://minio:9000" \
--conf "spark.hadoop.fs.s3a.path.style.access=true" \
--conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
--conf "spark.hadoop.fs.s3a.connection.ssl.enabled=false"

Create a fresh table:
CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;
USE demo.ecommerce;
DROP TABLE IF EXISTS events;
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP,
properties STRING
)
USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
'format-version'='2',
'write.format.default'='parquet',
'write.update.mode'='copy-on-write',
'write.delete.mode'='merge-on-read'
);

In production, fragmentation often comes from streaming ingestion—small batches arriving frequently. Let's simulate that:
-- Batch 1
INSERT INTO events VALUES
(1, 101, 'page_view', CAST('2025-01-15 10:00:00' AS TIMESTAMP), '{"page": "/home"}'),
(2, 102, 'page_view', CAST('2025-01-15 10:01:00' AS TIMESTAMP), '{"page": "/products"}');
-- Batch 2
INSERT INTO events VALUES
(3, 101, 'click', CAST('2025-01-15 10:02:00' AS TIMESTAMP), '{"button": "add_to_cart"}'),
(4, 103, 'page_view', CAST('2025-01-15 10:03:00' AS TIMESTAMP), '{"page": "/checkout"}');
-- Batch 3
INSERT INTO events VALUES
(5, 102, 'purchase', CAST('2025-01-15 10:05:00' AS TIMESTAMP), '{"amount": 99.99}'),
(6, 101, 'page_view', CAST('2025-01-15 10:06:00' AS TIMESTAMP), '{"page": "/confirmation"}');
-- Batch 4
INSERT INTO events VALUES
(7, 104, 'page_view', CAST('2025-01-15 10:10:00' AS TIMESTAMP), '{"page": "/home"}'),
(8, 104, 'click', CAST('2025-01-15 10:11:00' AS TIMESTAMP), '{"button": "signup"}');
-- Batch 5
INSERT INTO events VALUES
(9, 104, 'signup', CAST('2025-01-15 10:12:00' AS TIMESTAMP), '{"method": "email"}'),
(10, 105, 'page_view', CAST('2025-01-15 10:15:00' AS TIMESTAMP), '{"page": "/products"}');

That's 10 rows spread across 5 INSERT statements. Each INSERT creates a new data file.
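In a real pipeline, these micro-batches would come from a consumer loop rather than hand-typed SQL. Here's a hypothetical Python sketch of such a producer; the batch_insert_sql helper and the row shape are invented for illustration:

```python
# Hypothetical sketch: each micro-batch from a stream consumer becomes one
# INSERT, and each INSERT commit produces one small data file and one snapshot.
def batch_insert_sql(table, rows):
    """Render one micro-batch of event dicts as a single INSERT statement."""
    values = ",\n".join(
        f"({r['event_id']}, {r['user_id']}, '{r['event_type']}', "
        f"CAST('{r['event_time']}' AS TIMESTAMP), '{r['properties']}')"
        for r in rows
    )
    return f"INSERT INTO {table} VALUES\n{values};"

batch = [
    {"event_id": 1, "user_id": 101, "event_type": "page_view",
     "event_time": "2025-01-15 10:00:00", "properties": '{"page": "/home"}'},
]
print(batch_insert_sql("events", batch))
```

Five such batches mean five commits, five snapshots, and five tiny files.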
Let's see the damage:
SELECT
file_path,
record_count,
file_size_in_bytes
FROM demo.ecommerce.events.files;

You should see 5 files, each containing only 2 records. In a real system with millions of rows, you might have thousands of files with a few hundred records each. That's a problem because:
- Every file read adds fixed overhead: an object-store request, a Parquet footer parse, and a metadata entry to track
- Query planning slows down as the metadata has to list and prune every file
- Tiny files throw away the compression and scan efficiency of columnar formats
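A back-of-envelope sketch makes the cost concrete. The 20 ms per-file overhead and 1,000 ms data-streaming time below are invented numbers, but the shape of the math holds for any object store:

```python
# Back-of-envelope sketch (hypothetical numbers): scanning the same data
# spread over many small files vs a few large ones.
def scan_cost_ms(num_files, per_file_overhead_ms=20, data_ms=1000):
    # per_file_overhead_ms: request latency + Parquet footer read, per file
    # data_ms: time to stream the actual row data, independent of layout
    return num_files * per_file_overhead_ms + data_ms

fragmented = scan_cost_ms(10_000)   # 10,000 tiny files
compacted  = scan_cost_ms(10)       # same rows in 10 large files
print(fragmented, compacted)        # 201000 1200
```

The per-file fixed cost dominates completely once file counts explode.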
Check the snapshots too:
SELECT
snapshot_id,
committed_at,
operation,
summary['added-data-files'] AS files_added,
summary['total-records'] AS total_records
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You should see something like this:
snapshot_id committed_at operation files_added total_records
8784372252740323981 2026-01-08 16:57:39.894 append 1 2
6087226536736430910 2026-01-08 16:57:40.06 append 1 4
6213127806288664064 2026-01-08 16:57:40.219 append 1 6
3242568885727246813 2026-01-08 16:57:40.419 append 1 8
2436576844827260458 2026-01-08 16:57:40.603 append 1 10

Five snapshots, one per INSERT.
Now let's run some UPDATEs. With copy-on-write mode, each UPDATE reads the file containing the target row(s), writes a new file with the changes, and creates a new snapshot pointing to the new file:
-- Fix a typo in event type
UPDATE events
SET event_type = 'page_view'
WHERE event_id = 3;
-- Update properties
UPDATE events
SET properties = '{"page": "/home", "referrer": "google"}'
WHERE event_id = 1;
-- Mark an event as processed
UPDATE events
SET properties = '{"amount": 99.99, "processed": true}'
WHERE event_id = 5;

Check the snapshot history:
SELECT
snapshot_id,
committed_at,
operation,
summary['total-data-files'] AS data_files
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You'll see 3 new snapshots (one per UPDATE), bringing the total to 8. Each UPDATE replaced a data file with an updated version. The current snapshot still shows ~5 data files, but the old versions of those files are no longer part of it; they're kept on disk only because previous snapshots still reference them.
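The copy-on-write mechanics can be pictured with a toy Python model. This is an illustration, not Iceberg's implementation: a snapshot is a list of immutable files, and an update rewrites any whole file containing a matching row:

```python
# Toy model of copy-on-write: an update never mutates a file in place;
# it writes a replacement file and the new snapshot swaps the old one out.
def cow_update(snapshot, predicate, apply):
    """Return a new snapshot; files holding matching rows are fully rewritten."""
    new_files = []
    for rows in snapshot:               # each element models one immutable data file
        if any(predicate(r) for r in rows):
            new_files.append([apply(r) if predicate(r) else r for r in rows])
        else:
            new_files.append(rows)      # untouched files are reused as-is
    return new_files

s0 = [[{"id": 1, "type": "click"}, {"id": 2, "type": "view"}]]
s1 = cow_update(s0, lambda r: r["id"] == 1,
                lambda r: {**r, "type": "page_view"})
print(s1[0][0]["type"])  # the whole two-row file was rewritten to change one row
```

Note that s0 is untouched afterwards, which is exactly why old file versions pile up: previous snapshots still point at them.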
Now let's see what happens when you delete rows. Iceberg uses positional delete files to track deleted rows without rewriting data files:
-- Delete some events
DELETE FROM events
WHERE event_id IN (7, 8);
-- Delete another event
DELETE FROM events
WHERE event_id = 10;

Now let's inspect the files and see delete files appear:
SELECT
content,
file_path,
record_count,
file_size_in_bytes
FROM demo.ecommerce.events.files
ORDER BY content, file_path;

Look at the content column: you should see new files with content = 1. These are delete files that record which positions (row numbers) in data files have been deleted.
Key insight: With merge-on-read mode, Iceberg doesn't immediately rewrite data files when you delete rows. Instead, it creates small delete files that query engines must read and apply. This is efficient for small deletes but accumulates overhead over time.
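That read-time merge can be modeled with a few lines of Python. This is a toy model of positional deletes, not Iceberg internals:

```python
# Toy model of merge-on-read: deletes write (file_path, position) tombstones,
# and readers must merge them against the data files at query time.
def read_with_deletes(data_files, delete_files):
    deleted = set()
    for df in delete_files:                      # each df is a list of (path, pos)
        deleted.update(df)
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in deleted:       # skip tombstoned positions
                yield row

data = {"f1.parquet": [7, 8], "f2.parquet": [9, 10]}
deletes = [[("f1.parquet", 0), ("f1.parquet", 1)],  # DELETE events 7 and 8
           [("f2.parquet", 1)]]                     # DELETE event 10
print(list(read_with_deletes(data, deletes)))       # [9]
```

The write is cheap, but every subsequent read pays for the merge, which is the accumulating overhead compaction removes.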
Check how many delete files we have:
SELECT
content,
CASE content
WHEN 0 THEN 'DATA'
WHEN 1 THEN 'POSITION_DELETES'
WHEN 2 THEN 'EQUALITY_DELETES'
END AS file_type,
COUNT(*) AS file_count,
SUM(record_count) AS total_records
FROM demo.ecommerce.events.files
GROUP BY content
ORDER BY content;

You should get:
content file_type file_count total_records
0 DATA 5 10
1 POSITION_DELETES 2 3

Time to clean up. We need to compact the data files AND apply the delete files. The key is using the right options:
CALL system.rewrite_data_files(
table => 'demo.ecommerce.events',
options => map(
'rewrite-all', 'true'
)
);

The 'rewrite-all' => 'true' option forces rewriting of all data files regardless of size.
This is the result you should see:
rewritten_data_files_count added_data_files_count rewritten_bytes_count failed_data_files_count
5 1 8922 0

The rewrite_data_files procedure applies the positional deletes, physically removing the 3 deleted rows. This is the process colloquially known as "compaction".
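Conceptually, this compaction pass does two things at once: it drops rows covered by positional deletes and bin-packs the survivors into fewer, larger files. A toy sketch; the target_rows knob is invented (real compaction targets bytes via properties like write.target-file-size-bytes):

```python
# Simplified sketch of compaction: merge live rows from many small files
# into few files of roughly target size, applying positional deletes.
def compact(files, deleted_positions, target_rows=1000):
    live = []
    for path, rows in files.items():
        live.extend(r for pos, r in enumerate(rows)
                    if (path, pos) not in deleted_positions)
    # bin-pack surviving rows into new files of up to target_rows each
    return [live[i:i + target_rows] for i in range(0, len(live), target_rows)]

files = {"f1": [1, 2], "f2": [3, 4], "f3": [5, 6], "f4": [7, 8], "f5": [9, 10]}
deleted = {("f4", 0), ("f4", 1), ("f5", 1)}       # events 7, 8 and 10
new_files = compact(files, deleted)
print(len(new_files), len(new_files[0]))           # 1 7
```

Five two-row files plus three tombstones become a single seven-row file, matching the 5-rewritten/1-added result above.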
Let's verify what we have after compaction:
SELECT
content,
file_path,
record_count,
file_size_in_bytes
FROM demo.ecommerce.events.files;

Wait, the delete files are still there? Yes! But notice the data file has only 7 rows: the deletes were physically applied during compaction. The delete files you see are now dangling, referencing the old data files that were replaced. Don't worry, we'll clean them up in later steps.
Verify the data:
SELECT * FROM events ORDER BY event_id;

You should see 7 rows (10 original - 3 deleted). The dangling delete files don't affect query results—they're just metadata remnants that will be cleaned up soon.
Compaction helped with files, but look at the snapshots:
SELECT
snapshot_id,
committed_at,
operation
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

You now have 11 snapshots: 5 INSERTs (append operations), 3 UPDATEs (overwrite operations), 2 DELETEs (overwrite operations), and 1 compaction (replace operation). In production, after months of operations, you might have thousands.
Remember: each snapshot keeps old data files from being deleted (for time travel), but it also takes up space in the metadata and slows down metadata operations.
If you don't need to time-travel back to every single state, you can expire old snapshots. First, check your latest snapshot timestamp:
SELECT MAX(committed_at) FROM demo.ecommerce.events.snapshots;

Then expire all old snapshots, keeping only the most recent one:
CALL system.expire_snapshots(
table => 'demo.ecommerce.events',
older_than => TIMESTAMP '2026-01-09 12:00:00', -- replace with the max timestamp from above
retain_last => 1 -- Always keep at least this many snapshots.
);

The older_than parameter tells Iceberg to expire snapshots committed before this timestamp, while retain_last ensures you always keep at least 1 snapshot (even if it's older than the timestamp). In production, you'd calculate the timestamp dynamically (e.g., current date minus 7 days) to keep a week of history.
This procedure should return:
deleted_data_files_count deleted_position_delete_files_count deleted_equality_delete_files_count deleted_manifest_files_count deleted_manifest_lists_count deleted_statistics_files_count
8 0 0 11 10 0

Check what's left:
SELECT
snapshot_id,
committed_at,
operation
FROM demo.ecommerce.events.snapshots
ORDER BY committed_at;

Only the most recent snapshot remains. The old ones are gone.
Warning: Once snapshots are expired, you can't time-travel to them anymore. The retention count parameter protects recent snapshots.
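A practical note: in production you wouldn't hardcode the timestamp. Here's a sketch of how a nightly job might render the expire call; the 7-day window and retain_last value are example choices, not recommendations:

```python
# Sketch: compute the older_than cutoff dynamically for a scheduled
# maintenance job instead of hardcoding a timestamp.
from datetime import datetime, timedelta, timezone

def expire_snapshots_sql(table, keep_days=7, retain_last=5, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=keep_days)).strftime("%Y-%m-%d %H:%M:%S")
    return (f"CALL system.expire_snapshots(table => '{table}', "
            f"older_than => TIMESTAMP '{cutoff}', retain_last => {retain_last});")

print(expire_snapshots_sql("demo.ecommerce.events",
                           now=datetime(2026, 1, 16, tzinfo=timezone.utc)))
```

Run daily, this keeps a rolling week of time-travel history while guaranteeing a few snapshots always survive.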
Snapshot expiration removes the metadata for old snapshots, but the actual data files they referenced are still on disk. These are now orphans—files that no snapshot references.
Let's check for orphans by comparing files in storage vs. files in metadata:
-- Files currently referenced by the table
SELECT COUNT(*) AS referenced_files FROM demo.ecommerce.events.files;

The orphan files are still in MinIO, taking up space. You can check in a different terminal by running:
docker compose exec minio mc ls /data/warehouse/ecommerce/events/data/event_time_day=2025-01-15

On my machine I got:
[2026-01-15 14:44:56 UTC] 4.0KiB 00000-33-f6241d16-c455-46c9-a326-536296bdbb70-00001-deletes.parquet/
[2026-01-15 14:44:56 UTC] 4.0KiB 00000-35-94c7b692-f74f-4dcf-a891-dcfacc87d1fd-00001-deletes.parquet/
[2026-01-15 14:46:41 UTC] 4.0KiB 00000-41-3afb190b-f9c2-433b-b8df-d335c3f5665c-0-00001.parquet/

To clean them up, we would normally use:
CALL system.remove_orphan_files(
table => 'demo.ecommerce.events',
older_than => TIMESTAMP 'xxxx-xx-xx 00:00:00' -- must be at least 24 hours ago
);

Now, Iceberg is cautious by default. The cleanup procedure enforces a minimum 24-hour waiting period before it actually deletes files. This is a safety measure to make sure we don't accidentally toss out data that a long-running query might still be reading.
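Under the hood, orphan detection boils down to a set difference plus that grace period. A toy model (heavily simplified; real listings and snapshot manifests are more involved):

```python
# Toy model of orphan detection: files present in storage minus files
# referenced by ANY retained snapshot, filtered by a grace period.
def find_orphans(files_in_storage, snapshots, grace_cutoff):
    referenced = set()
    for snapshot_files in snapshots:        # every retained snapshot's file list
        referenced.update(snapshot_files)
    # the 24h-style grace period: only flag files older than the cutoff,
    # so in-flight writes and long-running reads aren't broken
    return {path for path, mtime in files_in_storage.items()
            if path not in referenced and mtime < grace_cutoff}

storage = {"data/a.parquet": 100, "data/b-deletes.parquet": 100,
           "data/c.parquet": 500}           # path -> modification time
snapshots = [{"data/c.parquet"}]            # only c is still referenced
print(sorted(find_orphans(storage, snapshots, grace_cutoff=200)))
# ['data/a.parquet', 'data/b-deletes.parquet']
```

Raising grace_cutoff is the equivalent of pushing older_than further back: fewer files qualify, and nothing recent is ever touched.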
If you really want to see files removed and don't want to wait until tomorrow, there is a workaround using the Action API directly, but that’s a bit outside the scope of this exercise (it involves launching a spark-shell instead of spark-sql with the parameters at the beginning of the exercise).
For now, let’s focus on those two dangling delete files still hanging around in our metadata. Even though the data they pointed to is gone after our compaction and snapshot expiration, the metadata doesn't know that yet.
To tidy things up, we need to rewrite the position delete files:
CALL system.rewrite_position_delete_files(
table => 'demo.ecommerce.events',
options => map('rewrite-all', 'true')
);

The 'rewrite-all' => 'true' option forces the procedure to process all delete files, including dangling ones. It should return:
rewritten_delete_files_count added_delete_files_count rewritten_bytes_count added_bytes_count
2 0 3217 0

The 2 dangling delete files have been consolidated into nothing since they don't reference any existing data, effectively removing them from the metadata.
Now let's verify everything is clean:
-- Should have 7 records
SELECT COUNT(*) AS total_records FROM demo.ecommerce.events;
-- Should have 1 data file (no more delete files!)
SELECT
content,
CASE content
WHEN 0 THEN 'DATA'
WHEN 1 THEN 'POSITION_DELETES'
WHEN 2 THEN 'EQUALITY_DELETES'
END AS file_type,
COUNT(*) AS file_count
FROM demo.ecommerce.events.files
GROUP BY content
ORDER BY content;
-- Should have 2 snapshots of type "replace"
SELECT COUNT(*) AS snapshot_count FROM demo.ecommerce.events.snapshots;

Perfect! You've completed the maintenance workflow: you compacted the small files, physically applied the deletes, expired the old snapshots, and cleaned up the dangling delete files.
The table is now compact and clean!
Exit Spark SQL:
exit;

If you're done:
docker compose down -v

You've seen a complete maintenance sequence: rewrite_data_files to compact small files and apply deletes, expire_snapshots to prune snapshot history, remove_orphan_files to reclaim unreferenced storage, and rewrite_position_delete_files to drop dangling delete files.
Congratulations! You've completed the Iceberg fundamentals series.
For further exploration, consider looking into how to move data from a streaming platform such as Kafka and Flink into Iceberg tables. The simplest and most performant way to do this is to use Confluent Tableflow.
If you want to keep experimenting with open source tools, I've written a blog series that describes how to do it using Apache Kafka, Apache Flink, and Apache Polaris: https://medium.com/@gilles.philippart/build-a-data-lakehouse-with-apache-iceberg-polaris-trino-minio-349c534ecd98