
In Exercise 1, you created a table with a fixed schema. But schemas change—it's a fact of life. A column that seemed optional at launch becomes mandatory six months later. That user_id integer runs out of headroom and needs to become a bigint. Marketing wants a loyalty_tier field. The GDPR team wants you to rename email to contact_email for some compliance reason.
In traditional data lakes, schema changes are painful. You either rewrite all your historical data (expensive and slow) or maintain complex ETL logic to handle different file formats (fragile and error-prone).
Iceberg takes a different approach. The schema is tracked in metadata, separate from the data files. You can add columns, rename them, reorder them, or widen their types—and old Parquet files just keep working. No rewrites. No compatibility nightmares.
In this exercise, you'll evolve a table's schema through several changes and see how Iceberg handles each one gracefully.
By the end of this exercise, you will be able to:
- Add a new column to an existing Iceberg table without rewriting data
- Rename a column safely, even though old data files still use the original name
- Widen a column's type (INT to BIGINT) in place
- Drop a column from the schema
- Use time travel to query the table as it existed under an earlier schema
Start the environment if needed:
docker compose up -d

Run a Spark SQL console:

docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

Let's create a fresh table representing a product catalog. We'll start with a minimal schema and evolve it:
CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;
USE demo.ecommerce;
CREATE TABLE products (
product_id INT,
name STRING,
price DECIMAL(10, 2)
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

Insert some initial products:
INSERT INTO products VALUES
(1, 'Wireless Mouse', 29.99),
(2, 'Mechanical Keyboard', 89.99),
(3, 'USB-C Hub', 49.99),
(4, '27-inch Monitor', 349.99),
(5, 'Webcam HD', 79.99);

Verify your starting point:
SELECT * FROM products ORDER BY product_id;

The marketing team wants to track product categories. Let's add that column:
ALTER TABLE products
ADD COLUMN category STRING;

Check the schema:
DESCRIBE products;

Now query the data:
SELECT * FROM products ORDER BY product_id;

product_id  name                 price   category
1           Wireless Mouse       29.99   NULL
2           Mechanical Keyboard  89.99   NULL
3           USB-C Hub            49.99   NULL
4           27-inch Monitor      349.99  NULL
5           Webcam HD            79.99   NULL

Notice that the category column shows NULL for all existing rows. Iceberg didn't rewrite any files; it just updated the metadata. When reading old Parquet files that don't have this column, Iceberg automatically fills in NULL.
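If you want to confirm that adding the column didn't touch any data files, you can inspect Iceberg's files metadata table, which lists every live data file in the table (exact paths and counts will vary in your environment):

```sql
-- Each row is one live data file. Running this before and after the
-- ALTER TABLE shows the same files: the schema change was metadata-only.
SELECT file_path, record_count
FROM demo.ecommerce.products.files;
```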
Let's update some categories:
UPDATE products SET category = 'Peripherals' WHERE product_id IN (1, 2);
UPDATE products SET category = 'Accessories' WHERE product_id = 3;
UPDATE products SET category = 'Displays' WHERE product_id = 4;
UPDATE products SET category = 'Peripherals' WHERE product_id = 5;

And add a new product with the category already set:
INSERT INTO products VALUES
(6, 'Laptop Stand', 59.99, 'Accessories');

Query again to see the mix:
SELECT * FROM products ORDER BY product_id;

product_id  name                 price   category
1           Wireless Mouse       29.99   Peripherals
2           Mechanical Keyboard  89.99   Peripherals
3           USB-C Hub            49.99   Accessories
4           27-inch Monitor      349.99  Displays
5           Webcam HD            79.99   Peripherals
6           Laptop Stand         59.99   Accessories

The data governance team decided name is too generic. They want product_name instead:
ALTER TABLE products
RENAME COLUMN name TO product_name;

Check the schema:
DESCRIBE products;

col_name      data_type      comment
product_id    int
product_name  string
price         decimal(10,2)
category      string

Query the data. Notice you now use the new column name:
SELECT product_id, product_name, price FROM products ORDER BY product_id;

product_id  product_name         price
1           Wireless Mouse       29.99
2           Mechanical Keyboard  89.99
3           USB-C Hub            49.99
4           27-inch Monitor      349.99
5           Webcam HD            79.99
6           Laptop Stand         59.99

The old Parquet files still have a column called name internally, but Iceberg's metadata maps it to product_name. No file rewrites needed.
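Renames are safe because Iceberg tracks every column by a numeric field ID rather than by name; the name is just a label in the metadata. The same mechanism makes column positioning a metadata-only change. As an optional aside, you can try it with a throwaway column (sku is hypothetical, used only for this demo; positional DDL requires a reasonably recent Iceberg/Spark version):

```sql
-- Add a hypothetical column in a specific position, then remove it
-- so the rest of the exercise matches the outputs shown.
ALTER TABLE products ADD COLUMN sku STRING AFTER product_id;
DESCRIBE products;   -- sku appears right after product_id
ALTER TABLE products DROP COLUMN sku;
```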
Remember that product_id is an INT? The company is growing fast, and someone's worried about running out of IDs. Let's widen it to BIGINT. Note that Iceberg allows only lossless type promotions, such as INT to BIGINT, FLOAT to DOUBLE, or increasing a DECIMAL's precision; narrowing a type is not permitted:

ALTER TABLE products ALTER COLUMN product_id TYPE BIGINT;

Verify the change:
DESCRIBE products;

col_name      data_type      comment
product_id    bigint
product_name  string
price         decimal(10,2)
category      string

Query still works:
SELECT * FROM products ORDER BY product_id;

After some deliberation, the team decides the category column should be tracked in a separate product taxonomy system. Let's remove it:
ALTER TABLE products
DROP COLUMN category;

Verify:
DESCRIBE products;
SELECT * FROM products ORDER BY product_id;

The column is gone from the schema. The data still exists in the Parquet files, but Iceberg no longer exposes it. If you ever need it back, you could time-travel to a snapshot before the drop.
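Beyond read-only time travel, Iceberg also ships Spark stored procedures for restoring earlier table state. One option is rollback_to_snapshot (shown here as a sketch only; don't run it now, since it would undo the changes you just made, and <snapshot_id> is a placeholder for a real ID from the snapshots metadata table):

```sql
-- Roll the whole table back to an earlier snapshot (metadata-only):
-- CALL demo.system.rollback_to_snapshot('ecommerce.products', <snapshot_id>);
```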
Iceberg's time travel feature lets you query the table as it existed at any point in time—including with its old schema. First, let's see all snapshots:
SELECT
snapshot_id,
committed_at,
operation
FROM demo.ecommerce.products.snapshots
ORDER BY committed_at;

Copy the snapshot_id from the first row (the initial INSERT). Now query the table as of that snapshot:
-- Replace <first_snapshot_id> with the actual snapshot_id from above
SELECT * FROM products VERSION AS OF <first_snapshot_id>;

product_id  name                 price
1           Wireless Mouse       29.99
2           Mechanical Keyboard  89.99
3           USB-C Hub            49.99
4           27-inch Monitor      349.99
5           Webcam HD            79.99

Notice the schema! You'll see:
- The column is still called name, not product_name
- There is no category column
- Only the original five products (product 6 didn't exist yet)
This is the original schema. Now query the current version:
SELECT * FROM products;

Even though the two queries use different schemas, Iceberg handles the mapping automatically: each snapshot records which schema was current when it was committed.
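Spark also supports timestamp-based time travel, which is handy when you know roughly when a change happened but not the snapshot ID. The timestamp below is illustrative; pick one that falls between the committed_at values in your snapshots table:

```sql
-- Query the table as it existed at a point in time.
SELECT * FROM products TIMESTAMP AS OF '2024-01-01 10:00:00';
```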
Exit Spark SQL:
exit;

If you're done for the day:
docker compose down -v

Traditional approaches to schema evolution often require:
- Rewriting all historical data (expensive and slow)
- Complex ETL logic to handle different file formats (fragile and error-prone)

With Iceberg, schema changes are:
- Metadata-only: no data files are rewritten
- Immediate: a single ALTER TABLE commits the new schema
- Compatible: old Parquet files keep working under the new schema
In Exercise 3, we'll add partitioning to our table and see how Iceberg's hidden partitioning makes life easier for both writers and readers.