course: Apache Iceberg™️

Hands-on Exercise: Schema Evolution

Gilles Philippart

Software Practice Lead

Exercise 2: Schema Evolution

Overview

In Exercise 1, you created a table with a fixed schema. But schemas change—it's a fact of life. A column that seemed optional at launch becomes mandatory six months later. That user_id integer runs out of headroom and needs to become a bigint. Marketing wants a loyalty_tier field. The GDPR team wants you to rename email to contact_email for some compliance reason.

In traditional data lakes, schema changes are painful. You either rewrite all your historical data (expensive and slow) or maintain complex ETL logic to handle different file formats (fragile and error-prone).

Iceberg takes a different approach. The schema is tracked in metadata, separate from the data files. You can add columns, rename them, reorder them, or widen their types—and old Parquet files just keep working. No rewrites. No compatibility nightmares.

In this exercise, you'll evolve a table's schema through several changes and see how Iceberg handles each one gracefully.

Learning Objectives

By the end of this exercise, you will be able to:

  • Add new columns to an existing table
  • Rename columns without breaking queries
  • Widen column types (e.g., INT to BIGINT)
  • Drop columns safely
  • Understand how Iceberg handles schema evolution across file formats

Prerequisites

  • Completed Exercise 1 (or have the environment running)
  • About 20 minutes

Step 1: Set Up

Start the environment if needed:

docker compose up -d

Run a Spark SQL console:

docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

Let's create a fresh table representing a product catalog. We'll start with a minimal schema and evolve it:

CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;

USE demo.ecommerce;

CREATE TABLE products (
    product_id INT,
    name STRING,
    price DECIMAL(10, 2)
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

Insert some initial products:

INSERT INTO products VALUES
    (1, 'Wireless Mouse', 29.99),
    (2, 'Mechanical Keyboard', 89.99),
    (3, 'USB-C Hub', 49.99),
    (4, '27-inch Monitor', 349.99),
    (5, 'Webcam HD', 79.99);

Verify your starting point:

SELECT * FROM products ORDER BY product_id;

Step 2: Add a New Column

The marketing team wants to track product categories. Let's add that column:

ALTER TABLE products
ADD COLUMN category STRING;

Check the schema:

DESCRIBE products;

Now query the data:

SELECT * FROM products ORDER BY product_id;
product_id	name	price	category
1	Wireless Mouse	29.99	NULL
2	Mechanical Keyboard	89.99	NULL
3	USB-C Hub	49.99	NULL
4	27-inch Monitor	349.99	NULL
5	Webcam HD	79.99	NULL

Notice that the category column shows NULL for all existing rows. Iceberg didn't rewrite any files—it just updated the metadata. When reading old Parquet files that don't have this column, Iceberg automatically fills in NULL.
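You can verify this yourself. Iceberg exposes a `files` metadata table listing the data files that back the table; query it right after the ALTER and you'll see the same files as before, with no new ones written for the schema change:

```sql
-- List the data files backing the table; adding a column doesn't touch them
SELECT file_path, record_count
FROM demo.ecommerce.products.files;
```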

Let's update some categories:

UPDATE products SET category = 'Peripherals' WHERE product_id IN (1, 2);
UPDATE products SET category = 'Accessories' WHERE product_id = 3;
UPDATE products SET category = 'Displays' WHERE product_id = 4;
UPDATE products SET category = 'Peripherals' WHERE product_id = 5;

And add a new product with the category already set:

INSERT INTO products VALUES
    (6, 'Laptop Stand', 59.99, 'Accessories');

Query again to see the mix:

SELECT * FROM products ORDER BY product_id;
product_id	name	price	category
1	Wireless Mouse	29.99	Peripherals
2	Mechanical Keyboard	89.99	Peripherals
3	USB-C Hub	49.99	Accessories
4	27-inch Monitor	349.99	Displays
5	Webcam HD	79.99	Peripherals
6	Laptop Stand	59.99	Accessories

Step 3: Rename a Column

The data governance team decided name is too generic. They want product_name instead:

ALTER TABLE products
    RENAME COLUMN name TO product_name;

Check the schema:

DESCRIBE products;
col_name	data_type	comment
product_id          	int
product_name        	string
price               	decimal(10,2)
category            	string

Query the data—notice you now use the new column name:

SELECT product_id, product_name, price FROM products ORDER BY product_id;
product_id	product_name	price
1	Wireless Mouse	29.99
2	Mechanical Keyboard	89.99
3	USB-C Hub	49.99
4	27-inch Monitor	349.99
5	Webcam HD	79.99
6	Laptop Stand	59.99

The old Parquet files still have a column called name internally, but Iceberg's metadata maps it to product_name. No file rewrites needed.
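One consequence worth noting: the old name is gone from the current schema, so any query that still references it will fail with an analysis error until it's updated:

```sql
-- This now fails: `name` no longer exists in the current schema
-- SELECT name FROM products;

-- The same data is available under the new name
SELECT product_name FROM products ORDER BY product_id;
```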


Step 4: Widen a Column Type

Remember that product_id is an INT? The company is growing fast, and someone's worried about running out of IDs. Let's widen it to BIGINT:

ALTER TABLE products ALTER COLUMN product_id TYPE BIGINT;

Verify the change:

DESCRIBE products;
col_name	data_type	comment
product_id          	bigint
product_name        	string
price               	decimal(10,2)
category            	string

Query still works:

SELECT * FROM products ORDER BY product_id;
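The reverse direction is not allowed. Iceberg only permits type changes that are guaranteed lossless, so attempting to narrow a column should be rejected (the exact error message varies by engine version):

```sql
-- Expected to fail: BIGINT -> INT is a narrowing and could lose data
ALTER TABLE products ALTER COLUMN product_id TYPE INT;
```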

Step 5: Drop a Column

After some deliberation, the team decides the category column should be tracked in a separate product taxonomy system. Let's remove it:

ALTER TABLE products
DROP COLUMN category;

Verify:

DESCRIBE products;
SELECT * FROM products ORDER BY product_id;

The column is gone from the schema. The data still exists in the Parquet files, but Iceberg no longer exposes it. If you ever need it back, you could time-travel to a snapshot before the drop.
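For example, to read the dropped column back, pick a snapshot_id committed before the DROP (the snapshots metadata table in the next step shows them all) and query that version. The `<snapshot_before_drop>` placeholder below is yours to fill in:

```sql
-- Replace <snapshot_before_drop> with a snapshot_id committed before the DROP
SELECT product_id, product_name, category
FROM products VERSION AS OF <snapshot_before_drop>;
```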


Step 6: Time Travel to See Schema History

Iceberg's time travel feature lets you query the table as it existed at any point in time—including with its old schema. First, let's see all snapshots:

SELECT
    snapshot_id,
    committed_at,
    operation
FROM demo.ecommerce.products.snapshots
ORDER BY committed_at;

Copy the snapshot_id from the first row (the initial INSERT). Now query the table as of that snapshot:

-- Replace <first_snapshot_id> with the actual snapshot_id from above
SELECT * FROM products VERSION AS OF <first_snapshot_id>;
product_id	name	price
1	Wireless Mouse	29.99
2	Mechanical Keyboard	89.99
3	USB-C Hub	49.99
4	27-inch Monitor	349.99
5	Webcam HD	79.99

Notice the schema! You'll see:

  • The column is still called name (not product_name)
  • The category column doesn't exist yet
  • The product_id column is INT (not BIGINT)

This is the original schema. Now query the current version:

SELECT * FROM products;

Although the two queries read the table through different schemas, Iceberg handles the mapping automatically: each data file is projected into the schema that was in effect for the snapshot being read.
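If you want to see the schema changes without time-traveling, the `metadata_log_entries` metadata table records each metadata file along with the schema id that was current at that point:

```sql
-- Each metadata change records which schema was current at the time
SELECT timestamp, latest_schema_id
FROM demo.ecommerce.products.metadata_log_entries
ORDER BY timestamp;
```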


Step 7: Clean Up

Exit Spark SQL:

exit;

If you're done for the day:

docker compose down -v

What You Learned

  • ADD COLUMN: New columns appear as NULL in existing data—no file rewrites
  • RENAME COLUMN: Changes the metadata mapping, not the physical files
  • ALTER COLUMN TYPE: Only safe widenings are allowed (e.g., INT→BIGINT, FLOAT→DOUBLE, widening DECIMAL precision)
  • DROP COLUMN: Hides the column from queries but data remains in files

Why This Matters

Traditional approaches to schema evolution often require:

  1. Creating a new table with the new schema
  2. Copying all data (potentially terabytes)
  3. Switching applications to the new table
  4. Dropping the old table

With Iceberg, schema changes are:

  • Instant: Metadata-only operations
  • Non-breaking: Old data files continue to work
  • Reversible: Time travel lets you see previous schemas

Next Up

In Exercise 3, we'll add partitioning to our table and see how Iceberg's hidden partitioning makes life easier for both writers and readers.

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.
