
Time to get your hands dirty. In this exercise, you'll spin up a local Iceberg environment, create your first table, and insert some data.
By the end of this exercise, you will be able to:

- Spin up a local Iceberg environment with Docker Compose
- Create a namespace and an Iceberg table with Spark SQL
- Insert data and inspect the resulting snapshots and data files
First, create a directory, for example iceberg-course-exercises.
Open a terminal in this directory and create a docker-compose.yaml file with the following content. This configuration is identical to the official example provided on the Apache Iceberg website:
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001
  rest:
    image: apache/iceberg-rest-fixture
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: |
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
networks:
  iceberg_net:

Next, create a spark-defaults.conf file in the same directory with the following contents:
# S3 Support via Maven packages
spark.jars.packages org.apache.hadoop:hadoop-aws:3.3.4
# Hadoop S3A Configuration for MinIO
spark.hadoop.fs.s3a.endpoint http://minio:9000
spark.hadoop.fs.s3a.access.key admin
spark.hadoop.fs.s3a.secret.key password
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled false

Start everything up:
docker compose up -d

Wait about 30 seconds for all services to initialize. You can check that everything is healthy:
docker compose ps

You should see four containers running: iceberg-rest, minio, mc, and spark-iceberg.
Open a Spark SQL session:
docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

You should see the spark-sql> prompt. Let's verify our Iceberg catalog is available:
SHOW CATALOGS;

You should see demo in the list. For this exercise, we'll use demo, the Iceberg REST catalog configured in the compose file, which Spark accesses directly.
In Iceberg, tables live inside namespaces. Let's create one for our e-commerce data:
CREATE NAMESPACE demo.ecommerce;

Verify it exists:
SHOW NAMESPACES;

You should see ecommerce in the output.
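If you want to see where the namespace lives in object storage, Spark can describe it. This is an optional aside; the exact properties shown depend on the catalog:

```sql
-- Show the namespace's location and properties (output varies by catalog)
DESCRIBE NAMESPACE EXTENDED demo.ecommerce;
```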
Now set it as the current database:
USE demo.ecommerce;

Now for the main event. We'll create a simple orders table:
CREATE TABLE orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount DECIMAL(10, 2),
status STRING
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

A few things to note:

- USING iceberg tells Spark to create an Iceberg table rather than a plain Parquet table.
- 'format-version'='2' opts into the v2 table spec, which supports row-level deletes and updates.
- 'write.format.default'='parquet' is already the default; we set it here to be explicit.
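You can confirm the properties were applied by asking Spark to list them:

```sql
-- List table properties, including format-version and the write format default
SHOW TBLPROPERTIES orders;
```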
Verify the table was created:
DESCRIBE orders;

You should see something like:
col_name data_type comment
order_id bigint
customer_id bigint
order_date date
total_amount decimal(10,2)
status string

Let's add a few orders. Nothing fancy—just plain INSERT statements:
INSERT INTO orders VALUES
(1001, 42, CAST('2024-01-15' AS DATE), 299.99, 'completed'),
(1002, 17, CAST('2024-01-15' AS DATE), 149.50, 'completed'),
(1003, 42, CAST('2024-01-16' AS DATE), 89.00, 'pending');

Add a couple more:
INSERT INTO orders VALUES
(1004, 88, CAST('2024-01-16' AS DATE), 1250.00, 'completed'),
(1005, 17, CAST('2024-01-17' AS DATE), 45.00, 'cancelled');Each INSERT creates a new snapshot. That's going to matter a lot when we get to time travel, but for now, just keep it in the back of your mind.
Let's make sure everything landed correctly:
SELECT * FROM orders ORDER BY order_id;

You should see all five orders.
order_id customer_id order_date total_amount status
1001 42 2024-01-15 299.99 completed
1002 17 2024-01-15 149.50 completed
1003 42 2024-01-16 89.00 pending
1004 88 2024-01-16 1250.00 completed
1005 17 2024-01-17 45.00 cancelled

Try a slightly more interesting query: total revenue by customer.
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY total_spent DESC;

You should get this result:
customer_id order_count total_spent
88 1 1250.00
42 1 299.99
17 1 149.50

Here's where Iceberg gets interesting. Let's look at the snapshots we've created by querying the snapshots metadata table:
SELECT
snapshot_id,
committed_at,
operation
FROM demo.ecommerce.orders.snapshots
ORDER BY committed_at;

Here's what I got:
snapshot_id committed_at operation
472356398657053538 2026-01-08 15:27:50.182 append
3188905721270494811 2026-01-08 15:27:55.681 append

You should see two snapshots—one for each INSERT statement. This is the foundation of time travel: every write operation creates an immutable snapshot of the table state.
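A related metadata table, history, shows when each snapshot became the current table state and whether it is still an ancestor of the current state (useful after rollbacks):

```sql
-- Snapshot lineage: which snapshots were ever the current table state
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM demo.ecommerce.orders.history
ORDER BY made_current_at;
```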
Let's also check what data files were created:
SELECT
file_path,
record_count,
file_size_in_bytes
FROM demo.ecommerce.orders.files;

Here's the result on my machine:
file_path record_count file_size_in_bytes
s3://warehouse/ecommerce/orders/data/00000-3-255d1806-0a53-4703-aaf3-24403d97adf2-0-00001.parquet 1 1601
s3://warehouse/ecommerce/orders/data/00001-4-255d1806-0a53-4703-aaf3-24403d97adf2-0-00001.parquet 1 1601
s3://warehouse/ecommerce/orders/data/00000-0-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet 1 1600
s3://warehouse/ecommerce/orders/data/00001-1-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet 1 1600
s3://warehouse/ecommerce/orders/data/00002-2-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet 1 1587

Each INSERT created multiple small Parquet files, one per Spark write task (here, one per row). In a production system, you'd eventually compact these, but that's a topic for another exercise.
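If you're curious what compaction looks like, Iceberg ships a Spark stored procedure for it. This is just a sketch for reference, not something you need to run in this exercise:

```sql
-- Rewrite small data files into larger ones (bin-packing is the default strategy)
CALL demo.system.rewrite_data_files(table => 'ecommerce.orders');
```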
When you're done exploring, exit Spark SQL:
exit;

And shut down the environment:
docker compose down -v

The -v flag removes the volumes, giving you a clean slate for the next exercise.
In Exercise 2, we'll explore schema evolution—adding, removing, and renaming columns without breaking downstream consumers.