
Gilles Philippart

Software Practice Lead

Exercise 1: Your First Iceberg Table

Overview

Time to get your hands dirty. In this exercise, you'll spin up a local Iceberg environment, create your first table, and insert some data.

Learning Objectives

By the end of this exercise, you will be able to:

  • Start a local Iceberg environment with Spark, a REST catalog, and object storage
  • Create a namespace and table using Spark SQL
  • Insert data and verify it landed correctly

Prerequisites

  • Docker and Docker Compose installed
  • A terminal
  • About 15 minutes

Step 1: Start the Environment

First, create a directory, for example iceberg-course-exercises.

Open a terminal in this directory and create a docker-compose.yaml file with the following content. This configuration is identical to the official example provided on the Apache Iceberg website:

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001
  rest:
    image: apache/iceberg-rest-fixture
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: |
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
networks:
  iceberg_net:

That gives you four services: spark-iceberg is the Spark runtime you'll work in, rest is the Iceberg REST catalog, minio provides S3-compatible object storage, and mc is a helper container that creates the warehouse bucket on startup.

Next, create a spark-defaults.conf file in the same directory with the following contents:

# S3 Support via Maven packages
spark.jars.packages                    org.apache.hadoop:hadoop-aws:3.3.4

# Hadoop S3A Configuration for MinIO
spark.hadoop.fs.s3a.endpoint           http://minio:9000
spark.hadoop.fs.s3a.access.key         admin
spark.hadoop.fs.s3a.secret.key         password
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.connection.ssl.enabled false

Start everything up:

docker compose up -d

Wait about 30 seconds for all services to initialize. You can check that everything is healthy:

docker compose ps

You should see four containers running: iceberg-rest, minio, mc, and spark-iceberg.


Step 2: Connect to Spark SQL

Open a Spark SQL session:

docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

You should see the spark-sql> prompt. Let's verify our Iceberg catalog is available:

SHOW CATALOGS;

You should see demo in the list. That's the Iceberg REST catalog running in the rest container, and it's the catalog we'll use throughout this exercise.
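If you're curious how demo is wired to the REST service and MinIO, the spark-iceberg image ships with its own spark-defaults.conf. You can inspect individual settings straight from the SQL prompt; this is just a quick sanity check, and the exact keys and values assume the stock tabulario image configuration:

SET spark.sql.catalog.demo;
SET spark.sql.catalog.demo.type;
SET spark.sql.catalog.demo.uri;

If the image is configured as in the official quickstart, these should point at Iceberg's SparkCatalog class, a rest catalog type, and the REST service at http://rest:8181.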


Step 3: Create a Namespace

In Iceberg, tables live inside namespaces. Let's create one for our e-commerce data:

CREATE NAMESPACE demo.ecommerce;

Verify it exists:

SHOW NAMESPACES;

You should see ecommerce in the output.

Now set it as the current namespace:

USE demo.ecommerce;
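You can also ask the catalog to describe the namespace itself. This is a quick optional check; depending on the catalog configuration you may see a Location property resolving to a path under the s3://warehouse/ bucket that the mc container created:

DESCRIBE NAMESPACE demo.ecommerce;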

Step 4: Create the Orders Table

Now for the main event. We'll create a simple orders table:

CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    status STRING
)
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

A few things to note:

  • USING iceberg tells Spark this is an Iceberg table
  • We're storing data as Parquet files (the default, but it's good to be explicit)
  • format-version=2 enables row-level operations (UPDATE, DELETE, MERGE)
  • No partitioning yet—we'll cover that in a later exercise
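If you want to confirm the properties from that list actually took effect, you can print them back. Expect to also see a few defaults that Iceberg adds on its own:

SHOW TBLPROPERTIES orders;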

Verify the table was created:

DESCRIBE orders;

You should see something like:

col_name     data_type     comment
order_id     bigint
customer_id  bigint
order_date   date
total_amount decimal(10,2)
status       string
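DESCRIBE only shows the columns. For the full picture, including where the table lives in object storage and the properties it carries, you can have Spark reconstruct the DDL (the exact output varies a bit across Spark and Iceberg versions):

SHOW CREATE TABLE orders;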

Step 5: Insert Some Data

Let's add a few orders. Nothing fancy—just plain INSERT statements:

INSERT INTO orders VALUES
    (1001, 42, CAST('2024-01-15' AS DATE), 299.99, 'completed'),
    (1002, 17, CAST('2024-01-15' AS DATE), 149.50, 'completed'),
    (1003, 42, CAST('2024-01-16' AS DATE), 89.00, 'pending');

Add a couple more:

INSERT INTO orders VALUES
    (1004, 88, CAST('2024-01-16' AS DATE), 1250.00, 'completed'),
    (1005, 17, CAST('2024-01-17' AS DATE), 45.00, 'cancelled');

Each INSERT creates a new snapshot. That's going to matter a lot when we get to time travel, but for now, just keep it in the back of your mind.
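If you want to confirm that right away, you can count the snapshots before we inspect them properly in Step 7; the metadata table syntax here is the same one we'll use there:

SELECT COUNT(*) AS snapshot_count FROM demo.ecommerce.orders.snapshots;

After the two INSERTs above, this should return 2.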


Step 6: Query Your Data

Let's make sure everything landed correctly:

SELECT * FROM orders ORDER BY order_id;

You should see all five orders.

order_id	customer_id	order_date	total_amount	status
1001	42	2024-01-15	299.99	completed
1002	17	2024-01-15	149.50	completed
1003	42	2024-01-16	89.00	pending
1004	88	2024-01-16	1250.00	completed
1005	17	2024-01-17	45.00	cancelled

Try a slightly more interesting query—total revenue by customer:

SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY total_spent DESC;

You should get this result:

customer_id	order_count	total_spent
88	1	1250.00
42	1	299.99
17	1	149.50

Step 7: Peek Under the Hood

Here's where Iceberg gets interesting. Let's look at the snapshots we've created by querying the snapshots metadata table:

SELECT
    snapshot_id,
    committed_at,
    operation
FROM demo.ecommerce.orders.snapshots
ORDER BY committed_at;

Here's what I got:

snapshot_id	committed_at	operation
472356398657053538	2026-01-08 15:27:50.182	append
3188905721270494811	2026-01-08 15:27:55.681	append

You should see two snapshots—one for each INSERT statement. This is the foundation of time travel: every write operation creates an immutable snapshot of the table state.
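We'll cover time travel properly in a later exercise, but as a quick teaser you can already query the table as of either snapshot. Replace <snapshot_id> with a value from your own output; the IDs shown above are from my run and won't match yours:

SELECT * FROM demo.ecommerce.orders VERSION AS OF <snapshot_id>;

Using your first snapshot_id should return only the three rows from the first INSERT.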

Let's also check what data files were created:

SELECT
    file_path,
    record_count,
    file_size_in_bytes
FROM demo.ecommerce.orders.files;

Here's the result on my machine:

file_path	record_count	file_size_in_bytes
s3://warehouse/ecommerce/orders/data/00000-3-255d1806-0a53-4703-aaf3-24403d97adf2-0-00001.parquet	1	1601
s3://warehouse/ecommerce/orders/data/00001-4-255d1806-0a53-4703-aaf3-24403d97adf2-0-00001.parquet	1	1601
s3://warehouse/ecommerce/orders/data/00000-0-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet	1	1600
s3://warehouse/ecommerce/orders/data/00001-1-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet	1	1600
s3://warehouse/ecommerce/orders/data/00002-2-67dbaf8a-0c02-40e3-9acd-971dcf9c5c34-0-00001.parquet	1	1587

Each INSERT produced one Parquet file per Spark write task, which is why five rows turned into five tiny files. In a production system, you'd eventually compact these, but that's a topic for another exercise.
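We won't run compaction now, since it would change the snapshots you just looked at, but for reference, Iceberg's Spark integration exposes it as a stored procedure, roughly like this:

CALL demo.system.rewrite_data_files(table => 'ecommerce.orders');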


Step 8: Clean Up

When you're done exploring, exit Spark SQL:

exit;

And shut down the environment:

docker compose down -v

The -v flag removes the volumes, giving you a clean slate for the next exercise.


What You Learned

  • Iceberg tables are created through standard SQL DDL
  • Each INSERT operation creates a new snapshot
  • Metadata tables (.snapshots, .files) let you inspect what's happening under the hood
  • The REST catalog tracks table metadata, so your query engine always knows which snapshot and which files in object storage make up the current table state

Next Up

In Exercise 2, we'll explore schema evolution—adding, removing, and renaming columns without breaking downstream consumers.

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.
