
Gilles Philippart

Software Practice Lead

Exercise 3: Partitioning

Overview

You've created tables and evolved their schemas. That's fine for small datasets, but what happens when you've got five million rows? Or five billion? At some point, scanning every single file just to find last Tuesday's orders becomes... let's call it "suboptimal."

Partitioning solves this by organizing your data into logical groups. Iceberg takes it a step further with hidden partitioning—you define the partitioning strategy once, and Iceberg handles the rest. No more partition columns cluttering up your queries. No more WHERE year=2025 AND month=01 gymnastics.

In this exercise, you'll create a partitioned table, generate enough data to make partitioning worthwhile, and see exactly how Iceberg prunes partitions at query time.

Learning Objectives

By the end of this exercise, you will be able to:

  • Create a table with hidden partitioning using partition transforms
  • Generate realistic test data using SQL
  • Inspect partitions using metadata tables
  • Verify partition pruning with EXPLAIN

Prerequisites

  • Completed Exercise 2 (or have the environment running)
  • About 20 minutes

Step 1: Start the Environment

If it's not already running:

docker compose up -d

Run Spark SQL:

docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"

Create the namespace if you don't have it:

CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;
USE demo.ecommerce;

Step 2: Create a Partitioned Table

This time, we're adding a partitioning clause. We'll partition by month, derived from the order date:

CREATE TABLE orders_partitioned (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    status STRING
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');

That months(order_date) is a partition transform. Iceberg supports several of these: years(), months(), days(), and hours() for dates and timestamps, bucket() for hash distribution, and truncate() for strings and numbers. The beauty is that your queries don't need to know about them; you just query order_date like normal.
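
You don't have to pick just one transform, either. As an illustrative sketch (not part of this exercise; the table name and columns are made up), a single table can combine several:

```sql
-- Hypothetical table combining multiple Iceberg partition transforms
CREATE TABLE events_example (
    event_id BIGINT,
    user_id BIGINT,
    event_ts TIMESTAMP,
    country STRING
)
USING iceberg
PARTITIONED BY (
    days(event_ts),        -- one partition per day
    bucket(16, user_id),   -- hash user_id into 16 buckets
    truncate(2, country)   -- group countries by their first two characters
);
```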

You can check the detailed table definition, including the partition spec, with:

SHOW CREATE TABLE orders_partitioned;

Step 3: Generate Test Data

Five rows won't cut it for demonstrating partitioning. Let's generate 1,000 orders spread across six months.

Spark's explode() and sequence() functions help us generate test data:

INSERT INTO orders_partitioned
SELECT
    row_num AS order_id,
    1 + (row_num % 100) AS customer_id,
    date_add(DATE '2025-01-01', row_num % 180) AS order_date,
    ROUND(CAST((10 + (row_num % 500)) AS DECIMAL(10,2)) + (row_num % 100) / 100.0, 2) AS total_amount,
    CASE (row_num % 4)
        WHEN 0 THEN 'completed'
        WHEN 1 THEN 'completed'
        WHEN 2 THEN 'pending'
        ELSE 'cancelled'
    END AS status
FROM (SELECT explode(sequence(1, 1000)) AS row_num);

Let's break down what's happening:

  • sequence(1, 1000) generates an array [1, 2, 3, ..., 1000]
  • explode() expands that into 1,000 rows
  • row_num % 180 spreads orders across 180 days (January through June)
  • row_num % 100 gives us 100 different customers
  • The status distribution is roughly 50% completed, 25% pending, 25% cancelled
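
You can also sanity-check the status distribution. Since row_num runs from 1 to 1000, the modulo buckets come out exact:

```sql
SELECT status, COUNT(*) AS order_count
FROM orders_partitioned
GROUP BY status
ORDER BY order_count DESC;
```

You should see exactly 500 completed, 250 pending, and 250 cancelled orders.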

Verify the data landed:

SELECT COUNT(*) FROM orders_partitioned;

Check the distribution across months:

SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count
FROM orders_partitioned
GROUP BY 1
ORDER BY 1;

You should see orders spread across January through June 2025.

Here's what I got:

month	order_count
2025-01-01 00:00:00	185
2025-02-01 00:00:00	168
2025-03-01 00:00:00	186
2025-04-01 00:00:00	161
2025-05-01 00:00:00	155
2025-06-01 00:00:00	145

Step 4: Explore the Partitions

Now let's see how Iceberg organized our data. The partitions metadata table shows exactly what partitions exist:

SELECT * FROM demo.ecommerce.orders_partitioned.partitions;

You should see six partitions, one for each month. The partition values range from 660 to 665, the number of months since January 1970 (the epoch Iceberg uses for the months() transform). Notice the record_count and file_count columns: each partition tracks its own statistics.

Here's what I got:

       partition        | record_count | file_count | total_size |                                               >
------------------------+--------------+------------+------------+----------------------------------------------->
 {order_date_month=660} |          185 |          1 |       2162 | {order_id={min=1, max=930, null_count=0, nan_c>
 {order_date_month=663} |          161 |          1 |       1958 | {order_id={min=90, max=1000, null_count=0, nan>
 {order_date_month=664} |          155 |          1 |       1957 | {order_id={min=120, max=870, null_count=0, nan>
 {order_date_month=661} |          168 |          1 |       1934 | {order_id={min=31, max=958, null_count=0, nan_>
 {order_date_month=662} |          186 |          1 |       2100 | {order_id={min=59, max=989, null_count=0, nan_>
 {order_date_month=665} |          145 |          1 |       1894 | {order_id={min=151, max=899, null_count=0, nan>
(6 rows)
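
You can verify the months-since-epoch arithmetic yourself with Spark's built-in months_between() function:

```sql
-- March 2025 is 662 months after January 1970:
-- (2025 - 1970) * 12 + (3 - 1) = 662
SELECT CAST(months_between(DATE '2025-03-01', DATE '1970-01-01') AS INT) AS march_partition;
```

This returns 662, matching the order_date_month=662 partition above.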

Let's also look at the files:

SELECT
    partition,
    file_path,
    record_count,
    file_size_in_bytes
FROM demo.ecommerce.orders_partitioned.files;

You'll see that each partition has its own data file, with the month visible in the file path:

partition	file_path	record_count	file_size_in_bytes
{"order_date_month":664}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-05/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00004.parquet	155	2779
{"order_date_month":665}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-06/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00002.parquet	145	2715
{"order_date_month":662}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-03/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00006.parquet	186	2923
{"order_date_month":663}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-04/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00001.parquet	161	2778
{"order_date_month":660}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-01/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00003.parquet	185	2937
{"order_date_month":661}	s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-02/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00005.parquet	168	2757

Step 5: Partition Pruning in Action

Here's where the magic happens. When you query a specific time range, Iceberg skips partitions that can't possibly contain matching data.

Let's verify partition pruning with actual row counts. Start with a full table scan:

SELECT COUNT(*) AS total_orders
FROM orders_partitioned;
total_orders
1000

Now let's see what happens when you filter for just March:

SELECT COUNT(*) AS march_orders
FROM orders_partitioned
WHERE order_date >= DATE '2025-03-01'
  AND order_date < DATE '2025-04-01';
march_orders
 186

So, 186 rows for March out of 1000 in total.

Let's use EXPLAIN COST to see the difference in what Iceberg plans to scan. First, the full table:

EXPLAIN COST
SELECT * FROM orders_partitioned;

Look at the Statistics line in the output:

Statistics(sizeInBytes=46.9 KiB, rowCount=1.00E+3)

This shows Spark plans to scan ~47 KiB across all 1,000 rows (all six partitions).

Now compare with the filtered query:

EXPLAIN COST
SELECT * FROM orders_partitioned
WHERE order_date >= DATE '2025-03-01'
  AND order_date < DATE '2025-04-01';

You should see something like:

Statistics(sizeInBytes=8.7 KiB, rowCount=186)

This is partition pruning in action! The filtered query scans only ~8.7 KiB and 186 rows: just the March partition. Iceberg automatically determined that only partition 662 (March 2025) could contain matching rows, so it skipped the other five partitions entirely.

In a real data lake with terabytes of data, this is the difference between a 30-second query and a 3-hour query.


Step 6: The Hidden Partitioning Advantage

Here's something subtle but important. Notice that in all our queries above, we filtered on order_date, not on any partition column. Let's demonstrate this with another example:

-- Query for a single day in March
SELECT COUNT(*) AS single_day_orders
FROM orders_partitioned
WHERE order_date = DATE '2025-03-15';

You should get 6 orders for that specific day. Now try EXPLAIN COST on this query:

EXPLAIN COST
SELECT COUNT(*) AS single_day_orders
FROM orders_partitioned
WHERE order_date = DATE '2025-03-15';

Notice the Statistics still show rowCount=186—the same as when we queried the entire month! This is because partition pruning works at the partition level. Since we partitioned by months(order_date), Iceberg prunes to the March partition (all 186 rows), then filters down to the specific day during the scan.

Behind the scenes, Iceberg:

  1. Sees the filter order_date = DATE '2025-03-15'
  2. Applies the partition transform: months(DATE '2025-03-15') = 662 (March 2025)
  3. Scans only the March partition file

The key insight: you didn't need to specify WHERE partition = 662 or WHERE month = 'March'. You simply filtered on the actual date column, and Iceberg automatically determined which partition to read.

This is hidden partitioning. The partition structure is an implementation detail, not something your queries need to worry about. Compare that to Hive-style partitioning, where you'd need to:

  1. Add a partition_month column to your table
  2. Populate it correctly on every write
  3. Include it in every query's WHERE clause

With Iceberg, you define the transform once, months(order_date), and forget about it.
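
To make the contrast concrete, here's a rough sketch of the Hive-style equivalent (illustrative only; the orders_hive and staging_orders names are made up, and you don't need to run this):

```sql
-- Hive-style: the partition column is part of the schema...
CREATE TABLE orders_hive (
    order_id BIGINT,
    order_date DATE,
    total_amount DECIMAL(10, 2)
)
PARTITIONED BY (order_month STRING);

-- ...you must populate it correctly on every write...
INSERT INTO orders_hive PARTITION (order_month = '2025-03')
SELECT order_id, order_date, total_amount
FROM staging_orders
WHERE order_date >= DATE '2025-03-01' AND order_date < DATE '2025-04-01';

-- ...and every query must repeat it, or pruning won't happen:
SELECT COUNT(*)
FROM orders_hive
WHERE order_month = '2025-03'
  AND order_date = DATE '2025-03-15';
```

Get that month value wrong in any one of those places and you either scan the whole table or, worse, write data into the wrong partition.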


Step 7: Clean Up

Exit Spark SQL:

exit;

If you're done for the day:

docker compose down -v

What You Learned

  • Partition transforms like months() let Iceberg automatically organize data without changing your schema
  • The partitions metadata table shows you exactly how data is distributed
  • Partition pruning dramatically reduces the amount of data scanned
  • Hidden partitioning keeps your queries clean, no synthetic partition columns required

Next Up

In Exercise 4, we'll explore time travel and see how Iceberg's snapshot-based architecture lets you query historical versions of your data.

Do you have questions or comments? Join us in the #developer-confluent-io community Slack channel to engage in discussions with the creators of this content.
