
You've created tables and evolved their schemas. That's fine for small datasets, but what happens when you've got five million rows? Or five billion? At some point, scanning every single file just to find last Tuesday's orders becomes... let's call it "suboptimal."
Partitioning solves this by organizing your data into logical groups. Iceberg takes it a step further with hidden partitioning—you define the partitioning strategy once, and Iceberg handles the rest. No more partition columns cluttering up your queries. No more WHERE year=2025 AND month=01 gymnastics.
In this exercise, you'll create a partitioned table, generate enough data to make partitioning worthwhile, and see exactly how Iceberg prunes partitions at query time.
By the end of this exercise, you will be able to:
- Create an Iceberg table with a hidden partition transform
- Generate and load enough test data to make partitioning worthwhile
- Inspect partitions and files through Iceberg's metadata tables
- Verify partition pruning at query time with EXPLAIN COST
If it's not already running:
docker compose up -d
Run Spark SQL:
docker compose exec -it spark-iceberg spark-sql --conf "spark.hadoop.hive.cli.print.header=true"
Create the database if you don't have it:
CREATE NAMESPACE IF NOT EXISTS demo.ecommerce;
USE demo.ecommerce;
This time, we're adding a partitioning clause. We'll partition by month, derived from the order date:
CREATE TABLE orders_partitioned (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount DECIMAL(10, 2),
status STRING
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES ('format-version'='2', 'write.format.default'='parquet');
That months(order_date) is a partition transform. Iceberg supports several of these: years(), months(), days(), and hours() for dates and timestamps, bucket() for hash distribution, and truncate() for strings and numbers. The beauty is that your queries don't need to know about them—you just query order_date like normal.
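To make the transform concrete, here's a rough sketch in Python of what the months() transform computes for a DATE column: the number of whole months elapsed since January 1970. The function name and code are ours, for illustration only:

```python
from datetime import date

def month_transform(d: date) -> int:
    # Iceberg's month transform: whole months elapsed since 1970-01
    return (d.year - 1970) * 12 + (d.month - 1)

# Every date in the same calendar month maps to the same partition value
print(month_transform(date(2025, 1, 3)))   # 660 (January 2025)
print(month_transform(date(2025, 6, 30)))  # 665 (June 2025)
```

These integer partition values will show up again later when we inspect the table's metadata.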
You can check the detailed table schema with:
SHOW CREATE TABLE orders_partitioned;
Five rows won't cut it for demonstrating partitioning. Let's generate 1,000 orders spread across six months.
Spark's explode() and sequence() functions help us generate test data:
INSERT INTO orders_partitioned
SELECT
row_num AS order_id,
1 + (row_num % 100) AS customer_id,
date_add(DATE '2025-01-01', row_num % 180) AS order_date,
ROUND(CAST((10 + (row_num % 500)) AS DECIMAL(10,2)) + (row_num % 100) / 100.0, 2) AS total_amount,
CASE (row_num % 4)
WHEN 0 THEN 'completed'
WHEN 1 THEN 'completed'
WHEN 2 THEN 'pending'
ELSE 'cancelled'
END AS status
FROM (SELECT explode(sequence(1, 1000)) AS row_num);
Let's break down what's happening:
- sequence(1, 1000) builds an array of the integers 1 through 1,000, and explode() turns it into one row per integer.
- 1 + (row_num % 100) cycles customer IDs from 1 to 100.
- date_add(DATE '2025-01-01', row_num % 180) spreads order dates across 180 days (roughly six months).
- row_num % 4 assigns statuses, with 'completed' twice as likely as 'pending' or 'cancelled'.
Verify the data landed:
SELECT COUNT(*) FROM orders_partitioned;
Check the distribution across months:
SELECT
DATE_TRUNC('month', order_date) AS month,
COUNT(*) AS order_count
FROM orders_partitioned
GROUP BY 1
ORDER BY 1;
You should see orders spread across January through June 2025.
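The per-month counts follow directly from the modulo arithmetic in the INSERT. If you want to double-check them outside Spark, a small Python simulation of the same date formula reproduces the distribution (a sanity check, not Spark code):

```python
from collections import Counter
from datetime import date, timedelta

# Mirror the INSERT: rows 1..1000, order_date = 2025-01-01 + (row_num % 180) days
counts = Counter()
for row_num in range(1, 1001):
    d = date(2025, 1, 1) + timedelta(days=row_num % 180)
    counts[(d.year, d.month)] += 1

for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month:02d}: {n}")
# 2025-01: 185 through 2025-06: 145, matching the GROUP BY results
```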
Here's what I got:
month order_count
2025-01-01 00:00:00 185
2025-02-01 00:00:00 168
2025-03-01 00:00:00 186
2025-04-01 00:00:00 161
2025-05-01 00:00:00 155
2025-06-01 00:00:00 145
Now let's see how Iceberg organized our data. The partitions metadata table shows exactly what partitions exist:
SELECT * FROM demo.ecommerce.orders_partitioned.partitions;
You should see six partitions—one for each month. The partition values range from 660 to 665, representing the number of months since January 1970 (the epoch for the months() transform). Notice the record_count and file_count columns—each partition tracks its own statistics.
Here's what I got:
partition | record_count | file_count | total_size | >
------------------------+--------------+------------+------------+----------------------------------------------->
{order_date_month=660} | 185 | 1 | 2162 | {order_id={min=1, max=930, null_count=0, nan_c>
{order_date_month=663} | 161 | 1 | 1958 | {order_id={min=90, max=1000, null_count=0, nan>
{order_date_month=664} | 155 | 1 | 1957 | {order_id={min=120, max=870, null_count=0, nan>
{order_date_month=661} | 168 | 1 | 1934 | {order_id={min=31, max=958, null_count=0, nan_>
{order_date_month=662} | 186 | 1 | 2100 | {order_id={min=59, max=989, null_count=0, nan_>
{order_date_month=665} | 145 | 1 | 1894 | {order_id={min=151, max=899, null_count=0, nan>
(6 rows)
Let's also look at the files:
SELECT
partition,
file_path,
record_count,
file_size_in_bytes
FROM demo.ecommerce.orders_partitioned.files;
You'll see that each partition has its own file, organized by month in the file path:
partition file_path record_count file_size_in_bytes
{"order_date_month":664} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-05/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00004.parquet 155 2779
{"order_date_month":665} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-06/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00002.parquet 145 2715
{"order_date_month":662} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-03/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00006.parquet 186 2923
{"order_date_month":663} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-04/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00001.parquet 161 2778
{"order_date_month":660} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-01/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00003.parquet 185 2937
{"order_date_month":661} s3://warehouse/ecommerce/orders_partitioned/data/order_date_month=2025-02/00000-1-424bb7ee-af3e-4310-96a3-6d0619889239-0-00005.parquet 168 2757
Here's where the magic happens. When you query a specific time range, Iceberg skips partitions that can't possibly contain matching data.
Let's verify partition pruning with actual row counts. Start with a full table scan:
SELECT COUNT(*) AS total_orders
FROM orders_partitioned;total_orders
1000Now let's see what happens when you filter for just March:
SELECT COUNT(*) AS march_orders
FROM orders_partitioned
WHERE order_date >= DATE '2025-03-01'
AND order_date < DATE '2025-04-01';
march_orders
186
So, 186 rows for March out of 1,000 in total.
Let's use EXPLAIN COST to see the difference in what Iceberg plans to scan. First, the full table:
EXPLAIN COST
SELECT * FROM orders_partitioned;
Look at the Statistics line in the output:
Statistics(sizeInBytes=46.9 KiB, rowCount=1.00E+3)
This shows Iceberg will scan ~47 KB across all 1,000 rows (all six partitions).
Now compare with the filtered query:
EXPLAIN COST
SELECT * FROM orders_partitioned
WHERE order_date >= DATE '2025-03-01'
AND order_date < DATE '2025-04-01';
You should see something like:
Statistics(sizeInBytes=8.7 KiB, rowCount=186)
This is partition pruning in action! The filtered query only scans ~8.7 KB and 186 rows—just the March partition. Iceberg automatically determined that only partition 662 (March 2025) could contain matching rows, so it skipped the other five partitions entirely.
In a real data lake with terabytes of data, this is the difference between a 30-second query and a 3-hour query.
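The pruning decision itself is simple to picture: encode the filter's bounds with the same month transform, then keep only partitions whose value falls in range. A simplified Python sketch (our own illustration, not Iceberg's actual implementation):

```python
from datetime import date, timedelta

def month_value(d: date) -> int:
    # Same months-since-1970 encoding as the months() transform
    return (d.year - 1970) * 12 + (d.month - 1)

def prune(partition_values, lo: date, hi_exclusive: date):
    """Keep only partitions that could hold dates in [lo, hi_exclusive)."""
    lo_m = month_value(lo)
    hi_m = month_value(hi_exclusive - timedelta(days=1))
    return [p for p in partition_values if lo_m <= p <= hi_m]

partitions = [660, 661, 662, 663, 664, 665]  # Jan through Jun 2025
print(prune(partitions, date(2025, 3, 1), date(2025, 4, 1)))  # [662]
```

The March filter survives only for partition 662; the other five are never read.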
Here's something subtle but important. Notice that in all our queries above, we filtered on order_date, not on any partition column. Let's demonstrate this with another example:
-- Query for a single day in March
SELECT COUNT(*) AS single_day_orders
FROM orders_partitioned
WHERE order_date = DATE '2025-03-15';
You should get 6 orders for that specific day. Now try EXPLAIN COST on this query:
EXPLAIN COST
SELECT COUNT(*) AS single_day_orders
FROM orders_partitioned
WHERE order_date = DATE '2025-03-15';
Notice the Statistics still show rowCount=186—the same as when we queried the entire month! This is because partition pruning works at the partition level. Since we partitioned by months(order_date), Iceberg prunes to the March partition (all 186 rows), then filters down to the specific day during the scan.
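You can see why month granularity is the floor here: every date in March 2025 maps to the same partition value, so a day-level filter can never prune below the whole month. Again using our own sketch of the months-since-1970 encoding:

```python
from datetime import date, timedelta

def month_value(d: date) -> int:
    # Months elapsed since 1970-01, as in the months() transform
    return (d.year - 1970) * 12 + (d.month - 1)

march_days = [date(2025, 3, 1) + timedelta(days=i) for i in range(31)]
print({month_value(d) for d in march_days})  # {662}: all 31 days share one partition
```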
Behind the scenes, Iceberg:
- Applied the months() transform to your filter to work out which partition values could match
- Read the partition statistics from the table's metadata
- Skipped every data file whose partition value fell outside the range
- Scanned only the matching files, applying the date filter row by row
The key insight: you didn't need to specify WHERE partition = 662 or WHERE month = 'March'. You simply filtered on the actual date column, and Iceberg automatically determined which partition to read.
This is hidden partitioning. The partition structure is an implementation detail, not something your queries need to worry about. Compare that to Hive-style partitioning, where you'd need to:
- Add explicit partition columns (like year and month) to the table schema
- Populate them correctly on every write
- Remember to filter on those exact columns in every query, or pruning silently stops working
With Iceberg, you define the transform once—months(order_date)—and forget about it.
Exit Spark SQL:
exit;
If you're done for the day:
docker compose down -v
In Exercise 4, we'll explore time travel and see how Iceberg's snapshot-based architecture lets you query historical versions of your data.