Apache Iceberg

Posted on May 18, 2026

Introduction

Apache Iceberg is a super compelling technology, I love it. There’s a lot of reasons to be excited, it’s open source, performant, supports time travel and is ACID compliant. These features are all well and good, but often with technologies like Iceberg we never really understand the why - always the how or the what for. If you haven’t already come across column formats, namely Parquet, take a read of my blog post on DuckDB and columnar formats first: here .

What are we trying to avoid anyway?

To start, imagine we’re a small data analytics team using Parquet files to do some data analysis and we’re crushing it. The team scales and we’ve got more data, people and problems to solve than before. We run into all too familiar problems (often seen with Excel/CSV too!):

File versioning becomes chaotic, causing breaking changes
People overwriting each other’s changes without knowing
We end up duplicating data to work around these limitations
We go round the office on apology tours when we break stuff

In practice, we’re trying to handle things a database would usually handle (concurrent writes, locking, transactions) with people process (nobody touch customer_data_last_version_i_promise.parquet until Monday!).

Data Resilience

So, the first thing iceberg gives us is ACID compliance. If you’re not familiar this is a set of principles which guide database design to ensure data is correctly handled. If you are not using an ACID compliant format or runtime then you can run into concurrency issues as previously mentioned. Iceberg has a couple of levels of metadata surrounding Parquet files to handle this. The spec goes into detail on this metadata, but effectively Iceberg allows us to avoid modifying files after we’ve written data. For instance, it uses delete files to avoid having to delete data directly, and instead handles delete when the table is read by reading the original parquet except for the rows listed in the delete file. This is crucially important as a concurrency failure in Iceberg effectively means an orphaned set of metadata files as opposed to a broken CSV/Parquet file when two people try to delete the same file, i.e. by forcing folks to only create new files we make sure they don’t collide into each other when making changes.

The metadata hierarchy

If you like seeing how things work this section will cover the nitty-gritty. The layered metadata structure is what makes all of Iceberg’s features work. Every Iceberg table is built from four layers:

The catalog holds a single pointer to the current metadata.json, effectively this captures the state of the current table. When you make a change, Iceberg atomically swaps this pointer to the new snapshot. The metadata.json tracks the full schema, partition spec, and a list of snapshots (the history). Each manifest list represents one snapshot and lists all the manifest files that make it up, e.g. think of this as a version of a table. Each manifest file records a batch of data files along with their column-level statistics (min, max, row count). This metadata helps improve query performance, but more on that later.

To handle concurrency when updating the pointer to the latest snapshot often catalogs are backed by a RDBMS like Postgres, a common option is Polaris.

Schema Evolution

Schema evolution is an exciting feature of Iceberg. The real differentiator here is that changes to the schema happen at the metadata level and therefore do not make any changes at the data level. What effectively happens is the metadata structures we previously discussed will capture these changes, and on read/write to the table the resulting action executes with the schema change as a constraint. For instance let’s say we delete a column, the underlying data is still present but any query trying to access this column will fail because the metadata conflicts. Similar approaches go for other changes, e.g. query the parquet for an old column name but return it with the updated name or return default/null values from a new column where the underlying data doesn’t exist.

Implementing these changes at the metadata level is really important as it avoids expensive rewrites/restructures of data.

Time Travel

As cool and theoretical as this feature sounds, it’s actually super simple. When we look at each change somebody makes to an Iceberg table, from the previous section we see that a new metadata pointer to the new version of the table is created. Very simply, we can time travel by presenting all the previous metadata pointers in the Iceberg table and allowing users to select which one they’d like. As Iceberg tables are built at query time, we just get the client to execute a query with this metadata pointer and they’ll get the version of the table at that point in time, effectively ignoring all the data files, delete files that have been written since. Pretty cool!

Divorcing storage

You may have heard about the separation of storage and compute, which is a key idea behind iceberg. We might be supporting multiple runtimes so we want to be able to give them a process for understanding when and how to interact with iceberg tables. We also need to store metadata and data about our updates like transactions and versioning history which should live in the filesystem. Typically these things are tightly coupled (e.g. Databases) but as Iceberg is a file format and your runtime (SAS, DuckDB, Spark etc) can handle the mechanics of reading and writing as this is an open format. This is what is meant by separating compute and storage as traditionally runtimes and databases have each had a unique file format that is unlikely to be supported by other offerings. There’s a bunch of benefits, but namely we don’t have to duplicate data in different formats to support different analytics tools and we can give users a wider choice of which tools they would like to use.

In practice, the same Iceberg data in S3 is going to be accessible by SAS, Spark, Snowflake and other runtimes which support it.

Hidden Partitions and Query optimization

Partitioning is a super cool feature of Iceberg. For a given table you can define a partition and Iceberg will handle the mechanics on your behalf. For instance, when partitioning by date, Iceberg will automatically partition data when new data is written into a table and apply any transformations (e.g. calculating month from timestamp). In databases and other runtimes you often need to create a specific index or column to partition by and write queries to use this at runtime.

In addition, all this metadata lets us store really cool information like the min, max and range values for columns in a particular parquet file. For instance, imagine we have an iceberg with environmental data. By storing these pieces of metadata, Iceberg can improve query performance of queries with filters (e.g. only days where rainfall > 5mm or temperature < 32 degrees) as we can read the metadata and see the min/max/range values do not overlap with our query filter. In addition, because the partitioning is handled at the metadata level, we can change the partition specification without having to rewrite old data.

Challenges

One of the things to watch out for with Iceberg is that a lot of small changes can stack up large amounts of files in a table (since we don’t edit existing files as discussed earlier). This can cause performance issues as more files means more actions to take. Iceberg does provide features for table maintenance which allow you to merge these changes into each other (effectively concatenating files). However, I do view this as reimplementing one of the simple functions of a database which feels duplicative (ahem DuckLake)!

It’s worth noting though that a lot of changes have been making their way into the spec to address these, particularly things like Delete Vectors in v3.

Conclusion

Hopefully this has provided a quick overview of Iceberg. It’s a great starting point for exploring open data formats as it’s feature rich but doesn’t require much additional infrastructure. While there are some challenges with Iceberg as read/writes scale, most datasets and use cases will not be materially impacted by this as they won’t have the read/write or data volumes to surface these limitations.