
Why have Open Table Formats Emerged?
Data lakes have risen in popularity in the modern data stack alongside cloud computing and cheap storage. A data lake allows businesses to store their data in one place regardless of format and structure; when the data is needed, it can be ‘fished out’ for processing to serve analytics, machine learning or AI use cases.
We can think of it a bit like a kitchen pantry where new goods (data files) can be placed in the pantry and retrieved when a recipe requires them.
However, there are some challenges when it comes to working with a data lake:
Ambiguous identity and loss of lineage: You and your housemate both independently go to the supermarket to buy a bag of flour and put it in the pantry. You buy the same bag with the same expiry date, so to all appearances they are identical; you can no longer tell which was which or where each came from (data corruption).
Slow query and processing of data: If you want to make a cake, you have to search the whole pantry to find the baking soda, the flour and the eggs, scanning every item (picture it all in a pile rather than neatly organised).
No undo or reversion: if you spill a bag of flour over everything, there is no way to go back to the previous state of the files. The lack of time travel or versioning becomes a challenge.
Changing something is difficult: suppose you want to go back and add a purchase date to all your items. You may have lost that information, and even if you haven’t, you still have to go in and update each file individually.
What if you started a stock take for your pantry, noting down which items you store, where and when they were added, and their expiry dates? You would also start building a history of what the pantry looked like at any given point in time. That is the type of functionality that can be opened up by adopting an Open Table Format.
What Exactly is an Open Table Format?
At its core, an open table format is a metadata layer that sits on top of data files.
The metadata doesn’t hold the data itself, so the food remains on the pantry shelves. Instead, your log records which items make up a table (e.g. the data files required to bake a cake), the exact shelf location, quantity and expiry date of each item, and every change made to those items, so you can see what your pantry looked like on Tuesday.
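The pantry log above can be sketched as a small metadata file. Everything here is illustrative (the file paths, field names and layout are toy stand-ins, not any real table format’s schema); the point is that the log describes the data files rather than containing them:

```python
import json

# A toy metadata log: it never holds the food itself, only where each
# item lives, what we know about it, and snapshots over time.
metadata = {
    "table": "pantry",
    "snapshots": [
        {
            "snapshot_id": 1,
            "timestamp": "2024-06-04T09:00:00",  # what the pantry looked like on Tuesday
            "files": [
                {"path": "shelf-2/flour.parquet", "quantity": 1, "expires": "2024-12-01"},
                {"path": "shelf-1/eggs.parquet", "quantity": 12, "expires": "2024-06-20"},
            ],
        }
    ],
}

# A query engine reads this log first, then fetches only the files it needs.
latest = metadata["snapshots"][-1]
print([f["path"] for f in latest["files"]])
```

Because the log is just a description of files, it can sit beside the data in the same storage, and any engine that understands the format can read it.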
Key options in the open table space are Apache Iceberg, Delta Lake and Apache Hudi.
Open table formats store data across multiple files and use the metadata to make sense of them.
How Can an Open Table Format Help?
ACID Transactions: These underpin why you and your housemate can add groceries at exactly the same time: both bags of flour are logged, with nothing lost or hidden. But what exactly does ACID mean?
Atomicity: A transaction is all-or-nothing: if any part of it fails, the whole thing fails (you cannot partially add data).
Consistency: The database must follow its rules, a transaction must move data from one valid state to another.
Isolation: Transactions do not interfere with each other; each runs as if it were alone (hence the two flour purchases are logged separately).
Durability: Once a transaction is committed, it stays committed, even after a crash.
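One common way table formats achieve atomicity is to write a complete new metadata file and only then swap a single pointer to it. A minimal sketch of that idea using an atomic file rename (the file layout and names are illustrative, not any specific format’s on-disk scheme):

```python
import json
import os
import tempfile

def commit(table_dir, new_snapshot):
    """Atomically publish a new snapshot: write it fully, then swap the pointer."""
    snap_path = os.path.join(table_dir, f"snap-{new_snapshot['id']}.json")
    with open(snap_path, "w") as f:
        json.dump(new_snapshot, f)  # the whole snapshot is written first
    # os.replace is an atomic rename: readers see either the old pointer
    # or the new one, never a half-written state.
    tmp = os.path.join(table_dir, "current.ptr.tmp")
    with open(tmp, "w") as f:
        f.write(snap_path)
    os.replace(tmp, os.path.join(table_dir, "current.ptr"))

table_dir = tempfile.mkdtemp()
commit(table_dir, {"id": 1, "files": ["flour.parquet"]})
commit(table_dir, {"id": 2, "files": ["flour.parquet", "eggs.parquet"]})

# Whatever the pointer names is always a complete, valid snapshot.
with open(os.path.join(table_dir, "current.ptr")) as f:
    print(open(f.read()).read())
```

If a writer crashes mid-commit, the pointer still names the last complete snapshot, which is the all-or-nothing behaviour described above.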
What possibilities does the above open up?
Time Travel: Rewind your inventory to a time before the flour was spilt all over the floor.
Performance: Instead of searching the whole pantry for a particular ingredient, your metadata storage tells the query engine exactly where it needs to look.
Schema Evolution: If you want to add new information to an item, you can simply make that update; the metadata log accepts that older items may lack the new field while newer ones have it.
Open: interoperability! You can change the engine you use to process and read the data, because the metadata store stays constant.
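Time travel falls out of the metadata design: because every commit adds a new snapshot rather than overwriting the old one, reading the table “as of” an earlier point is just reading an earlier snapshot. A toy sketch (the snapshot structure is illustrative, not Iceberg’s or Delta’s actual metadata):

```python
# Each commit appends a snapshot; nothing is overwritten.
snapshots = [
    {"id": 1, "files": ["flour-v1.parquet"]},
    {"id": 2, "files": []},                    # the spill: flour removed
    {"id": 3, "files": ["flour-v2.parquet"]},  # restocked
]

def read_as_of(snapshot_id):
    """Time travel: resolve the file list for an older snapshot."""
    return next(s["files"] for s in snapshots if s["id"] == snapshot_id)

print(read_as_of(1))  # the pantry before the spill
print(read_as_of(3))  # the pantry now
```

Rolling back is then just pointing the table at an older snapshot; the old data files were never touched.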
What Types of Open-Table Formats are There?
Generally speaking, the open table formats share the features mentioned above, which can be summarised as:
Open source
Work with a range of engines
Time travel
Schema Evolution
DML Operations
Here is a list of formats you can experiment with, depending on your platform and infrastructure:
Apache Iceberg
Delta Lake
Apache Hudi
What About the Bulk of the Iceberg?
So what are some of the technical details of what is happening inside Iceberg tables?
First, understand the components of an Iceberg table:
The data files:
The food of the pantry. These are data files stored in a cloud storage service like S3, holding your data in an open-source columnar format, most commonly Apache Parquet, which is why you might encounter descriptions such as Parquet-backed Iceberg. These files are designed to store your data efficiently. Iceberg does not replace them; it is a catalog that organises them.
The manifest files: An Avro file that lists a subset of the data files that make up a table, so it consists of a list of Parquet files plus column-level statistics for each data file (the range of values in a column, for example). This enables query pruning: files can be skipped when the query engine knows which files contain the data of interest.
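The column-level statistics in a manifest are what make that pruning possible: an engine compares the query predicate against each file’s min/max values and skips any file that cannot contain matching rows. A hedged sketch of the idea (the manifest entries and field names are illustrative):

```python
# Illustrative manifest entries: one per data file, with min/max stats
# for a purchase_date column (ISO dates compare correctly as strings).
manifest = [
    {"path": "data-001.parquet",
     "purchase_date_min": "2024-01-01", "purchase_date_max": "2024-03-31"},
    {"path": "data-002.parquet",
     "purchase_date_min": "2024-04-01", "purchase_date_max": "2024-06-30"},
]

def prune(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [e["path"] for e in manifest
            if e["purchase_date_max"] >= lo and e["purchase_date_min"] <= hi]

# A query for May purchases only ever opens data-002.parquet.
print(prune(manifest, "2024-05-01", "2024-05-31"))
```

The engine never opens data-001.parquet at all: the decision is made from metadata alone, before any data is read.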
The manifest list:
This is an Avro file that contains a list of all the manifest files that make up one complete version (or snapshot) of a table.
It contains a list of manifest file paths and the higher-level summary statistics for each manifest file.
This defines the snapshot state of a table.
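Putting the three layers together, a read resolves top-down: pick a snapshot’s manifest list, follow it to the manifests, and collect the data files they point to. A toy traversal of that hierarchy (all the structures and file names here are illustrative stand-ins):

```python
# Illustrative in-memory stand-ins for the metadata hierarchy.
manifest_list = ["manifest-a.avro", "manifest-b.avro"]  # one snapshot of the table
manifests = {
    "manifest-a.avro": ["data-001.parquet", "data-002.parquet"],
    "manifest-b.avro": ["data-003.parquet"],
}

# A snapshot's contents = every data file reachable via its manifest list.
data_files = [f for m in manifest_list for f in manifests[m]]
print(data_files)
```

A different snapshot simply points at a different manifest list, which is how one set of immutable data files can serve many versions of the table.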
