The Good Tech Companies - How Iceberg + AIStor Power the Modern Multi-Engine Data Lakehouse
Episode Date: November 26, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/how-iceberg-aistor-power-the-modern-multi-engine-data-lakehouse, written by @minio. Apache Iceberg delivers schema evolution, partition evolution, time travel, and multi-engine compatibility for modern lakehouses. When combined with AIStor's high-performance object storage, it enables fast analytics, scalable AI/ML workloads, and flexible SQL-driven data engineering. This guide explains architecture, features, and hands-on setup.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How Iceberg + AIStor Power the Modern Multi-Engine Data Lakehouse, by MinIO.
Apache Iceberg seems to have taken the data world by (snow)storm.
Initially incubated at Netflix by Ryan Blue, also of Tabular (now Databricks) fame,
it was eventually donated to the Apache Software Foundation, where it currently resides.
At its core, it is an open table format for huge-scale data sets,
think hundreds of TBs to hundreds of PBs. With AI gobbling up vast amounts of data for model
creation, tuning, and real-time inference, the need for this technology has only increased since
its initial development. Iceberg is a multi-engine compatible format. That means that Spark, Trino,
Flink, Presto, Snowflake and Dremio can all operate independently and simultaneously on the same
data set. Iceberg supports SQL, the universal language of data analysis, and provides advanced features like
atomic transactions, schema evolution, partition evolution, time travel, rollback, and
zero-copy branching. This post examines how Iceberg's features and design, when paired with
AIStor's high-performance storage layer, offer the flexibility and reliability needed to build
data lakehouses. Related: Data Lakehouse Solutions by MinIO. Goals for a modern open
table format. The promise of a lakehouse lies in its ability to unify the best of data lakes and
warehouses. Data lakes stored on object storage were designed to store massive amounts of
raw data in its native format, offering scalability and cost-effectiveness. However,
data lakes on their own struggled with ACID-like data governance and the ability to efficiently
query the data stored in them. Warehouses, on the other hand, were optimized for structured
data and SQL analytics but fell short in handling semi-structured or unstructured data at scale.
Iceberg and other open table formats like Apache Hudi and Delta Lake represent a leap forward.
These formats make data lakes behave like warehouses all while retaining the initial flexibility
and scalability of object storage. This new stack of open table formats and open storage can
support diverse workloads such as AI, machine learning, advanced analytics, and real-time
visualization. Iceberg distinguishes itself with its SQL-first approach and focus on multi-engine
compatibility. Here's a look at how this infrastructure could function. The query engines can
sit directly on top of the object storage in your stack and query the data inside it without
migration. The old paradigm of migrating data into a database's storage is no longer relevant.
Instead, you can and should be able to query your Iceberg tables from anywhere.
Analytics tools and visualizations can often connect directly to the storage where
your Iceberg tables live, but more typically they interface with your query engine for a more streamlined
user experience. Finally, AI/ML follows the same design principle as the query engines: AI/ML
tools will use Iceberg tables directly in object storage without migration. While not all AI/ML
tools utilize Iceberg at this time, there is a deepening interest in using Iceberg for
AI/ML workloads, particularly for its capability of maintaining multiple versions of models
and datasets without difficulty. What's special about Iceberg? Apache Iceberg distinguishes itself
through a combination of features and design principles that set it apart from alternatives like
Apache Hudi and Delta Lake. Schema Evolution. Iceberg offers comprehensive support for schema
evolution, allowing users to add, drop, update, or rename columns without necessitating
a complete rewrite of existing data. This flexibility ensures that changes to the data model can
be implemented seamlessly, maintaining operational continuity. Partition Evolution.
Iceberg supports partition evolution, enabling modifications to partitioning schemes over time without
requiring data rewrites. This capability allows for dynamic optimization of data layout as access
patterns change. Metadata management. Iceberg employs a hierarchical metadata structure,
including manifest files, manifest lists, and metadata files. This design enhances query planning
and optimization by providing detailed information about data files, facilitating efficient data access,
and management. Multi-engine compatibility. Designed with openness in mind, Iceberg is compatible
with various processing engines such as Apache Spark, Flink, Trino, and Presto. This interoperability
provides organizations with the flexibility to choose the tools that best fit their needs and
effectively commoditizes query engines, driving down pricing and driving up innovation in these products.
Open source community. As with all of the open table formats, Iceberg benefits from a diverse
and active community, contributing to its robustness and continuous improvement. Object Storage
and Iceberg. Object storage forms the foundation of modern lakehouse architectures, providing the
scalability, durability, and cost efficiency required to manage vast data sets across many environments.
In a lakehouse architecture, the storage layer plays a critical role in ensuring the durability
and availability of data, while an open table format like Iceberg manages the metadata.
AIStor is particularly well-suited for these requirements. For example, for high-throughput workloads like model training, AIStor's support for S3 over RDMA ensures low-latency access to data. This feature makes the combination of the two technologies a very effective solution, particularly for large-scale AI and analytics pipelines. High performance is critical to the success of lakehouse initiatives. It doesn't matter how cool your table format is if it can't serve up queries as fast as your users require.
Try it out yourself.
Prerequisites.
You'll need to install Docker and Docker Compose.
When both are required, it's often easier to install Docker Desktop.
Download and install Docker Desktop from the official Docker website.
Follow the installation instructions provided in the installer for your operating system.
Open Docker Desktop to ensure it's running.
If you would prefer, you can also verify the installation of Docker and Docker Compose
by opening a terminal and running the version commands.
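The commands themselves were omitted from the audio; the standard version checks below are a reasonable stand-in (the exact Compose invocation depends on whether you have the Compose plugin or the older standalone binary):

```shell
# Confirm the Docker engine CLI is installed
docker --version

# Confirm Compose is available (plugin form; use `docker-compose --version`
# for the older standalone binary)
docker compose version
```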
Iceberg tutorial.
This section demonstrates how to integrate Iceberg and AIStor for a robust lakehouse architecture.
We'll use the Apache Iceberg Spark Quickstart to set up our initial environment.
Step 1. Set up with Docker Compose.
Follow the Apache Iceberg Spark Quick Start guide to launch a containerized environment with Spark and Iceberg pre-configured.
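The Compose file itself was not captured in the audio. The abridged sketch below is modeled on the file published in the Apache Iceberg Spark Quickstart (a tabulario/spark-iceberg Spark container, an iceberg-rest catalog, and MinIO); treat the guide's copy as authoritative, since image tags, some ports, and the helper mc service are trimmed here:

```yaml
version: "3"
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    depends_on: [rest, minio]
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - "8888:8888"   # Jupyter
      - "8080:8080"   # Spark UI
  rest:
    image: tabulario/iceberg-rest
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
    ports:
      - "8181:8181"
  minio:
    image: minio/minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # Console
```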
The first step in the guide is to copy the Compose configuration into a file named docker-compose.yml. Next, in a terminal window, navigate to the directory where you saved the YAML file and start the Docker containers with docker-compose up -d. You can then run docker exec -it spark-iceberg spark-sql to start a Spark SQL session. Step 2. Create a table. Open a Spark SQL shell and create a table, for example CREATE TABLE my_table (id bigint, data string, category string) PARTITIONED BY (category). Inspect your MinIO console by navigating to http://127.0.0.1:9001 and logging in with the credentials from the YAML file (AWS_ACCESS_KEY_ID=admin, AWS_SECRET_ACCESS_KEY=password). Observe the iceberg/my_table/metadata path created for the table. Initially, no data files exist, only metadata that defines the schema and partitions. Step 3. Insert data. Insert some mock data, e.g. INSERT INTO my_table VALUES (1, 'a', 'music'), (2, 'b', 'video'). This operation creates data files in the appropriate partitions (my_table/data/category=...) and updates the metadata. Step 4. Query the data. Run a basic query, such as SELECT * FROM my_table, to validate. Your output should look something
like the following. Testing key Iceberg features. Now that you've established a baseline
data lakehouse with data that you can manipulate, let's check out the features that make Iceberg
unique: schema evolution and partition evolution. Schema evolution is one of Apache
iceberg's most powerful and defining features, addressing a significant pain point in traditional
warehouse systems. It allows you to modify a table schema without requiring expensive
and time-consuming data rewrites, which is particularly beneficial for large-scale data sets.
This capability enables organizations to adapt their data models as business needs evolve
without disrupting ongoing queries or operations. In iceberg, schema changes are handled
purely at the metadata level. Each column in a table is assigned a unique ID,
ensuring that changes to the schema do not affect the underlying data files.
For instance, if a new column is added, Iceberg assigns it a new ID without reinterpreting
or rewriting existing data.
This avoids errors and ensures backward compatibility with historical data.
Here's an example of how schema evolution works in practice.
Suppose you need to add a new column, buyer, to store additional information about transactions.
You can execute the following SQL command.
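The command itself was elided from the audio; a plausible form, assuming a string-typed buyer column on the tutorial's my_table:

```sql
-- Metadata-only change: no existing data files are rewritten
ALTER TABLE my_table ADD COLUMN buyer STRING;
```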
This operation updates the table's metadata, allowing new data files to include the buyer column.
Older data files remain untouched, and queries can be written to handle both new and old data seamlessly.
Similarly, if a column is no longer required, you can remove it without affecting the stored data.
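Again the spoken command was elided; dropping the tutorial table's category column would look like:

```sql
-- Removes category from the schema; stored data files are not rewritten
ALTER TABLE my_table DROP COLUMN category;
```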
This action updates the schema to exclude the category column.
Iceberg ensures that any associated metadata is cleaned up while maintaining the integrity
of historical queries that might still reference older data versions.
Iceberg schema evolution also supports renaming columns, changing column types, and rearranging column order,
all through efficient metadata-only operations. These changes are made without altering the data files,
making schema modifications instantaneous and cost-effective. This approach is especially
advantageous in environments with frequent schema changes, such as those driven by AI/ML experiments,
dynamic business logic, or evolving regulatory requirements. Partition evolution.
Iceberg supports partition evolution that can handle the tedious and error-prone tasks of producing partition values for rows in a table.
Users focus on adding filters to the queries that solve business problems and do not worry about how the table is partitioned.
Iceberg automatically avoids reads from unnecessary partitions.
Iceberg handles the intricacies of partitioning and changing the partition scheme of a table for you,
greatly simplifying the process for end users.
You can define partitioning or let iceberg take care of it for you.
Iceberg likes to partition on a timestamp, such as event time.
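The DDL was not captured in the audio, so here is a hedged sketch of a timestamp-partitioned table; the catalog, table, and column names (my_catalog.db.events, event_time) are illustrative:

```sql
-- Hidden partitioning: Iceberg derives the partition value months(event_time)
-- from the timestamp, so writers never compute or supply it themselves
CREATE TABLE my_catalog.db.events (
    id BIGINT,
    event_time TIMESTAMP,
    payload STRING)
USING iceberg
PARTITIONED BY (months(event_time));

-- Readers just filter on the timestamp; Iceberg prunes partitions for them
SELECT count(*) FROM my_catalog.db.events
WHERE event_time >= TIMESTAMP '2025-01-01 00:00:00';
```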
Partitions are tracked by snapshots and manifests.
Queries no longer depend on a table's physical layout.
Because of this separation between physical and logical tables,
Iceberg tables can evolve partitions over time as more data is added.
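The concrete commands were not captured in the audio. Assuming an illustrative table my_catalog.db.events currently partitioned by months(event_time), evolving the partition spec is a metadata-only ALTER:

```sql
-- Metadata-only change: existing files keep the monthly spec,
-- while new data is written with daily partitions
ALTER TABLE my_catalog.db.events ADD PARTITION FIELD days(event_time);
ALTER TABLE my_catalog.db.events DROP PARTITION FIELD months(event_time);
```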
We now have two partitioning schemes for the same table.
From now on, query plans are split,
using the old partition scheme to query old data,
and the new partition scheme to query new data. Iceberg takes care of this
for you. People querying the table don't need to know that data
is stored using two partition schemes. Iceberg does this through a combination of behind-the-scenes
where clauses and partition filters that prune out data files without matches. Time travel and
rollback. Every write to Iceberg tables creates a new snapshot. Snapshots are like versions
and can be used to time travel and roll back, just as we do with AIStor versioning
capabilities. Snapshots are kept manageable by expiring them (for example, with the expire_snapshots procedure) so the system is
maintained well. Time travel enables reproducible queries that use exactly the same table snapshot,
or lets users easily examine changes. Version rollback allows users to quickly correct problems by
resetting tables to a good state. As tables are changed, iceberg tracks each version as a snapshot
and then provides the ability to time travel to any snapshot when querying the table. This can be very
useful if you want to run historical queries or reproduce the results of previous queries, perhaps
for reporting. Time travel can also be helpful when testing new code changes because you can test
new code with a query of known results. To see the snapshots that have been saved for a table, run a query
against the snapshots metadata table; your output should look something like this. Some examples: you can do incremental
reads using snapshots, but you must use Spark, not Spark SQL. You can also roll back the
table to a point in time or to a specific snapshot, as in these two examples.
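The spoken examples were elided from the audio; the sketch below shows hedged versions of the snapshot listing, time travel, rollback, and incremental-read operations described above (table names, snapshot IDs, and timestamps are illustrative):

```sql
-- List saved snapshots via the snapshots metadata table
SELECT committed_at, snapshot_id, operation FROM my_table.snapshots;

-- Time travel: query the table as of a timestamp or a snapshot ID
SELECT * FROM my_table TIMESTAMP AS OF '2025-11-01 00:00:00';
SELECT * FROM my_table VERSION AS OF 123456789;

-- Roll back via Spark procedures (prefix with your catalog name if needed)
CALL system.rollback_to_timestamp('db.my_table', TIMESTAMP '2025-11-01 00:00:00');
CALL system.rollback_to_snapshot('db.my_table', 123456789);

-- Incremental reads between snapshots need the DataFrame API, not Spark SQL:
--   spark.read.format("iceberg")
--        .option("start-snapshot-id", "123456789")
--        .option("end-snapshot-id", "987654321")
--        .load("db.my_table")
```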
Iceberg supports all the expressive SQL commands like row-level delete, merge, and update,
and the biggest thing to highlight is that iceberg supports both eager and lazy strategies.
We can encode all the things we need to delete (for example, for GDPR or CCPA compliance)
but not rewrite all those data files immediately. We can lazily collect garbage as needed, and that
helps efficiency on the huge tables Iceberg supports. For example, you can delete all
records in a table that match a specific predicate; DELETE FROM my_table WHERE category = 'video' will remove all rows from the
video category. Alternatively, you could use CREATE TABLE AS SELECT or REPLACE TABLE AS SELECT to
accomplish this, and MERGE INTO lets you merge two tables very easily. Data Engineering. Iceberg is the foundation
for the open analytic table standard: it uses SQL behavior and applies data warehouse
fundamentals to fix problems before we know we have them. With declarative data engineering,
we can configure tables and not worry about changing each engine to fit the needs of the data.
This unlocks automatic optimization and recommendations. With safe commits, data services are
possible, which helps avoid humans babysitting data workloads. Here are some examples of these
types of configurations. To inspect a table's history, snapshots, and other metadata,
Iceberg supports querying metadata tables. Metadata tables are identified by adding the metadata table
name, for example history, after the original table name in your query: my_table.files displays a table's data
files, my_table.manifests its manifests, my_table.history its table history, and my_table.snapshots its snapshots. You can also join snapshots
to table history to see the application that wrote each snapshot. Now that you've learned the
basics, load some of your data into Iceberg, and learn more from the Iceberg documentation.
Integration. Various query and execution engines have implemented Iceberg connectors, making it easy to
create and manage Iceberg tables. The engines that support Iceberg include Spark, Flink, Presto, Trino, Dremio,
and Snowflake, and the list is growing. This extensive integration landscape ensures that organizations
can adopt Apache iceberg without being constrained to a single processing engine, promoting flexibility
and interoperability in their data infrastructure. Catalogs. It would be remiss in
any definitive guide not to mention catalogs.
Iceberg catalogs are central to managing table metadata and facilitating connections between
datasets and query engines. These catalogs maintain critical information such as table schemas,
snapshots, and partition layouts, enabling Iceberg's advanced features like time travel,
schema evolution, and atomic updates. Several catalog implementations are available to meet diverse
operational needs. For instance, Polaris offers a scalable, cloud-native
cataloging solution tailored to modern data infrastructures, while Dremio's Nessie introduces
versioning with Git-like semantics, enabling teams to track changes to data and metadata with
precision. Traditional solutions like Hive Metastore are still widely used, particularly for
backward compatibility with legacy systems. It's cool to build data lakes with Iceberg and
AIStor. Apache Iceberg gets a lot of attention as a table format for data lakes. The growing
open-source community and increasing number of integrations from multiple cloud providers and
application frameworks mean that it's time to take Iceberg seriously and start experimenting,
learning, and planning on integrating it into your existing data lake architecture.
Pair Iceberg with AIStor for multi-cloud data lakehouses and analytics. As you get started with
Iceberg and AIStor, please reach out and share your experiences or ask questions through our Slack
channel. Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
