The Good Tech Companies - The Architect’s Handbook to Open Table Formats and Object Storage
Episode Date: November 26, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/the-architects-handbook-to-open-table-formats-and-object-storage. Learn how Apache Iceberg, Delta Lake, and Apache Hudi pair with object storage to build scalable, open, high-performance data lakehouse architectures. Check more stories related to programming at: https://hackernoon.com/c/programming. You can also check exclusive content about #open-table-formats, #apache-iceberg, #delta-lake, #apache-hudi, #data-lakehouse-architecture, #object-storage-systems, #ai-and-analytics-workloads, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. Open table formats like Iceberg, Delta Lake, and Hudi combined with object storage enable scalable, modular, and interoperable data lakehouses. This guide explains their evolution, compares key features, and explores performance, compute considerations, and the future of open, AI-ready architectures for modern analytics.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
The Architect's Handbook to Open Table Formats and Object Storage, by MinIO.
This article was originally published on The New Stack.
Compare Apache Iceberg, Delta Lake and Apache Hudi, and learn how to choose the right
open table format for your data lakehouse.
Open table formats and object storage are redefining how organizations architect their data
systems, providing the foundation for scalable, efficient and future-proof data
lakehouses. By leveraging the unique strengths of object storage, its scalability, flexibility and
cost effectiveness, alongside the advanced metadata management capabilities of open table formats
like Apache Iceberg, Delta Lake and Apache Hudi, organizations can create modular architectures
that meet the demands of modern data workloads. At the core of this architectural shift is
the disaggregation of compute and storage. Object storage serves as the foundation, offering seamless
management of structured, semi-structured and unstructured data, while open table formats
act as a metadata abstraction layer, enabling database-like features such as schemas, partitions
and atomicity, consistency, isolation and durability (ACID) transactions. Compute engines like Spark,
Presto, Trino and Dremio interact with these table formats, delivering the flexibility to process
and analyze data at scale without vendor lock-in. This guide will delve into the role of open
table formats and object storage in building modern data lakehouses. I'll explore their evolution,
compare leading table formats and highlight performance considerations that optimize your
architecture for advanced analytics and AI workloads. By understanding these components,
you'll be equipped to design data systems that are not only efficient and scalable,
but also adaptable to the rapidly changing demands of the data-driven era. Where open table
formats fit in. The modern data lakehouse architecture builds upon three critical components, the
the storage layer, the open table format and the compute engines. This modular design is optimized
to take full advantage of object storage's scalability and cost efficiency while leveraging open
table formats for seamless metadata management and interoperability across diverse compute engines.
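To make that layering concrete, here is a minimal sketch of wiring the three components together with PySpark and Apache Iceberg against an S3-compatible object store; the runtime version, catalog name, bucket and endpoint below are illustrative placeholders, not prescriptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    # Compute layer: Spark with the Apache Iceberg runtime on its classpath.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Table-format layer: an Iceberg catalog whose warehouse lives in object storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    # Storage layer: any S3-compatible object store reachable at this endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "https://object-store.example.com")
    .getOrCreate()
)

# A table managed by the format layer, persisted as files in the object store.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
""")
```

Later sketches in this guide reuse this placeholder session and table.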
At its foundation lies the storage layer of object storage, which provides scalable and
flexible storage for structured, semi-structured and unstructured data. On top of the storage layer sit the
open table formats, which could be Apache Iceberg, Delta Lake, or Apache Hudi. These open table
formats act as a metadata abstraction layer, providing database-like features including schemas,
partitions and versioning, as well as advanced features like ACID transactions, schema evolution,
and time travel. Finally, compute engines like Spark, Presto, Trino and Dremio interact with the
open table formats to process and analyze data at scale, offering users the flexibility to choose
the best tool for their workload. Evolution of data architectures. The rise of data lakehouses can be
understood as part of the broader evolution of data architectures. Early systems like online
transaction processing (OLTP) databases prioritized transactional integrity but lacked analytical
capabilities. The advent of online analytical processing (OLAP) systems introduced data warehouses,
optimized for querying structured data, but at the cost of handling semi-structured and unstructured
data efficiently. Data lakes emerged to address these limitations, offering scalable storage
for varied data types and schema-on-read capabilities. However, the lack of transactional
guarantees in data lakes spurred the development of data lakehouses, which integrate the
strengths of data lakes and data warehouses into a unified architecture. Lakehouses are built on
open table formats and object storage and are fully decoupled, meaning they are constructed
of modular components. This disaggregated architecture provides both the
transactional consistency of databases and the scalability of object storage.
Why open table formats are ideal for object storage?
Data lakehouse architectures are purposefully designed to leverage the scalability and cost
effectiveness of object storage systems, such as Amazon Web Services (AWS) S3, Google Cloud Storage
and Azure Blob Storage. This integration enables the seamless management of diverse
data types, structured, semi-structured and unstructured, within a unified platform.
Key features of data lakehouse architectures on object storage include the following. Unified storage layer.
By utilizing object storage, data lakehouses can store vast amounts of data in its native format,
eliminating the need for complex data transformations before storage.
This approach simplifies data ingestion and enables compatibility with various data sources.
Scalability. Object storage systems are inherently scalable,
allowing data lakehouses to accommodate growing data volumes without significant infrastructure
changes. This scalability enables organizations to efficiently manage expanding data sets and
evolving analytics requirements. Flexibility. Best-in-class object storage can be deployed
anywhere: on premises, in private clouds, in public clouds, in co-location facilities, in data centers,
and at the edge. This flexibility allows organizations to tailor their data infrastructure to
specific operational and geographic needs. By integrating these elements, data lakehouse architectures
offer a comprehensive solution that combines the strengths of data lakes and data warehouses. This design
facilitates efficient data storage, management and analysis, all built upon the foundation of scalable
and flexible object storage systems. Open table formats defined. An open table format is a standardized,
open source framework designed to manage large-scale analytic datasets efficiently. It operates as
a metadata layer atop data files, facilitating seamless data management and access across various
processing engines. Here is an overview of the three open table formats: Iceberg, Delta
Lake and Hudi. Apache Iceberg is a high-performance table format designed for massive
datasets. Its architecture prioritizes efficient read operations and scalability, making it a cornerstone
for modern analytics workloads. One of its defining features is the separation of
metadata from data, allowing efficient snapshot-based isolation and planning. This design eliminates
costly metadata operations, enabling parallel query planning across large datasets.
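As a sketch of what that separation looks like in practice, Iceberg exposes the table's snapshot history as a queryable metadata table, and a read can be pinned to a single snapshot. This reuses the placeholder session and lake.db.events table from the earlier sketch; the snapshot id is a made-up value:

```python
from pyspark.sql import SparkSession

# Reuses the Iceberg-enabled session configured in the earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Each commit produces a snapshot; queries plan against exactly one of them.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.db.events.snapshots
""").show()

# Pin a read to a specific snapshot (id is a placeholder taken from the listing above).
df = (spark.read
      .option("snapshot-id", 6023452345234)
      .format("iceberg")
      .load("lake.db.events"))
```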
Recent advancements in the Iceberg ecosystem highlight its growing adoption across the industry.
S3 Tables simplify data management by enabling query engines to directly access table metadata and
data files stored in S3-compatible systems, reducing latency and improving interoperability.
Meanwhile, Databricks' acquisition of Tabular underscores Iceberg's primary role in open
lakehouse platforms and emphasizes its focus on performance and governance.
Additionally, Snowflake's decision to make Polaris open source demonstrates the industry's
commitment to openness and interoperability, further solidifying Iceberg's position as a leading
table format. Delta Lake. Originally developed by Databricks, Delta Lake was closely tied to
Apache Spark. It is fully compatible with Spark APIs and integrates with Spark's structured
streaming, allowing for both batch and streaming operations. One key feature of
Delta Lake is that it employs a transaction log to record all changes made to data, ensuring
consistent views and write isolation. This design supports concurrent data operations, making it
suitable for high-throughput environments. Apache Hudi. Apache Hudi is designed to address
the challenges of real-time data ingestion and analytics, particularly in environments requiring
frequent updates. Its architecture supports write-optimized storage (WOS) for efficient data
ingestion and read-optimized storage (ROS) for querying, enabling up-to-date views of data sets.
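A minimal sketch of that ingestion path, assuming PySpark with the Hudi Spark bundle on the classpath; the table name, key fields and target path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updates_df = spark.createDataFrame(
    [(42, "2025-01-01T00:00:00", "completed")],
    ["ride_id", "event_ts", "status"])

# Record-level upsert: rows are matched on the record key and the latest
# event_ts wins, rather than appending duplicates.
hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.operation": "upsert",
}

(updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://warehouse/rides/"))
```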
By processing changes in data streams incrementally, Hudi facilitates real-time analytics at
scale. Features like Bloom filters and global indexing optimize I/O operations, improving query
and write performance. Additionally, Hudi includes tools for clustering, compaction and cleaning,
which aid in maintaining table organization and performance. Its capability to handle record-level updates
and deletes makes it a practical choice for high-velocity data streams and scenarios requiring
compliance and strict data governance. Comparing open table formats. Apache Iceberg, Delta Lake
and Apache Hudi each bring unique strengths to data lakehouse architectures. Here's a comparative overview of
these formats based on key features. ACID transactions. All three formats provide ACID
compliance, ensuring reliable data operations. Iceberg employs snapshot isolation for transactional
integrity, Delta Lake utilizes a transaction log for consistent views and write isolation,
and Hudi offers file-level concurrency control for high-concurrency scenarios. Schema evolution.
Each format supports schema changes, allowing the addition, deletion or modification of columns.
Iceberg offers flexible schema evolution without rewriting existing data, Delta Lake enforces schema at
runtime to maintain data quality, and Hudi provides pre-commit transformations for additional flexibility.
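For example, in Iceberg a schema change is a metadata-only operation issued through Spark SQL; the table and column names below are placeholders from the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adding, documenting and dropping columns rewrites metadata, not data files.
spark.sql("ALTER TABLE lake.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN payload COMMENT 'raw JSON body'")
spark.sql("ALTER TABLE lake.db.events DROP COLUMN country")
```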
Partition evolution.
Iceberg supports partition evolution, enabling seamless updates to partitioning schemes without rewriting existing data.
Delta Lake allows partition changes but may require manual intervention for optimal performance,
while Hudi offers fine-grained clustering as an alternative to traditional partitioning.
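A sketch of Iceberg partition evolution, assuming the Iceberg SQL extensions from the earlier session; the table and timestamp column are placeholders. Existing files keep their original partition spec, and only new writes use the new one:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start partitioning new data by day, then switch to hourly without rewriting history.
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD hours(ts)")
```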
Time travel. All three formats offer time travel capabilities, allowing users to query historical data states.
This feature is invaluable for auditing and debugging purposes.
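As a sketch, time travel in Iceberg and Delta Lake looks roughly like this; the timestamps, versions and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg: query the table as it existed at a timestamp or a specific snapshot.
spark.sql("SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'")
spark.sql("SELECT count(*) FROM lake.db.events VERSION AS OF 6023452345234")

# Delta Lake: read an earlier version of a table at a given storage path.
old = (spark.read
       .format("delta")
       .option("versionAsOf", 3)
       .load("s3a://warehouse/deltatbl/"))
```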
Widespread adoption. Iceberg is the most widely adopted open table format by the data community.
From Databricks to Snowflake to AWS, many large platforms have invested in Iceberg.
If you're already part of these ecosystems or thinking about joining them, Iceberg might naturally
stand out. Indexing. Hudi provides multimodal indexing capabilities, including Bloom filters and
record-level indexing, which can enhance query performance. Delta Lake and Iceberg rely on metadata
optimizations but do not offer the same level of indexing flexibility. Concurrency and streaming.
Hudi is designed for real-time analytics with advanced concurrency control and built-in tools
like DeltaStreamer for incremental ingestion. Delta Lake supports streaming through change data feed,
and Iceberg provides basic incremental read capabilities. These distinctions highlight that while all
three formats provide a robust foundation for modern data architectures, the optimal choice depends
on specific workload requirements and organizational needs.
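To illustrate the streaming distinction above, here is a sketch of a Delta Lake change data feed read; it assumes the table was created with delta.enableChangeDataFeed set to true, and the path and starting version are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream row-level changes (inserts, updates, deletes) from a Delta table.
changes = (spark.readStream
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .load("s3a://warehouse/deltatbl/"))

query = (changes.writeStream
         .format("console")
         .start())
```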
Performance expectations. Achieving optimal performance in data lakehouse architectures
is essential to fully leverage the capabilities of open table formats.
This performance hinges on the efficiency of both the storage and compute layers.
The storage layer must provide low latency and high throughput to accommodate large-scale
analytics demands.
Object storage solutions should facilitate rapid data access and support high-speed transfers,
ensuring smooth operations even under high workloads.
Additionally, efficient input/output operations per second (IOPS) are crucial for handling
numerous concurrent data requests, enabling responsive data interactions without bottlenecks.
Equally important is compute layer performance, which directly influences data processing
and query execution speeds.
Compute engines must be scalable to manage growing data volumes and user queries without compromising
performance. Employing optimized query execution plans and resource management strategies can further
enhance processing efficiency. Additionally, compute engines need to integrate seamlessly with
open table formats to fully utilize advanced features like ACID transactions, schema evolution,
and time travel. The open table formats also incorporate features designed to boost performance.
These also need to be configured properly and leveraged for a fully optimized stack. One such feature is
efficient metadata handling, where metadata is managed separately from the data, which enables
faster query planning and execution. Data partitioning organizes data into subsets, improving query
performance by reducing the amount of data scanned during operations. Support for schema evolution
allows table formats to adapt to changes in data structure without extensive data rewrites,
ensuring flexibility while minimizing processing overhead. By focusing on these performance aspects
across the storage and compute layers, organizations can ensure that their data lakehouse environments
are efficient, scalable and capable of meeting the demands of modern analytics and AI workloads.
These considerations enable open table formats to reach their full potential, delivering the high
performance needed for real-time insights and decision-making. Open data lakehouses and
interoperability. The data lakehouse architecture builds on open table formats to deliver a unified
approach to data management. However, achieving true openness requires more than just adopting
open table formats.
Open data lakehouses must integrate modular, interoperable and open source components such as storage engines, catalogs and compute engines to enable seamless operation across diverse platforms.
The open table formats are open standards and, by their very design, support interoperability and openness throughout the stack.
Yet practical challenges remain, such as ensuring catalog interoperability and avoiding dependencies on proprietary services for table management.
The recent introduction of tools like Apache XTable demonstrates progress toward universal compatibility, providing a path to write-once, query-anywhere systems.
It's important to note that XTable doesn't allow you to write in multiple open table formats, just read.
Hopefully future innovations in interoperability will continue to build on these and other projects surrounding the open table formats.
The future of open table formats. As the landscape of data lakehouses continues to evolve, certain trends and advancements
are likely to shape its future. A significant area of growth will likely be the integration of
AI and machine learning (ML) workloads directly into the lakehouse architecture. For the storage layer,
this could look like platforms with direct integrations to key AI platforms like Hugging Face and
OpenAI. For the compute layer, AI integration could lead to creating specialized compute engines
optimized for ML algorithms, enhancing the efficiency of training and inference processes within the
lakehouse ecosystem. Another area of significant growth will likely be in the open source community.
When major private companies like Databricks, Snowflake and AWS start to throw their weight
around, it's easy to forget that the open table formats are true open standards.
Iceberg, Hudi and Delta Lake are available to any contributor for collaboration or integration
into open source tools and platforms. In other words, they are part of a vibrant and growing
open standard data ecosystem. It is important to remember that open source begets open source;
we will see the continued proliferation of open source applications,
add-ons, catalogs and innovations in this space. Finally, adoption of open table formats will
continue to rise as enterprises build large-scale, high-performance data lakehouses for AI and other
advanced use cases. Some industry professionals equate the popularity of open table formats
with the rise and supremacy of Hadoop in the early 2000s. Big data is dead. Long live big data. Built for
today and tomorrow. Combining open table formats with high-performance object storage
allows architects to build data systems that are open, interoperable and capable of meeting
the demands of AI, ML and advanced analytics. By embracing these technologies, organizations can
create scalable and flexible architectures that drive innovation and efficiency in the data-driven
era. Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
Thank you.
