The Good Tech Companies - How Iceberg + AIStor Power the Modern Multi-Engine Data Lakehouse
Episode Date: November 26, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/how-iceberg-aistor-power-the-modern-multi-engine-data-lakehouse, written by @minio. Apache Iceberg delivers schema evolution, partition evolution, time travel, and multi-engine compatibility for modern lakehouses. When combined with AIStor's high-performance object storage, it enables fast analytics, scalable AI/ML workloads, and flexible SQL-driven data engineering. This guide explains architecture, features, and hands-on setup.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How Iceberg + AIStor Power the Modern Multi-Engine Data Lakehouse, by MinIO.
Apache Iceberg seems to have taken the data world by (snow)storm.
Initially incubated at Netflix by Ryan Blue, also of Tabular (now Databricks) fame,
it was eventually donated to the Apache Software Foundation, where it currently resides.
At its core, it is an open table format for huge-scale data sets,
think hundreds of TBs to hundreds of PBs. With AI gobbling up vast amounts of data for model
creation, tuning, and real-time inference, the need for this technology has only increased since
its initial development. Iceberg is a multi-engine compatible format. That means that Spark, Trino,
Flink, Presto, Snowflake and Dremio can all operate independently and simultaneously on the same
data set. Iceberg supports SQL, the universal language of data analysis, and provides advanced features like
atomic transactions, schema evolution, partition evolution, time travel, rollback, and
zero-copy branching. This post examines how Iceberg's features and design, when paired with
AIStor's high-performance storage layer, offer the flexibility and reliability needed to build
data lakehouses. Related: Data Lakehouse Solutions by MinIO. Goals for a modern open
table format. The promise of a lakehouse lies in its ability to unify the best of data lakes and
warehouses. Data lakes stored on object storage were designed to store massive amounts of
raw data in its native format, offering scalability and cost-effectiveness. However,
data lakes on their own struggled with ACID-like data governance and the ability to efficiently
query the data stored in them. Warehouses, on the other hand, were optimized for structured
data and SQL analytics but fell short in handling semi-structured or unstructured data at scale.
Iceberg and other open table formats like Apache Hudi and Delta Lake represent a leap forward.
These formats make data lakes behave like warehouses all while retaining the initial flexibility
and scalability of object storage. This new stack of open table formats and open storage can
support diverse workloads such as AI, machine learning, advanced analytics, and real-time
visualization. Iceberg distinguishes itself with its SQL-first approach and focus on multi-engine
compatibility. Here's a look at how this infrastructure could function. The query engines can
sit directly on top of the object storage in your stack and query the data inside it without
migration. The old paradigm of migrating data into a database's storage is no longer relevant.
Instead, you can and should be able to query your Iceberg tables from anywhere.
Analytics tools and visualizations can often connect directly to the storage where
your Iceberg tables live, but more typically they interface with your query engine for a more streamlined
user experience. Finally, AI/ML follows the same design principle as the query engines: AI/ML
tools will use Iceberg tables directly in object storage without migration. While not all AI/ML
tools utilize Iceberg at this time, there is a deepening interest in using Iceberg for
AI/ML workloads, particularly for its capability of maintaining multiple versions of models
and datasets without difficulty. What's special about Iceberg? Apache Iceberg distinguishes itself
through a combination of features and design principles that set it apart from alternatives like
Apache Hudi and Delta Lake. Schema Evolution. Iceberg offers comprehensive support for schema
evolution, allowing users to add, drop, update, or rename columns without necessitating
a complete rewrite of existing data. This flexibility ensures that changes to the data model can
be implemented seamlessly, maintaining operational continuity. Partition Evolution.
Iceberg supports partition evolution, enabling modifications to partitioning schemes over time without
requiring data rewrites. This capability allows for dynamic optimization of data layout as access
patterns change. Metadata management. Iceberg employs a hierarchical metadata structure,
including manifest files, manifest lists, and metadata files. This design enhances query planning
and optimization by providing detailed information about data files, facilitating efficient data access,
and management. Multi-engine compatibility. Designed with openness in mind, Iceberg is compatible
with various processing engines such as Apache Spark, Flink, Trino, and Presto. This interoperability
provides organizations with the flexibility to choose the tools that best fit their needs and
effectively commoditizes query engines, driving down pricing and driving up innovation in these products.
Open source community. As with all of the open table formats, Iceberg benefits from a diverse
and active community, contributing to its robustness and continuous improvement. Object Storage
and Iceberg. Object storage forms the foundation of modern lakehouse architectures, providing the
scalability, durability, and cost efficiency required to manage vast data sets across many environments.
In a lakehouse architecture, the storage layer plays a critical role in ensuring the durability
and availability of data, while an open table format like Iceberg manages the metadata.
AIStor is particularly well-suited for these requirements. For example, for high-throughput workloads like model training, AIStor's support for S3 over RDMA ensures low-latency access to data. This feature makes the combination of the two technologies a very effective solution, particularly for large-scale AI and analytics pipelines. High performance is critical to the success of lakehouse initiatives. It doesn't matter how cool your table format is if it can't serve up queries as fast as your users require.
Try it out yourself.
Prerequisites.
You'll need to install Docker and Docker Compose.
When both are required, it's often easier to install Docker Desktop.
Download and install Docker Desktop from the official Docker website.
Follow the installation instructions provided in the installer for your operating system.
Open Docker Desktop to ensure it's running.
If you would prefer, you can also verify the installation of Docker and Docker Compose
by opening a terminal and running the version commands.
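The commands themselves were omitted from the audio; the standard version checks below are a reasonable stand-in (the exact Compose invocation depends on whether you have the Compose plugin or the older standalone binary):

```shell
# Confirm the Docker engine CLI is installed
docker --version

# Confirm Compose is available (plugin form; use `docker-compose --version`
# for the older standalone binary)
docker compose version
```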
Iceberg tutorial.
This section demonstrates how to integrate Iceberg and AIStor for a robust lakehouse architecture.
We'll use the Apache Iceberg Spark Quickstart to set up our initial environment.
Step 1. Set up with Docker Compose.
Follow the Apache Iceberg Spark Quick Start guide to launch a containerized environment with Spark and Iceberg pre-configured.
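The Compose file itself was not captured in the audio. The abridged sketch below is modeled on the file published in the Apache Iceberg Spark Quickstart (a tabulario/spark-iceberg Spark container, an iceberg-rest catalog, and MinIO); treat the guide's copy as authoritative, since image tags, some ports, and the helper mc service are trimmed here:

```yaml
version: "3"
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    depends_on: [rest, minio]
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - "8888:8888"   # Jupyter
      - "8080:8080"   # Spark UI
  rest:
    image: tabulario/iceberg-rest
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
    ports:
      - "8181:8181"
  minio:
    image: minio/minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # Console
```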
The first step in the guide is to copy the Compose configuration into a file named docker-compose.yml. Next, in a terminal window, navigate to the directory where you saved the YAML file and start the Docker containers with docker-compose up -d. You can then run docker exec -it spark-iceberg spark-sql to start a Spark SQL session. Step 2. Create a table. Open a Spark SQL shell and create a table, for example CREATE TABLE my_table (id bigint, data string, category string) PARTITIONED BY (category). Inspect your MinIO console by navigating to http://127.0.0.1:9001 and logging in with the credentials from the YAML file (AWS_ACCESS_KEY_ID=admin, AWS_SECRET_ACCESS_KEY=password). Observe the iceberg/my_table/metadata path created for the table. Initially, no data files exist, only metadata that defines the schema and partitions. Step 3. Insert data. Insert some mock data, e.g. INSERT INTO my_table VALUES (1, 'a', 'music'), (2, 'b', 'video'). This operation creates data files in the appropriate partitions (my_table/data/category=...) and updates the metadata. Step 4. Query the data. Run a basic query, such as SELECT * FROM my_table, to validate. Your output should look something
like the following. Testing key Iceberg features. Now that you've established a baseline
data lakehouse with data that you can manipulate, let's check out the features that make Iceberg
unique: schema evolution and partition evolution. Schema evolution is one of Apache
iceberg's most powerful and defining features, addressing a significant pain point in traditional
warehouse systems. It allows you to modify a table schema without requiring expensive
and time-consuming data rewrites, which is particularly beneficial for large-scale data sets.
This capability enables organizations to adapt their data models as business needs evolve
without disrupting ongoing queries or operations. In iceberg, schema changes are handled
purely at the metadata level. Each column in a table is assigned a unique ID,
ensuring that changes to the schema do not affect the underlying data files.
For instance, if a new column is added, Iceberg assigns it a new ID without reinterpreting
or rewriting existing data.
This avoids errors and ensures backward compatibility with historical data.
Here's an example of how schema evolution works in practice.
Suppose you need to add a new column, buyer, to store additional information about transactions.
You can execute the following SQL command.
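The command itself was elided from the audio; a plausible form, assuming a string-typed buyer column on the tutorial's my_table:

```sql
-- Metadata-only change: no existing data files are rewritten
ALTER TABLE my_table ADD COLUMN buyer STRING;
```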
This operation updates the table's metadata, allowing new data files to include the buyer column.
Older data files remain untouched, and queries can be written to handle both new and old data seamlessly.
Similarly, if a column is no longer required, you can remove it without affecting the stored data.
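Again the spoken command was elided; dropping the tutorial table's category column would look like:

```sql
-- Removes category from the schema; stored data files are not rewritten
ALTER TABLE my_table DROP COLUMN category;
```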
This action updates the schema to exclude the category column.
Iceberg ensures that any associated metadata is cleaned up while maintaining the integrity
of historical queries that might still reference older data versions.
Iceberg schema evolution also supports renaming columns, changing column types, and rearranging column order,
all through efficient metadata-only operations. These changes are made without altering the data files,
making schema modifications instantaneous and cost-effective. This approach is especially
advantageous in environments with frequent schema changes, such as those driven by AI/ML experiments,
dynamic business logic, or evolving regulatory requirements. Partition evolution.
Iceberg supports partition evolution that can handle the tedious and error-prone tasks of producing partition values for rows in a table.
Users focus on adding filters to the queries that solve business problems and do not worry about how the table is partitioned.
Iceberg automatically avoids reads from unnecessary partitions.
Iceberg handles the intricacies of partitioning and changing the partition scheme of a table for you,
greatly simplifying the process for end users.
You can define partitioning or let iceberg take care of it for you.
Iceberg likes to partition on a timestamp, such as event time.
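The DDL was not captured in the audio, so here is a hedged sketch of a timestamp-partitioned table; the catalog, table, and column names (my_catalog.db.events, event_time) are illustrative:

```sql
-- Hidden partitioning: Iceberg derives the partition value months(event_time)
-- from the timestamp, so writers never compute or supply it themselves
CREATE TABLE my_catalog.db.events (
    id BIGINT,
    event_time TIMESTAMP,
    payload STRING)
USING iceberg
PARTITIONED BY (months(event_time));

-- Readers just filter on the timestamp; Iceberg prunes partitions for them
SELECT count(*) FROM my_catalog.db.events
WHERE event_time >= TIMESTAMP '2025-01-01 00:00:00';
```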
Partitions are tracked by snapshots and manifests.
Queries no longer depend on a table's physical layout.
Because of this separation between physical and logical tables,
Iceberg tables can evolve partitions over time as more data is added.
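The concrete commands were not captured in the audio. Assuming an illustrative table my_catalog.db.events currently partitioned by months(event_time), evolving the partition spec is a metadata-only ALTER:

```sql
-- Metadata-only change: existing files keep the monthly spec,
-- while new data is written with daily partitions
ALTER TABLE my_catalog.db.events ADD PARTITION FIELD days(event_time);
ALTER TABLE my_catalog.db.events DROP PARTITION FIELD months(event_time);
```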
We now have two partitioning schemes for the same table.
From now on, query plans are split,
using the old partition scheme to query old data,
and the new partition scheme to query new data. Iceberg takes care of this
for you. People querying the table don't need to know that data
is stored using two partition schemes. Iceberg does this through a combination of behind-the-scenes
where clauses and partition filters that prune out data files without matches. Time travel and
rollback. Every write to Iceberg tables creates a new snapshot. Snapshots are like versions
and can be used to time travel and roll back, just as we do with AIStor versioning
capabilities. Snapshots are kept manageable by expiring them (for example, with the expire_snapshots procedure) so the system is
maintained well. Time travel enables reproducible queries that use exactly the same table snapshot,
or lets users easily examine changes. Version rollback allows users to quickly correct problems by
resetting tables to a good state. As tables are changed, iceberg tracks each version as a snapshot
and then provides the ability to time travel to any snapshot when querying the table. This can be very
useful if you want to run historical queries or reproduce the results of previous queries, perhaps
for reporting. Time travel can also be helpful when testing new code changes because you can test
new code with a query of known results. To see the snapshots that have been saved for a table, run a query
against the snapshots metadata table; your output should look something like this. Some examples: you can do incremental
reads using snapshots, but you must use Spark, not Spark SQL. You can also roll back the
table to a point in time or to a specific snapshot, as in these two examples.
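The spoken examples were elided from the audio; the sketch below shows hedged versions of the snapshot listing, time travel, rollback, and incremental-read operations described above (table names, snapshot IDs, and timestamps are illustrative):

```sql
-- List saved snapshots via the snapshots metadata table
SELECT committed_at, snapshot_id, operation FROM my_table.snapshots;

-- Time travel: query the table as of a timestamp or a snapshot ID
SELECT * FROM my_table TIMESTAMP AS OF '2025-11-01 00:00:00';
SELECT * FROM my_table VERSION AS OF 123456789;

-- Roll back via Spark procedures (prefix with your catalog name if needed)
CALL system.rollback_to_timestamp('db.my_table', TIMESTAMP '2025-11-01 00:00:00');
CALL system.rollback_to_snapshot('db.my_table', 123456789);

-- Incremental reads between snapshots need the DataFrame API, not Spark SQL:
--   spark.read.format("iceberg")
--        .option("start-snapshot-id", "123456789")
--        .option("end-snapshot-id", "987654321")
--        .load("db.my_table")
```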
Iceberg supports all the expressive SQL commands like row-level delete, merge, and update,
and the biggest thing to highlight is that iceberg supports both eager and lazy strategies.
We can encode all the things we need to delete (for example, for GDPR or CCPA compliance)
but not rewrite all those data files immediately. We can lazily collect garbage as needed, and that
helps efficiency on the huge tables Iceberg supports. For example, you can delete all
records in a table that match a specific predicate; DELETE FROM my_table WHERE category = 'video' will remove all rows from the
video category. Alternatively, you could use CREATE TABLE AS SELECT or REPLACE TABLE AS SELECT to
accomplish this, and MERGE INTO lets you merge two tables very easily. Data Engineering. Iceberg is the foundation
for the open analytic table standard: it uses SQL behavior and applies data warehouse
fundamentals to fix problems before we know we have them. With declarative data engineering,
we can configure tables and not worry about changing each engine to fit the needs of the data.
This unlocks automatic optimization and recommendations. With safe commits, data services are
possible, which helps avoid humans babysitting data workloads. Here are some examples of these
types of configurations. To inspect a table's history, snapshots, and other metadata,
Iceberg supports querying metadata tables. Metadata tables are identified by adding the metadata table
name, for example history, after the original table name in your query: my_table.files displays a table's data
files, my_table.manifests its manifests, my_table.history its table history, and my_table.snapshots its snapshots. You can also join snapshots
to table history to see the application that wrote each snapshot. Now that you've learned the
basics, load some of your data into Iceberg, and learn more from the Iceberg documentation.
Integration. Various query and execution engines have implemented Iceberg connectors, making it easy to
create and manage Iceberg tables. The engines that support Iceberg include Spark, Flink, Presto, Trino, Dremio,
and Snowflake, and the list is growing. This extensive integration landscape ensures that organizations
can adopt Apache iceberg without being constrained to a single processing engine, promoting flexibility
and interoperability in their data infrastructure. Catalogs. It would be remiss in
any definitive guide not to mention catalogs.
Iceberg catalogs are central to managing table metadata and facilitating connections between
datasets and query engines. These catalogs maintain critical information such as table schemas,
snapshots, and partition layouts, enabling Iceberg's advanced features like time travel,
schema evolution, and atomic updates. Several catalog implementations are available to meet diverse
operational needs. For instance, Polaris offers a scalable, cloud-native
cataloging solution tailored to modern data infrastructures, while Dremio's Nessie introduces
versioning with Git-like semantics, enabling teams to track changes to data and metadata with
precision. Traditional solutions like Hive Metastore are still widely used, particularly for
backward compatibility with legacy systems. It's cool to build data lakes with Iceberg and
AIStor. Apache Iceberg gets a lot of attention as a table format for data lakes. The growing
open-source community and increasing number of integrations from multiple cloud providers and
application frameworks mean that it's time to take Iceberg seriously and start experimenting,
learning, and planning on integrating it into your existing data lake architecture.
Pair Iceberg with AIStor for multi-cloud data lakehouses and analytics. As you get started with
Iceberg and AIStor, please reach out and share your experiences or ask questions through our Slack
channel. Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
