The Good Tech Companies - The Architect’s Guide to A Modern Datalake Reference Architecture
Episode Date: June 7, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/the-architects-guide-to-a-modern-datalake-reference-architecture. A Modern Datalake is one-half Data Warehouse and one-half Data Lake. It uses object storage for scalability and performance, and Open Table Formats (OTFs) make it seamless for object storage to serve as the underlying storage solution for a data warehouse. Layering provides a clear way to group services that provide similar functionality. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #modern-datalake, #object-storage, #open-table-formats, #software-development, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
The Architect's Guide to a Modern Data Lake Reference Architecture, by MinIO.
An abbreviated version of this post appeared on The New Stack on March 26, 2024.
Businesses aiming to maximize their data assets are adopting scalable,
flexible, and unified data storage and analytics approaches.
This trend is driven by enterprise architects
tasked with crafting infrastructures that align with evolving business demands.
A modern data lake architecture addresses this need by integrating the scalability and flexibility
of a data lake with the structure and performance optimizations of a data warehouse.
This post serves as a guide to understanding and implementing a modern data lake architecture.
What is a modern data lake? A modern data lake is one half data warehouse and one half data lake, and it uses object storage for everything. This may sound like a marketing trick, putting two products in one package and calling it a new product, but the data warehouse presented in this post is better than a conventional data warehouse because it uses object storage and therefore provides all the benefits of object storage in terms of scalability
and performance. Organizations that adopt this approach will only pay for what they need,
facilitated by the scalability of object storage, and if blazing speed is needed,
they can equip their underlying object store with NVMe drives connected by a high-end network.
The use of object storage in this fashion has been made possible by the rise of open table formats
(OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake, which are specifications that,
once implemented, make it seamless for object storage to be used as the underlying storage
solution for a data warehouse. These specifications also provide features that may
not exist in a conventional data warehouse, for example, snapshots, also known as time travel,
schema evolution, partitions, partition evolution, and zero-copy branching.
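The snippet below is a minimal sketch of what this looks like in practice: a PySpark session configured to use an Apache Iceberg catalog backed by S3-compatible object storage, followed by a time-travel query against a snapshot. The endpoint, bucket, credentials, table name, and jar versions are placeholders, and the Iceberg Spark runtime plus an s3a filesystem jar are assumed to be on the classpath; treat this as one possible starting point, not a prescribed setup.

```python
# Sketch: an Open Table Format (Iceberg) table on S3-compatible object storage.
# Endpoint, credentials, and names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("otf-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed endpoint
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Create an Iceberg table directly on object storage and write to it.
spark.sql("CREATE TABLE IF NOT EXISTS lake.demo.sales (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO lake.demo.sales VALUES (1, 9.99)")

# Time travel (snapshots): query the table as of an earlier snapshot.
first = spark.sql(
    "SELECT snapshot_id FROM lake.demo.sales.snapshots ORDER BY committed_at"
).first()
spark.sql(f"SELECT * FROM lake.demo.sales VERSION AS OF {first['snapshot_id']}").show()
```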
But, as stated above, the modern data lake is more than just a fancy data warehouse.
It also contains a data lake for unstructured data. The OTFs also
provide integration to external data in the data lake. This integration allows external data to be
used as a SQL table if needed, or the external data can be transformed and routed to the data
warehouse using high-speed processing engines and familiar SQL commands. So the modern data lake is
more than just a data warehouse and a data lake in
one package with a different name. Collectively, they provide more value than what can be found
in a conventional data warehouse or a standalone data lake. A conceptual architecture. Layering is a
convenient way to present all the components and services needed by the modern data lake.
Layering provides a clear way to group services that provide similar functionality.
It also allows for a hierarchy to be established, with consumers on top and data sources,
with their raw data, on the bottom. The layers of the modern data lake, from top to bottom, are as follows.
The consumption layer contains the tools used by power users to analyze data. It also contains applications and AI/ML workloads that will programmatically access the modern data lake.
The semantic layer is an optional metadata layer for data discovery and governance.
The processing layer contains the compute clusters needed to query the modern data lake. It also contains compute clusters used for distributed model training. Complex transformations can occur in the processing layer by taking advantage of the storage layer's integration between the data lake and the data warehouse.
In the storage layer, object storage is the primary storage service for the modern data lake. However, MLOps tools may need other storage services, such as relational databases, and if you are pursuing generative AI, you will need a vector database.
The ingestion layer contains the services needed to receive data. Advanced ingestion layers will be able to retrieve data based on a schedule. The modern data lake should support a variety of protocols, and it should also support data arriving in streams and batches. Simple and complex data transformations can occur in the ingestion layer.
The data sources layer is technically not a part of the modern data lake solution, but it is included in this post because a well-constructed modern data lake must support a variety of data sources with varying capabilities for sending data.
The diagram below visually
depicts all the layers described above and all the capabilities that may be needed to implement
these layers. This is an end-to-end architecture where the heart of the platform is a modern data
lake. Rather than focusing on just the processing layer and the storage layer, this architecture
also shows components needed to ingest, transform, discover, govern, and consume
data. The tools needed to support important use cases that depend on a modern data lake are also
included, such as MLOPs storage, vector databases, and machine learning clusters. The conceptual
nature of the approach used in this post is important. If the diagram above made use of
product names, then meaning would be lost.
Product names are rarely chosen for meaning; rather, they are chosen for brand awareness and memory retention. To this end, our conceptual architecture uses simple nouns where the feature
provided is intuitive. The next section will provide an example of a concrete implementation
for the reader familiar with the more popular big data projects and products in the market today. However, the reader is encouraged to refer to the conceptual diagram when making
decisions for their organization. Finally, there are no arrows. Arrows typically depict data flow
and dependencies. Showing all possible data flows and dependencies would unnecessarily complicate
the diagram. A better approach is to look at data flow and
dependencies in the context of a use case. Once a few components are isolated in the context of a
use case, then data flow and dependencies can be more clearly illustrated. A concrete architecture
The purpose of this section is to ground the design of our reference architecture with concrete
open-source examples. For the architect eager to dive in and start building,
the projects and products shown below are free to use in a proof of concept.
When your POC graduates to a funded project that will one day run in production,
then be sure to check open-source licenses and terms of use for all software used in your POC.
A few words on data sources The applications, devices,
and vendors that feed your modern data lake come in a variety of flavors, and so does their data. On-premises modern applications may be able to stream well-structured data in real time using formats such as Avro and Parquet.
On the other hand, older legacy applications may only be able to send simple files in batches,
such as XML, JSON, and CSVs. Data vendors may not send data at all,
expecting their customers to retrieve data. Mobile apps, websites, IoT devices, and social media apps
will typically send application logs and other telemetry usage statistics to your ingestion layer.
Log analytics is a popular use case for a modern data lake. Additionally, they may send images and audio files to be used within AI/ML workloads.
Finally, organizations looking to take advantage of generative AI will need to store documents
found in file shares and portals such as SharePoint portal server and Confluence in the data lake.
The modern data lake needs to be able to interface with all these data sources efficiently and
reliably, getting the data to either data lake storage or data warehouse storage.
Onboarding data is the primary purpose of the ingestion layer of our architecture.
This requires your ingestion layer to support a variety of protocols capable of receiving
streamed data and batched data. Let's investigate the components of this layer next.
The ingestion layer. The ingestion
layer is the on-ramp to your modern data lake. It is responsible for ingesting data into the
data storage layer. Structured data from sources that designed their feeds for the data warehouse
side of the modern data lake can bypass the data lake and send their data directly to the data
warehouse. On the other hand, sources that did not design their feeds in such a fashion will need to
have their data sent to the data lake, where it can be transformed before being ingested into the
data warehouse. The ingestion layer should be able to receive and retrieve data. Internal lines of
business (LOB) applications may have been given the mandate to send their data via streaming or
batching. For these applications, the ingestion layer needs to provide an endpoint for receiving
the data. However, data vendors and other external data sources may not be so willing to deliver
data. The ingestion layer should also provide scheduled retrieval capabilities. For example,
a data vendor may provide new datasets at the first of every month. Scheduled retrieval
capabilities will allow the ingestion layer to connect and download data at the correct time.
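As a rough illustration of scheduled retrieval, the sketch below polls for the first of the month and lands a vendor file in the data lake. The functions fetch_from_vendor and land_in_lake are hypothetical placeholders, and a real deployment would delegate the scheduling to cron, Airflow, or a similar tool rather than a sleep loop.

```python
# Sketch of scheduled retrieval: pull a vendor dataset on the first of each
# month and land it in the data lake. Both helper functions are placeholders.
import time
from datetime import datetime, timezone

def fetch_from_vendor() -> bytes:
    """Placeholder: download this month's dataset from the vendor."""
    return b""

def land_in_lake(key: str, payload: bytes) -> None:
    """Placeholder: write the payload to data lake storage under the given key."""
    ...

while True:
    now = datetime.now(timezone.utc)
    if now.day == 1:
        land_in_lake(f"vendor/{now:%Y-%m}.csv", fetch_from_vendor())
    time.sleep(24 * 60 * 60)   # check once a day; a real scheduler replaces this loop
```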
Streaming is the best way to transmit data to a modern data lake, or to any destination for that matter. Streaming implies the use of a messaging service
deployed in a way that makes it resilient, available and highly performant. The messaging
service usually provides a queuing mechanism that acknowledges the receipt of a message
only upon successful storage of the message. The service then provides exactly-once delivery to a downstream
service that is responsible for saving the data in the message to either the data warehouse or
data lake. Note: some messaging services provide at-least-once delivery, requiring downstream
services to implement idempotent updates to the data source.
It is important to check the fine print of the service you end up using.
What is especially nice about this style of ingestion is that if the downstream service fails and does not acknowledge the successful processing of a message then the message will
reappear in the queue for future ingestion. Messaging services also provide dead letter
queues for messages that repeatedly fail.
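The sketch below illustrates this style of ingestion with kafka-python: messages are committed only after a successful write, and duplicates are skipped so that at-least-once delivery still produces idempotent updates. The topic, broker address, and save_to_lake function are assumptions, and the in-memory idempotency set stands in for a durable store.

```python
# Sketch of at-least-once consumption with idempotent handling (kafka-python).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                          # assumed topic name
    bootstrap_servers="kafka:9092",    # assumed broker
    group_id="lake-ingest",
    enable_auto_commit=False,          # acknowledge only after a successful write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

seen_ids = set()                       # stand-in for a durable idempotency store

def save_to_lake(event: dict) -> None:
    """Placeholder: write the event to data lake or data warehouse storage."""
    ...

for message in consumer:
    event = message.value
    if event["id"] in seen_ids:        # idempotent update: skip duplicates
        consumer.commit()
        continue
    save_to_lake(event)
    seen_ids.add(event["id"])
    consumer.commit()                  # commit only after storage succeeds
```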
Streaming ingestion is great, but in many cases, real-time insights are not needed.
In these situations, batch or mini-batch processing works fine and can be considerably simpler
to implement.
For batch uploads, the S3 API is your best option.
MinIO is S3-compliant, and any data source currently sending batch data to an S3
endpoint will work as-is, with only a connection change, once you switch over to the MinIO data lake.
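A minimal sketch of that connection change is shown below using boto3; the endpoint, bucket, object key, and credentials are placeholders.

```python
# Sketch: a batch upload to an S3-compatible endpoint such as MinIO via boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",   # assumed MinIO endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Existing S3-based batch jobs only need the endpoint/credentials change above;
# the upload call itself is unchanged.
s3.upload_file("daily_extract.csv", "raw-landing-zone", "2024/06/07/daily_extract.csv")
```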
However, many organizations may still prefer FTP/SFTP for its simplicity and ability to
run in highly constrained environments. MinIO also has support for FTP and SFTP.
This interface allows a data source to send data to
MinIO the same way it would send data to an FTP server. From an application or user's perspective,
moving data onto MinIO using SFTP is seamless since everything is essentially the same,
from policies, security, etc. The data storage layer.
The data storage layer is the bedrock that all other
layers depend upon. Its purpose is to store data reliably and serve it efficiently. There will be
an object storage service for the data lake side of the modern data lake, and there will be an object
storage service for the data warehouse. These two object storage services can be combined into one
physical instance of an object store if needed by using buckets to keep data warehouse storage separate from data lake storage. However, consider keeping them
separate and installed on different hardware if the processing layer will be putting different
workloads on these two storage services. For example, a common data flow is to have all new
data land in the data lake. Once in the data lake, it can be transformed and ingested into the data warehouse,
where it can be consumed by other applications and used for the purpose of data science,
business intelligence, and data analytics. If this is your data flow, then your modern data
lake will be putting more load on your data warehouse, and you will want to make sure it
is running on high-end hardware, storage devices, storage clusters, and network.
External table functionality allows
data warehouses and processing engines to read objects in the data lake as if they were SQL
tables. If the data lake is used as the landing zone for raw data, then this capability, along
with the data warehouse SQL capabilities, can be used to transform raw data before inserting it
into the data warehouse. Alternatively, the external table could be used, as is, and joined with other tables and resources
inside the data warehouse without it ever leaving the data lake. This pattern can help save on
migration costs and can overcome some data security concerns by keeping the data in one
place while, at the same time, making it available to outside services.
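A sketch of this pattern is shown below, assuming a Spark session already configured as in the earlier sketch; the raw object path and the target table lake.demo.orders are illustrative and assumed to exist. Raw objects in the data lake are exposed as a SQL view, lightly transformed, and inserted into a warehouse table.

```python
# Sketch: read data lake objects as a SQL table, transform, load to the warehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes the Iceberg/S3 config shown earlier

raw = spark.read.parquet("s3a://raw-landing-zone/orders/")   # raw data lake objects
raw.createOrReplaceTempView("raw_orders")

spark.sql("""
    INSERT INTO lake.demo.orders
    SELECT id, CAST(amount AS DOUBLE) AS amount, order_ts
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
```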
Most MLOps tools use a combination of an object store and a relational database. For example, an MLOps tool should store training metrics, hyperparameters,
model checkpoints, and dataset versions. Models and datasets should be stored in the data lake,
while metrics and hyperparameters will be more efficiently stored in a relational database.
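The sketch below shows that split using MLflow as one example of such a tool: parameters and metrics go to a tracking backend (commonly a relational database), while artifacts such as model checkpoints go to an object store. The tracking URI and file names are placeholders, and the artifact store location is assumed to be configured separately to point at the data lake.

```python
# Sketch of the metrics/artifacts split with MLflow (URIs are placeholders).
import mlflow

mlflow.set_tracking_uri("postgresql://mlflow:mlflow@db:5432/mlflow")  # assumed backend store

with mlflow.start_run():
    mlflow.log_param("learning_rate", 3e-4)      # hyperparameter -> relational store
    mlflow.log_metric("val_accuracy", 0.91)      # metric -> relational store
    mlflow.log_artifact("model.ckpt")            # checkpoint -> object storage artifact store
```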
If you are pursuing generative AI, you will need to build a custom corpus for your organization.
It should contain documents with knowledge that no one else has and only documents that
are true and accurate should be used.
Furthermore, your custom corpus should be built with a vector database.
A vector database indexes, stores, and provides access to
your documents alongside their vector embeddings, which are numerical representations of your
documents. Vector databases facilitate semantic search, which is needed for retrieval augmented
generation, a technique utilized by generative AI to marry information in your custom corpus to an
LLM's trained parametric memory.
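The sketch below illustrates the core idea of semantic search over a custom corpus: documents and their vector embeddings are indexed, and a query is matched by cosine similarity before the retrieved text is handed to the LLM. The embed function is a placeholder for a real embedding model, and a real vector database would persist and index these vectors rather than keep them in memory.

```python
# Sketch of semantic search for retrieval augmented generation (RAG).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector embedding for the given text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

corpus = {
    "policy.md": "Internal travel reimbursement policy ...",
    "runbook.md": "Steps to restore the payments service ...",
}
index = {name: embed(text) for name, text in corpus.items()}

def search(query: str, top_k: int = 1) -> list[str]:
    q = embed(query)
    scores = {
        name: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for name, v in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# The retrieved documents are passed to the LLM as context alongside the prompt.
print(search("How do I get a flight refunded?"))
```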
The processing layer. The processing layer contains the compute needed for all the workloads supported by the modern
data lake. At a high level, compute comes in two flavors, processing engines for the data warehouse
and clusters for distributed machine learning. The data warehouse processing engine supports
the distributed execution of SQL commands against the data in data warehouse storage.
Transformations that are a part of the ingestion process may also need the compute power in the
processing layer. For example, some data warehouses may wish to use a medallion architecture.
Others may choose a star schema with dimensional tables. These designs often require substantial
ETL against the raw data during ingestion.
The data warehouse used within a modern data lake disaggregates compute from storage.
So, if needed, multiple processing engines can exist for a single data warehouse data store.
This differs from a conventional relational database where compute and storage are tightly coupled, and there is one compute resource for every storage device. A possible
design for your processing layer is to set up one processing engine for each entity in the
consumption layer. For example, a processing cluster for business intelligence, a separate
cluster for data analytics, and yet another for data science. Each processing engine would query
the same data warehouse storage service; however, since each team has its own
dedicated cluster, they do not compete with each other for compute. If the business intelligence
team is running month-end reports that are compute intensive, then they will not interfere with
another team that may be running daily reports. Machine learning models, especially large language
models, can be trained faster if training is done in a distributed fashion. The machine learning
cluster supports distributed training. Distributed training should be integrated with an MLOPS tool
for experiment tracking and checkpointing. The optional semantic layer. A semantic layer helps
the business understand its data. The semantic layer sits between the processing layer, which
serves up the data from the storage layer, and the consumption layer, which contains the tools and applications looking for data.
It acts like a translator that bridges the gap between the language of the business and the
technical terms used to describe data. It also helps both data professionals and business users
find relevant data for either end-user reports or dataset creation for AI/ML. In its simplest form,
the semantic layer could be a
data catalog or an organized inventory of data. A data catalog typically includes the original
data source location, lineage, schema, short description, and long description. A more robust
semantic layer can provide security, privacy, and governance by incorporating policies, controls,
and data quality rules.
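As a rough sketch of what such a catalog holds, the entry below captures the fields mentioned above; the field names and values are illustrative, not any specific product's schema.

```python
# Sketch of a minimal data catalog entry a semantic layer might maintain.
catalog_entry = {
    "table": "lake.demo.sales",
    "source": "s3a://raw-landing-zone/orders/",
    "lineage": ["raw_orders -> cleaned_orders -> sales"],
    "schema": {"id": "BIGINT", "amount": "DOUBLE", "order_ts": "TIMESTAMP"},
    "short_description": "Completed sales transactions",
    "long_description": "One row per completed sale, loaded daily from the "
                        "orders feed after data quality checks.",
    "owner": "analytics-engineering",
}
```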
This layer is optional. Organizations that have few data sources with well-structured feeds may
not need a semantic layer. A well-structured feed is a feed that contains intuitive field
names and accurate field descriptions that can be easily extracted from data sources and loaded
into the data warehouse. Well-structured feeds should also implement data quality checks at the source so that only quality data is transmitted to the modern data
lake. However, large organizations that have many data sources where metadata was an afterthought
when schemas and feeds were designed should consider implementing the semantic layer.
Many of the products that can be used in this layer provide features that help an organization
populate a metadata catalog. Also, organizations that operate in complex industries should consider a semantic
layer. For example, industries like financial services, healthcare and legal make heavy use
of terms that are not everyday words. When these domain-specific terms are used as table names and
field names, the underlying meaning of the data can be hard to ascertain. The consumption layer. Let's conclude our presentation of the modern data lake layers
by looking at the workloads run in the topmost layer, the consumption layer, and discussing how
the layers below support their specific use cases. Many of the workloads below are often used
interchangeably, as if they were synonymous. This is unfortunate because, when investigating their needs, it is better to have
precise definitions. In the discussion below, I will precisely describe each function and then
align it with the capabilities of the modern data lake. Applications. Custom applications can
programmatically send SQL queries to the modern data lake to provide custom views for end users.
These may be the same applications that submitted raw data as
data sources at the bottom of the diagram. A use case that should be supported by a modern
data lake is to allow applications to submit raw data, clean it, combine it with other data and
finally serve it up quickly. Applications may use models trained with data from the modern data lake.
This is another use case that the modern data lake should support. Applications should be able to send raw data to the modern data lake, get it processed,
and sent to model training pipelines. From there, the models can be used to make predictions within
the application. Data science is the study of data. Data scientists design the data sets and
potentially the models that will be trained and used for inference. Data scientists also use
techniques from mathematics and statistics for the purpose of feature engineering. Feature engineering is a technique
for improving datasets used to train a model. A very slick feature that modern data lakes possess
is zero-copy branching, which allows data to be branched the same way code can be branched within
a Git repository. As the name suggests, this feature does not make a copy
of the data, rather, it makes use of the metadata layer of the OpenTable format used to implement
the data warehouse to create the appearance of a unique copy of the data. Data scientists can
experiment with a branch, if their experiments are successful, then they can merge their branch
back into the main branch for other data scientists to use.
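A minimal sketch of this workflow with Apache Iceberg's branch DDL is shown below, assuming the Spark session and the lake.demo.sales table from the earlier sketches, a recent Iceberg version, and the Iceberg SQL extensions enabled; the branch name is illustrative.

```python
# Sketch of zero-copy branching on an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes the Iceberg/S3 config shown earlier

spark.sql("ALTER TABLE lake.demo.sales CREATE BRANCH experiment_a")

# Writes to the branch add only new metadata and data files; nothing is copied.
spark.sql("INSERT INTO lake.demo.sales.branch_experiment_a VALUES (2, 19.99)")

# Query the branch in isolation; the main branch is unaffected until the
# experiment is merged back (for example, via Iceberg's fast_forward procedure).
spark.sql("SELECT * FROM lake.demo.sales VERSION AS OF 'experiment_a'").show()
```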
Business intelligence is often retrospective, providing insights into past events. It involves the use of reporting tools, dashboards, and key performance indicators (KPIs) to provide a view
into business performance. Much of the data needed for BI is aggregations, which can require a fair
amount of compute to create. Data analytics, on the other hand,
involves the analysis of data to extract insights, identify trends, and make predictions.
It is more forward-looking and aims to understand why certain events occurred and what might happen
in the future. Data analytics overlaps data science in that it incorporates statistical
analysis and machine learning techniques. Machine learning. The
machine learning workload is where ML teams run their experiments and MLOps teams test and promote
models to production. There is often a considerable difference between the needs of
teams that are using machine learning for research and prototyping versus those that are putting
models into production on a regular basis. Teams only doing research and experimental work can
often get away with
minimal MLOps tooling, whereas those putting models into production will need considerably
more rigorous tools and processes. Security. The modern data lake must provide authentication and
authorization for users and services. It should also provide encryption for data at rest and data
in motion. This section will look into these aspects of security.
Both the data lake and the data warehouse must support an identity and access management (IAM) solution that facilitates authentication and authorization. Both halves of the modern
data lake should use the same directory service for keeping track of users and groups, allowing
users to present their corporate credentials when signing into the user interface for both
the data lake and the data warehouse. For programmatic access, since each product
requires a different connection type, the credentials that need to be presented for
authentication will be different. Likewise, the policies used for authorization will also be
different, as the underlying resources and actions are different. The data lake requires authorization
for buckets and objects as well as bucket and object actions. The data warehouse, on the other hand, needs tables and table-related
actions to be authorized. Data lake authentication. Every connection to the data lake requires
verification of identity and the data lake should integrate with the organization's identity
provider. Since the data lake is an object store that is S3-compliant,
the AWS Signature Version 4 protocol should be used. For programmatic access, this means that
each service wishing to access an administrative API or an S3 API, such as put, get, and delete
operations, must present a valid access key and secret key. Data lake authorization. Authorization is the act of restricting the actions and resources the authenticated client
can perform on the data lake.
An S3-compliant object store should use policy-based access control (PBAC), where each policy describes
one or more rules that outline the permissions of a user or group of users.
The data lake should support S3-specific actions and conditions when
creating policies. By default, MinIO denies access to actions or resources not explicitly referenced
in a user's assigned or inherited policies.
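The sketch below shows what such a PBAC document typically looks like in the S3/MinIO style: a policy granting read-only access to a single bucket, with everything not explicitly allowed denied by default. The bucket name is a placeholder.

```python
# Sketch of a policy-based access control (PBAC) document, S3/MinIO style.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::raw-landing-zone",
                "arn:aws:s3:::raw-landing-zone/*",
            ],
        }
    ],
}
# Attach this policy to a user or group; actions and resources not explicitly
# allowed remain denied by default.
```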
Data warehouse authentication. Similar to the data lake, every connection to the data warehouse must be authenticated, and the data warehouse should
integrate with the organization's identity provider for authenticating users. A data warehouse may provide the following options
for programmatic access: an ODBC connection, a JDBC connection, or a REST session. Each will require
an access token. Data warehouse authorization. A data warehouse should support user, group,
and role-level access controls for tables, views, and other objects
found in the data warehouse. This allows access to individual objects to be configured based on
either the user's ID, a group, or a role. Key Management Server
For security at rest and in transit, the modern data lake uses a key management server (KMS).
A KMS is a service that is responsible for generating, distributing, and managing
cryptographic keys used for encryption and decryption. Summary
There you have it, the five layers of a modern data lake from data sources to consumption.
This post explored a conceptual reference architecture for modern data lakes.
The goal? To provide organizations with a strategic blueprint for building a platform
that efficiently manages and extracts value from their vast and diverse datasets.
The modern data lake combines the strengths of traditional data warehouses and flexible data
lakes, offering a unified and scalable solution for storing, processing, and analyzing data.
If you would like to go deeper with the team at MinIO on what components are recommended,
feel free to reach out to us at hello at min.io. Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.