The Good Tech Companies - The Architect’s Guide to A Modern Datalake Reference Architecture
Episode Date: June 7, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/the-architects-guide-to-a-modern-datalake-reference-architecture. A Modern Datalake is one-half Data Warehouse and one-half Data Lake. It uses object storage for scalability and performance, and Open Table Formats (OTFs) make it seamless for object storage to serve as the underlying storage solution for a data warehouse. Layering provides a clear way to group services that provide similar functionality. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #modern-datalake, #object-storage, #open-table-formats, #software-development, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
The Architect's Guide to a Modern Data Lake Reference Architecture, by MinIO.
An abbreviated version of this post appeared on The New Stack on March 26, 2024.
Businesses aiming to maximize their data assets are adopting scalable,
flexible, and unified data storage and analytics approaches.
This trend is driven by enterprise architects
tasked with crafting infrastructures that align with evolving business demands.
A modern data lake architecture addresses this need by integrating the scalability and flexibility
of a data lake with the structure and performance optimizations of a data warehouse.
This post serves as a guide to understanding and implementing a modern data lake architecture.
What is a modern data lake? A modern data lake is one half data warehouse and one half data lake, and it uses object storage for everything. This may sound like a marketing trick, putting two products in one package and calling it a new product, but the data warehouse presented in this post is better than a conventional data warehouse because it uses object storage and therefore provides all the benefits of object storage in terms of scalability
and performance. Organizations that adopt this approach will only pay for what they need,
facilitated by the scalability of object storage, and if blazing speed is needed,
they can equip their underlying object store with NVMe drives connected by a high-end network.
The use of object storage in this fashion has been made possible by the rise of open table formats
(OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake, which are specifications that,
once implemented, make it seamless for object storage to be used as the underlying storage
solution for a data warehouse. These specifications also provide features that may
not exist in a conventional data warehouse, for example, snapshots, also known as time travel,
schema evolution, partitions, partition evolution, and zero-copy branching.
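The snippet below is a minimal sketch of what this looks like in practice: a PySpark session configured to use an Apache Iceberg catalog backed by S3-compatible object storage, followed by a time-travel query against a snapshot. The endpoint, bucket, credentials, table name, and jar versions are placeholders, and the Iceberg Spark runtime plus an s3a filesystem jar are assumed to be on the classpath; treat this as one possible starting point, not a prescribed setup.

```python
# Sketch: an Open Table Format (Iceberg) table on S3-compatible object storage.
# Endpoint, credentials, and names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("otf-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed endpoint
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Create an Iceberg table directly on object storage and write to it.
spark.sql("CREATE TABLE IF NOT EXISTS lake.demo.sales (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO lake.demo.sales VALUES (1, 9.99)")

# Time travel (snapshots): query the table as of an earlier snapshot.
first = spark.sql(
    "SELECT snapshot_id FROM lake.demo.sales.snapshots ORDER BY committed_at"
).first()
spark.sql(f"SELECT * FROM lake.demo.sales VERSION AS OF {first['snapshot_id']}").show()
```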
But, as stated above, the modern data lake is more than just a fancy data warehouse.
It also contains a data lake for unstructured data. The OTFs also
provide integration to external data in the data lake. This integration allows external data to be
used as a SQL table if needed, or the external data can be transformed and routed to the data
warehouse using high-speed processing engines and familiar SQL commands. So the modern data lake is
more than just a data warehouse and a data lake in
one package with a different name. Collectively, they provide more value than what can be found
in a conventional data warehouse or a standalone data lake. A conceptual architecture. Layering is a
convenient way to present all the components and services needed by the modern data lake.
Layering provides a clear way to group services that provide similar functionality.
It also allows for a hierarchy to be established, with consumers on top and data sources,
with their raw data, on the bottom. The layers of the modern data lake, from top to bottom, are as follows.
The consumption layer contains the tools used by power users to analyze data. It also contains applications and AI/ML workloads that will programmatically access the modern data lake.
The semantic layer is an optional metadata layer for data discovery and governance.
The processing layer contains the compute clusters needed to query the modern data lake. It also contains compute clusters used for distributed model training. Complex transformations can occur in the processing layer by taking advantage of the storage layer's integration between the data lake and the data warehouse.
In the storage layer, object storage is the primary storage service for the modern data lake. However, MLOps tools may need other storage services, such as relational databases, and if you are pursuing generative AI, you will need a vector database.
The ingestion layer contains the services needed to receive data. Advanced ingestion layers will be able to retrieve data based on a schedule. The modern data lake should support a variety of protocols, and it should also support data arriving in streams and batches. Simple and complex data transformations can occur in the ingestion layer.
The data sources layer is technically not a part of the modern data lake solution, but it is included in this post because a well-constructed modern data lake must support a variety of data sources with varying capabilities for sending data.
The diagram below visually
depicts all the layers described above and all the capabilities that may be needed to implement
these layers. This is an end-to-end architecture where the heart of the platform is a modern data
lake. Rather than focusing on just the processing layer and the storage layer, this architecture
also shows components needed to ingest, transform, discover, govern, and consume
data. The tools needed to support important use cases that depend on a modern data lake are also
included, such as MLOPs storage, vector databases, and machine learning clusters. The conceptual
nature of the approach used in this post is important. If the diagram above made use of
product names, then meaning would be lost.
Product names are rarely chosen for meaning; rather, they are chosen for brand awareness and memory retention. To this end, our conceptual architecture uses simple nouns where the feature
provided is intuitive. The next section will provide an example of a concrete implementation
for the reader familiar with the more popular big data projects and products in the market today. However, the reader is encouraged to refer to the conceptual diagram when making
decisions for their organization. Finally, there are no arrows. Arrows typically depict data flow
and dependencies. Showing all possible data flows and dependencies would unnecessarily complicate
the diagram. A better approach is to look at data flow and
dependencies in the context of a use case. Once a few components are isolated in the context of a
use case, then data flow and dependencies can be more clearly illustrated. A concrete architecture
The purpose of this section is to ground the design of our reference architecture with concrete
open-source examples. For the architect eager to dive in and start building,
the projects and products shown below are free to use in a proof of concept.
When your POC graduates to a funded project that will one day run in production,
then be sure to check open-source licenses and terms of use for all software used in your POC.
A few words on data sources The applications, devices,
and vendors that feed your modern data lake come in a variety of flavors, and so does their data. On-premises modern applications may be able to stream well-structured data in real time using formats such as Avro and Parquet.
On the other hand, older legacy applications may only be able to send simple files in batches,
such as XML, JSON, and CSVs. Data vendors may not send data at all,
expecting their customers to retrieve data. Mobile apps, websites, IoT devices, and social media apps
will typically send application logs and other telemetry usage statistics to your ingestion layer.
Log analytics is a popular use case for a modern data lake. Additionally, they may send images and audio files to be used within AI/ML workloads.
Finally, organizations looking to take advantage of generative AI will need to store documents
found in file shares and portals such as SharePoint portal server and Confluence in the data lake.
The modern data lake needs to be able to interface with all these data sources efficiently and
reliably, getting the data to either data lake storage or data warehouse storage.
Onboarding data is the primary purpose of the ingestion layer of our architecture.
This requires your ingestion layer to support a variety of protocols capable of receiving
streamed data and batched data. Let's investigate the components of this layer next.
The ingestion layer. The ingestion
layer is the on-ramp to your modern data lake. It is responsible for ingesting data into the
data storage layer. Structured data from sources that designed their feeds for the data warehouse
side of the modern data lake can bypass the data lake and send their data directly to the data
warehouse. On the other hand, sources that did not design their feeds in such a fashion will need to
have their data sent to the data lake, where it can be transformed before being ingested into the
data warehouse. The ingestion layer should be able to receive and retrieve data. Internal lines of
business (LOB) applications may have been given the mandate to send their data via streaming or
batching. For these applications, the ingestion layer needs to provide an endpoint for receiving
the data. However, data vendors and other external data sources may not be so willing to deliver
data. The ingestion layer should also provide scheduled retrieval capabilities. For example,
a data vendor may provide new datasets at the first of every month. Scheduled retrieval
capabilities will allow the ingestion layer to connect and download data at the correct time.
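As a rough illustration of scheduled retrieval, the sketch below polls for the first of the month and lands a vendor file in the data lake. The functions fetch_from_vendor and land_in_lake are hypothetical placeholders, and a real deployment would delegate the scheduling to cron, Airflow, or a similar tool rather than a sleep loop.

```python
# Sketch of scheduled retrieval: pull a vendor dataset on the first of each
# month and land it in the data lake. Both helper functions are placeholders.
import time
from datetime import datetime, timezone

def fetch_from_vendor() -> bytes:
    """Placeholder: download this month's dataset from the vendor."""
    return b""

def land_in_lake(key: str, payload: bytes) -> None:
    """Placeholder: write the payload to data lake storage under the given key."""
    ...

while True:
    now = datetime.now(timezone.utc)
    if now.day == 1:
        land_in_lake(f"vendor/{now:%Y-%m}.csv", fetch_from_vendor())
    time.sleep(24 * 60 * 60)   # check once a day; a real scheduler replaces this loop
```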
Streaming is the best way to transmit data to a modern data lake, or to any destination for that matter. Streaming implies the use of a messaging service
deployed in a way that makes it resilient, available and highly performant. The messaging
service usually provides a queuing mechanism that acknowledges the receipt of a message
only upon successful storage of the message. The service then provides exactly-once delivery to a downstream
service that is responsible for saving the data in the message to either the data warehouse or
data lake. Note: some messaging services provide at-least-once delivery, requiring downstream
services to implement idempotent updates to the data source.
It is important to check the fine print of the service you end up using.
What is especially nice about this style of ingestion is that if the downstream service fails and does not acknowledge the successful processing of a message then the message will
reappear in the queue for future ingestion. Messaging services also provide dead letter
queues for messages that repeatedly fail.
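The sketch below illustrates this style of ingestion with kafka-python: messages are committed only after a successful write, and duplicates are skipped so that at-least-once delivery still produces idempotent updates. The topic, broker address, and save_to_lake function are assumptions, and the in-memory idempotency set stands in for a durable store.

```python
# Sketch of at-least-once consumption with idempotent handling (kafka-python).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                          # assumed topic name
    bootstrap_servers="kafka:9092",    # assumed broker
    group_id="lake-ingest",
    enable_auto_commit=False,          # acknowledge only after a successful write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

seen_ids = set()                       # stand-in for a durable idempotency store

def save_to_lake(event: dict) -> None:
    """Placeholder: write the event to data lake or data warehouse storage."""
    ...

for message in consumer:
    event = message.value
    if event["id"] in seen_ids:        # idempotent update: skip duplicates
        consumer.commit()
        continue
    save_to_lake(event)
    seen_ids.add(event["id"])
    consumer.commit()                  # commit only after storage succeeds
```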
Streaming ingestion is great, but in many cases, real-time insights are not needed.
In these situations, batch or mini-batch processing works fine and can be considerably simpler
to implement.
For batch uploads, the S3 API is your best option.
MinIO is S3-compliant, and any data source currently sending batch data to an S3
endpoint will work as-is, with only a connection change, once you switch over to the MinIO data lake.
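A minimal sketch of that connection change is shown below using boto3; the endpoint, bucket, object key, and credentials are placeholders.

```python
# Sketch: a batch upload to an S3-compatible endpoint such as MinIO via boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",   # assumed MinIO endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Existing S3-based batch jobs only need the endpoint/credentials change above;
# the upload call itself is unchanged.
s3.upload_file("daily_extract.csv", "raw-landing-zone", "2024/06/07/daily_extract.csv")
```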
However, many organizations may still prefer FTP/SFTP for its simplicity and ability to
run in highly constrained environments. MinIO also has support for FTP and SFTP.
This interface allows a data source to send data to
MinIO the same way it would send data to an FTP server. From an application or user's perspective,
moving data onto MinIO using SFTP is seamless since everything is essentially the same,
from policies, security, etc. The data storage layer.
The data storage layer is the bedrock that all other
layers depend upon. Its purpose is to store data reliably and serve it efficiently. There will be
an object storage service for the data lake side of the modern data lake, and there will be an object
storage service for the data warehouse. These two object storage services can be combined into one
physical instance of an object store if needed by using buckets to keep data warehouse storage separate from data lake storage. However, consider keeping them
separate and installed on different hardware if the processing layer will be putting different
workloads on these two storage services. For example, a common data flow is to have all new
data land in the data lake. Once in the data lake, it can be transformed and ingested into the data warehouse,
where it can be consumed by other applications and used for the purpose of data science,
business intelligence, and data analytics. If this is your data flow, then your modern data
lake will be putting more load on your data warehouse, and you will want to make sure it
is running on high-end hardware, storage devices, storage clusters, and network.
External table functionality allows
data warehouses and processing engines to read objects in the data lake as if they were SQL
tables. If the data lake is used as the landing zone for raw data, then this capability, along
with the data warehouse SQL capabilities, can be used to transform raw data before inserting it
into the data warehouse. Alternatively, the external table could be used, as is, and joined with other tables and resources
inside the data warehouse without it ever leaving the data lake. This pattern can help save on
migration costs and can overcome some data security concerns by keeping the data in one
place while, at the same time, making it available to outside services.
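A sketch of this pattern is shown below, assuming a Spark session already configured as in the earlier sketch; the raw object path and the target table lake.demo.orders are illustrative and assumed to exist. Raw objects in the data lake are exposed as a SQL view, lightly transformed, and inserted into a warehouse table.

```python
# Sketch: read data lake objects as a SQL table, transform, load to the warehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes the Iceberg/S3 config shown earlier

raw = spark.read.parquet("s3a://raw-landing-zone/orders/")   # raw data lake objects
raw.createOrReplaceTempView("raw_orders")

spark.sql("""
    INSERT INTO lake.demo.orders
    SELECT id, CAST(amount AS DOUBLE) AS amount, order_ts
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
```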
Most MLOps tools use a combination of an object store and a relational database. For example, an MLOps tool should store training metrics, hyperparameters,
model checkpoints, and dataset versions. Models and datasets should be stored in the data lake,
while metrics and hyperparameters will be more efficiently stored in a relational database.
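The sketch below shows that split using MLflow as one example of such a tool: parameters and metrics go to a tracking backend (commonly a relational database), while artifacts such as model checkpoints go to an object store. The tracking URI and file names are placeholders, and the artifact store location is assumed to be configured separately to point at the data lake.

```python
# Sketch of the metrics/artifacts split with MLflow (URIs are placeholders).
import mlflow

mlflow.set_tracking_uri("postgresql://mlflow:mlflow@db:5432/mlflow")  # assumed backend store

with mlflow.start_run():
    mlflow.log_param("learning_rate", 3e-4)      # hyperparameter -> relational store
    mlflow.log_metric("val_accuracy", 0.91)      # metric -> relational store
    mlflow.log_artifact("model.ckpt")            # checkpoint -> object storage artifact store
```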
If you are pursuing generative AI, you will need to build a custom corpus for your organization.
It should contain documents with knowledge that no one else has and only documents that
are true and accurate should be used.
Furthermore, your custom corpus should be built with a vector database.
A vector database indexes, stores, and provides access to
your documents alongside their vector embeddings, which are numerical representations of your
documents. Vector databases facilitate semantic search, which is needed for retrieval augmented
generation, a technique utilized by generative AI to marry information in your custom corpus to an
LLM's trained parametric memory.
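The sketch below illustrates the core idea of semantic search over a custom corpus: documents and their vector embeddings are indexed, and a query is matched by cosine similarity before the retrieved text is handed to the LLM. The embed function is a placeholder for a real embedding model, and a real vector database would persist and index these vectors rather than keep them in memory.

```python
# Sketch of semantic search for retrieval augmented generation (RAG).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector embedding for the given text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

corpus = {
    "policy.md": "Internal travel reimbursement policy ...",
    "runbook.md": "Steps to restore the payments service ...",
}
index = {name: embed(text) for name, text in corpus.items()}

def search(query: str, top_k: int = 1) -> list[str]:
    q = embed(query)
    scores = {
        name: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for name, v in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# The retrieved documents are passed to the LLM as context alongside the prompt.
print(search("How do I get a flight refunded?"))
```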
The processing layer. The processing layer contains the compute needed for all the workloads supported by the modern
data lake. At a high level, compute comes in two flavors, processing engines for the data warehouse
and clusters for distributed machine learning. The data warehouse processing engine supports
the distributed execution of SQL commands against the data in data warehouse storage.
Transformations that are a part of the ingestion process may also need the compute power in the
processing layer. For example, some data warehouses may wish to use a medallion architecture.
Others may choose a star schema with dimensional tables. These designs often require substantial
ETL against the raw data during ingestion.
The data warehouse used within a modern data lake disaggregates compute from storage.
So, if needed, multiple processing engines can exist for a single data warehouse data store.
This differs from a conventional relational database where compute and storage are tightly coupled, and there is one compute resource for every storage device. A possible
design for your processing layer is to set up one processing engine for each entity in the
consumption layer. For example, a processing cluster for business intelligence, a separate
cluster for data analytics, and yet another for data science. Each processing engine would query
the same data warehouse storage service; however, since each team has its own
dedicated cluster, they do not compete with each other for compute. If the business intelligence
team is running month-end reports that are compute intensive, then they will not interfere with
another team that may be running daily reports. Machine learning models, especially large language
models, can be trained faster if training is done in a distributed fashion. The machine learning
cluster supports distributed training. Distributed training should be integrated with an MLOPS tool
for experiment tracking and checkpointing. The optional semantic layer. A semantic layer helps
the business understand its data. The semantic layer sits between the processing layer, which
serves up the data from the storage layer, and the consumption layer, which contains the tools and applications looking for data.
It acts like a translator that bridges the gap between the language of the business and the
technical terms used to describe data. It also helps both data professionals and business users
find relevant data for either end-user reports or dataset creation for AI/ML. In its simplest form,
the semantic layer could be a
data catalog or an organized inventory of data. A data catalog typically includes the original
data source location, lineage, schema, short description, and long description. A more robust
semantic layer can provide security, privacy, and governance by incorporating policies, controls,
and data quality rules.
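As a rough sketch of what such a catalog holds, the entry below captures the fields mentioned above; the field names and values are illustrative, not any specific product's schema.

```python
# Sketch of a minimal data catalog entry a semantic layer might maintain.
catalog_entry = {
    "table": "lake.demo.sales",
    "source": "s3a://raw-landing-zone/orders/",
    "lineage": ["raw_orders -> cleaned_orders -> sales"],
    "schema": {"id": "BIGINT", "amount": "DOUBLE", "order_ts": "TIMESTAMP"},
    "short_description": "Completed sales transactions",
    "long_description": "One row per completed sale, loaded daily from the "
                        "orders feed after data quality checks.",
    "owner": "analytics-engineering",
}
```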
This layer is optional. Organizations that have few data sources with well-structured feeds may
not need a semantic layer. A well-structured feed is a feed that contains intuitive field
names and accurate field descriptions that can be easily extracted from data sources and loaded
into the data warehouse. Well-structured feeds should also implement data quality checks at the source so that only quality data is transmitted to the modern data
lake. However, large organizations that have many data sources where metadata was an afterthought
when schemas and feeds were designed should consider implementing the semantic layer.
Many of the products that can be used in this layer provide features that help an organization
populate a metadata catalog. Also, organizations that operate in complex industries should consider a semantic
layer. For example, industries like financial services, healthcare and legal make heavy use
of terms that are not everyday words. When these domain-specific terms are used as table names and
field names, the underlying meaning of the data can be hard to ascertain. The consumption layer. Let's conclude our presentation of the modern data lake layers
by looking at the workloads run in the topmost layer, the consumption layer, and discussing how
the layers below support their specific use cases. Many of the workloads below are often used
interchangeably, as if they were synonymous. This is unfortunate because, when investigating their needs, it is better to have
precise definitions. In the discussion below, I will precisely describe each function and then
align it with the capabilities of the modern data lake. Applications. Custom applications can
programmatically send SQL queries to the modern data lake to provide custom views for end users.
These may be the same applications that submitted raw data as
data sources at the bottom of the diagram. A use case that should be supported by a modern
data lake is to allow applications to submit raw data, clean it, combine it with other data and
finally serve it up quickly. Applications may use models trained with data from the modern data lake.
This is another use case that the modern data lake should support. Applications should be able to send raw data to the modern data lake, get it processed,
and sent to model training pipelines. From there, the models can be used to make predictions within
the application. Data science is the study of data. Data scientists design the data sets and
potentially the models that will be trained and used for inference. Data scientists also use
techniques from mathematics and statistics for the purpose of feature engineering. Feature engineering is a technique
for improving datasets used to train a model. A very slick feature that modern data lakes possess
is zero-copy branching, which allows data to be branched the same way code can be branched within
a Git repository. As the name suggests, this feature does not make a copy
of the data, rather, it makes use of the metadata layer of the OpenTable format used to implement
the data warehouse to create the appearance of a unique copy of the data. Data scientists can
experiment with a branch, if their experiments are successful, then they can merge their branch
back into the main branch for other data scientists to use.
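A minimal sketch of this workflow with Apache Iceberg's branch DDL is shown below, assuming the Spark session and the lake.demo.sales table from the earlier sketches, a recent Iceberg version, and the Iceberg SQL extensions enabled; the branch name is illustrative.

```python
# Sketch of zero-copy branching on an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes the Iceberg/S3 config shown earlier

spark.sql("ALTER TABLE lake.demo.sales CREATE BRANCH experiment_a")

# Writes to the branch add only new metadata and data files; nothing is copied.
spark.sql("INSERT INTO lake.demo.sales.branch_experiment_a VALUES (2, 19.99)")

# Query the branch in isolation; the main branch is unaffected until the
# experiment is merged back (for example, via Iceberg's fast_forward procedure).
spark.sql("SELECT * FROM lake.demo.sales VERSION AS OF 'experiment_a'").show()
```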
Business intelligence is often retrospective, providing insights into past events. It involves the use of reporting tools, dashboards, and key performance indicators (KPIs) to provide a view
into business performance. Much of the data needed for BI is aggregations, which can require a fair
amount of compute to create. Data analytics, on the other hand,
involves the analysis of data to extract insights, identify trends, and make predictions.
It is more forward-looking and aims to understand why certain events occurred and what might happen
in the future. Data analytics overlaps data science in that it incorporates statistical
analysis and machine learning techniques. Machine learning. The
machine learning workload is where ML teams run their experiments and MLOps teams test and promote
models to production. There is often a considerable difference between the needs of
teams that are using machine learning for research and prototyping versus those that are putting
models into production on a regular basis. Teams only doing research and experimental work can
often get away with
minimal MLOps tooling, whereas those putting models into production will need considerably
more rigorous tools and processes. Security. The modern data lake must provide authentication and
authorization for users and services. It should also provide encryption for data at rest and data
in motion. This section will look into these aspects of security.
Both the data lake and the data warehouse must support an identity and access management (IAM) solution that facilitates authentication and authorization. Both halves of the modern
data lake should use the same directory service for keeping track of users and groups, allowing
users to present their corporate credentials when signing into the user interface for both
the data lake and the data warehouse. For programmatic access, since each product
requires a different connection type, the credentials that need to be presented for
authentication will be different. Likewise, the policies used for authorization will also be
different, as the underlying resources and actions are different. The data lake requires authorization
for buckets and objects as well as bucket and object actions. The data warehouse, on the other hand, needs tables and table-related
actions to be authorized. Data lake authentication. Every connection to the data lake requires
verification of identity and the data lake should integrate with the organization's identity
provider. Since the data lake is an object store that is S3-compliant,
the AWS Signature Version 4 protocol should be used. For programmatic access, this means that
each service wishing to access an administrative API or an S3 API, such as put, get, and delete
operations, must present a valid access key and secret key. Data lake authorization. Authorization is the act of restricting the actions and resources the authenticated client
can perform on the data lake.
An S3-compliant object store should use policy-based access control (PBAC), where each policy describes
one or more rules that outline the permissions of a user or group of users.
The data lake should support S3-specific actions and conditions when
creating policies. By default, MinIO denies access to actions or resources not explicitly referenced
in a user's assigned or inherited policies.
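The sketch below shows what such a PBAC document typically looks like in the S3/MinIO style: a policy granting read-only access to a single bucket, with everything not explicitly allowed denied by default. The bucket name is a placeholder.

```python
# Sketch of a policy-based access control (PBAC) document, S3/MinIO style.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::raw-landing-zone",
                "arn:aws:s3:::raw-landing-zone/*",
            ],
        }
    ],
}
# Attach this policy to a user or group; actions and resources not explicitly
# allowed remain denied by default.
```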
Data warehouse authentication. Similar to the data lake, every connection to the data warehouse must be authenticated, and the data warehouse should
integrate with the organization's identity provider for authenticating users. A data warehouse may provide the following options
for programmatic access: an ODBC connection, a JDBC connection, or a REST session. Each will require
an access token. Data warehouse authorization. A data warehouse should support user, group,
and role-level access controls for tables, views, and other objects
found in the data warehouse. This allows access to individual objects to be configured based on
either the user's ID, a group, or a role. Key Management Server
For security at rest and in transit, the modern data lake uses a key management server (KMS).
A KMS is a service that is responsible for generating, distributing, and managing
cryptographic keys used for encryption and decryption. Summary
There you have it, the five layers of a modern data lake from data sources to consumption.
This post explored a conceptual reference architecture for modern data lakes.
The goal? To provide organizations with a strategic blueprint for building a platform
that efficiently manages and extracts value from their vast and diverse datasets.
The modern data lake combines the strengths of traditional data warehouses and flexible data
lakes, offering a unified and scalable solution for storing, processing, and analyzing data.
If you would like to go deeper with the team at MinIO on what components are recommended,
feel free to reach out to us at hello at min.io. Thank you for listening to this HackerNoon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.