The Good Tech Companies - An Architect’s Guide to Building Reference Architecture for an AI/ML Datalake
Episode Date: June 12, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/an-architects-guide-to-building-reference-architecture-for-an-aiml-datalake. Organizations should not build an infrastructure dedicated to AI and AI only while leaving other workloads to fend for themselves. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #datalake-for-ai, #aiml, #artificial-intelligence, #machine-learning, #modern-datalake, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. Organizations should not build an infrastructure dedicated to AI and AI only while leaving workloads like Business Intelligence, Data Analytics, and Data Science to fend for themselves.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
An Architect's Guide to Building Reference Architecture for an AI/ML Data Lake, by MinIO.
An abbreviated version of this post appeared on The New Stack on March 19, 2024.
In enterprise artificial intelligence, there are two main types of models,
discriminative and generative. Discriminative models are used to classify
or predict data, while generative models are used to create new data. Even though generative AI has
dominated the news of late, organizations are still pursuing both types of AI. Discriminative
AI still remains an important initiative for organizations that want to operate more efficiently
and pursue additional revenue streams. These different types of AI have a lot in common, but at the same time, there are significant differences
that must be taken into account when building your AI data infrastructure. Organizations should
not build an infrastructure dedicated to AI and AI only while leaving workloads like business
intelligence, data analytics, and data science to fend for themselves. It is possible to build
a complete data infrastructure that supports all the needs of the organization, business intelligence,
data analytics, data science, discriminative AI, and generative AI. In another post,
we presented a reference architecture for a modern data lake capable of serving the needs
of business intelligence, data analytics, data science, and AI/ML. Let's review the modern data lake reference
architecture and highlight the capabilities it has for supporting AI/ML workloads. The modern data
lake. Let's start by defining a modern data lake, as it will serve as the foundation for our
reference architecture. This architecture is not recycled; rather, it reflects engineering first
principles that are broadly applicable.
A modern data lake is one half data warehouse and one half data lake and uses object storage for everything. Using object storage for a data lake makes perfect sense as object storage is for
unstructured data, which is what a data lake is meant to store. However, using object storage
for a data warehouse may sound odd, but a data warehouse built this way represents the next generation of data warehouses.
This is made possible by the Open Table Format Specifications, OTFs, authored by Netflix,
Uber, and Databricks, which make it seamless to employ object storage within a data warehouse.
The OTFs are Apache Iceberg, Apache Hudi, and Delta Lake. They were authored by Netflix,
Uber, and Databricks, respectively, because there were no products on the market that
could handle their data needs. Essentially, what they all do, in different ways,
is define a data warehouse that can be built on top of object storage. MinIO
Object storage provides the combination of scalable capacity and high performance that
other storage solutions cannot. Since these are modern specifications, they have advanced
features that old-fashioned data warehouses do not have, such as partition evolution,
schema evolution, and zero-copy branching. Finally, since the data warehouse is built
with object storage, you can use this same object store for unstructured data like images, video files,
audio files, and documents. Unstructured data is usually stored in what the industry calls a data
lake. Using an object store as the foundation for both your data lake and your data warehouse
results in a solution capable of holding all your data. Structured storage resides in the
OTF-based data warehouse and unstructured storage lives in the data lake.
The same instance of MinIO could be used for both. At MinIO, we call this combination of an
OTF-based data warehouse and a data lake the modern data lake, and we see it as the foundation
for all your AI/ML workloads. It's where data is collected, stored, processed, and transformed.
Training models using discriminative AI,
supervised, unsupervised, and reinforcement learning, often requires a storage solution
that can handle structured data that can live in the data warehouse. On the other hand,
if you are training large language models, LLMs, you must manage unstructured data or
documents in their raw and processed form in the data lake. (Source: the modern data lake
reference architecture diagram.) This post focuses on those areas of the modern data lake reference
architecture for AI/ML that support the different AI/ML workloads. These functional areas are
listed below. A visual depiction of the modern data lake is shown above. The layers in which
these functional areas can be found have been highlighted.
Discriminative AI: storage for unstructured data, storage for semi-structured data, and
zero-copy branching in the data warehouse. Generative AI: building a custom corpus with a vector database, building a document pipeline, retrieval augmented generation (RAG), fine-tuning
large language models, and measuring LLM accuracy.
Machine learning operations (MLOps). This post also looks at the current state of GPUs and how they impact
your AI data infrastructure. We will also look at a couple of scenarios that illustrate how to
build your infrastructure and how not to build your infrastructure. Finally, this post presents
a few recommendations for building an AI data infrastructure of your own.
These later sections cover the current state of GPUs, the hungry GPU problem, supercharging object storage,
a tale of two organizations, and a plan for building your AI data infrastructure.
Discriminative AI. Discriminative AI models require data of all types for training.
Models for image classification and speech recognition
will use unstructured data in the form of images and audio files. On the other hand,
models for fraud detection and medical diagnosis make predictions based on structured data.
Let's look at options available within the modern data lake for storing and manipulating the data
needed by discriminative AI. Storage for unstructured data. Unstructured data will reside
in the data lake where it can be used for training and testing models. Training sets that can fit
into memory can be loaded prior to training, before your epoch loop starts. However, if your training
set is large and will not fit into memory, you will have to load a list of objects before training
and retrieve the actual objects while processing each batch in your epoch loop. This could put a strain on your data lake if you do not build
your data lake using a high-speed network and high-speed disk drives. If you are training
models with data that cannot fit into memory, then consider building your data lake with a
100 GB network and NVMe drives.
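To make that loading pattern concrete, here is a minimal sketch, assuming the MinIO Python SDK and PyTorch (neither is prescribed by this post); the endpoint, credentials, and bucket names are placeholders.

```python
# A minimal sketch: list object keys up front, then fetch each object lazily per
# batch, assuming the MinIO Python SDK and PyTorch. Endpoint, credentials, and
# bucket names are hypothetical placeholders.
from minio import Minio
from torch.utils.data import Dataset

client = Minio("minio.example.internal:9000",
               access_key="ACCESS_KEY", secret_key="SECRET_KEY", secure=False)

class ObjectStoreDataset(Dataset):
    """Streams raw objects from the data lake instead of preloading them."""

    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        # Load only the list of objects before training starts.
        self.keys = [obj.object_name
                     for obj in client.list_objects(bucket, prefix=prefix, recursive=True)]

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int) -> bytes:
        # Retrieve the actual object while processing each batch in the epoch loop.
        response = client.get_object(self.bucket, self.keys[idx])
        try:
            return response.read()  # a real dataset would decode this and return a label too
        finally:
            response.close()
            response.release_conn()
```

Whether such a dataset keeps your GPUs busy depends on the network and drive speeds discussed above, which is why the object store's throughput matters as much as its capacity.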
Storage for semi-structured data. There are a few options available within the modern
data lake for storing semi-structured files like Parquet files, Avro files, JSON files, and even CSV
files. The easiest thing to do is store them in your data lake and load them the same way you load
unstructured objects. If the data in these semi-structured files is not needed by other
workloads that the modern data lake supports (business intelligence, data analytics, and data science), then this is the best option. Another
option is to load these files into your data warehouse where other workloads can use them.
When data is loaded into your data warehouse, you can use zero-copy branching to perform experiments
with your data. Zero-copy branching in the data warehouse. Feature
engineering is a technique for improving datasets used to train a model. A very slick feature that
OTF-based data warehouses possess is zero-copy branching. This allows data to be branched the
same way code can be branched within a Git repository. As the name suggests, this feature
does not make a copy of the data, rather, it makes use of the
metadata layer of the open table format used to implement the data warehouse to create the
appearance of a unique copy of the data. Data scientists can experiment with a branch;
if their experiments are successful, then they can merge their branch back into the main branch
for other data scientists to use. If the experiment is not successful, then the branch can be deleted.
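As a rough illustration of this workflow, the sketch below assumes an Apache Iceberg table and a Spark session configured with Iceberg's catalog and SQL extensions; the catalog, table, and branch names are placeholders, and the exact branching DDL varies by open table format and version.

```python
# A hedged sketch of zero-copy branching against a hypothetical Iceberg table in
# a Spark catalog named "lake". (Iceberg catalog and SQL-extension configuration
# is omitted for brevity; branch DDL syntax can differ across versions.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Create a branch: no data is copied, only metadata pointers are added.
spark.sql("ALTER TABLE lake.sales.transactions CREATE BRANCH experiment_1")

# Experiment against the branch; the main branch is untouched.
spark.sql("SELECT * FROM lake.sales.transactions VERSION AS OF 'experiment_1'").show()

# If the experiment succeeds, merge it back into main (for example via Iceberg's
# fast_forward procedure); otherwise, simply drop the branch.
# spark.sql("CALL lake.system.fast_forward('sales.transactions', 'main', 'experiment_1')")
spark.sql("ALTER TABLE lake.sales.transactions DROP BRANCH experiment_1")
```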
Generative AI. All models, whether they are small models built with scikit-learn,
custom neural networks built with PyTorch or TensorFlow, or large language models based on
the transformer architecture, require numbers as inputs and produce numbers as outputs.
This simple fact places a few additional requirements on your AI/ML infrastructure if you
are interested in generative AI, where words have to be turned into numbers or vectors, as we shall
see. A generative AI solution gets even more complicated if you want to use private documents
that contain your company's proprietary knowledge to enhance the answers produced by the LLMs.
This enhancement could be in the form of retrieval augmented generation or
LLM fine-tuning. This section will discuss all these techniques, turning words into numbers,
RAG, and fine-tuning, and their impact on AI infrastructure. Let's start by discussing how
to build a custom corpus and where it should reside. Creating a custom corpus with a vector
database. If you are serious about generative AI, then your custom corpus should define your organization.
It should contain documents with knowledge that no one else has and only contain true
and accurate information. Furthermore, your custom corpus should be built with a vector database.
A vector database indexes, stores, and provides access to your documents alongside their vector
embeddings, which are the numerical representations of your documents. This solves the number problem
described above. Vector databases facilitate semantic search. How this is done requires a
lot of mathematical background and is complicated. However, semantic search is conceptually easy to
understand. Let's say you want to find all documents that discuss anything related to artificial intelligence. To do this on a conventional database, you would need to
search for every possible abbreviation, synonym, and related term of artificial intelligence.
Your query would look something like the brute-force filter in the sketch below. Not only is this manual similarity search arduous
and prone to error, but the search itself is very slow. A vector database can take a natural-language request, like the one in the same sketch,
and run the query faster and with greater accuracy. The ability to run semantic queries
quickly and accurately is important if you wish to use retrieval augmented generation.
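To make the contrast concrete, here is a hypothetical illustration; the table, column, and embedding values are made up, and a real system would use an embedding model and a vector database rather than the toy in-memory ranking shown.

```python
# The brute-force approach on a conventional database enumerates synonyms by hand
# (hypothetical table and column names):
keyword_query = """
SELECT snippet FROM documents
WHERE text LIKE '%artificial intelligence%'
   OR text LIKE '%AI%'
   OR text LIKE '%machine learning%'
   OR text LIKE '%ML%'
   -- ...and every other abbreviation, synonym, and related term you can think of
"""

# A vector database instead ranks documents by embedding similarity, so a
# natural-language request such as "Show me everything related to artificial
# intelligence" is enough. Cosine similarity over embeddings is the core idea:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus_embeddings = {               # doc_id -> embedding (toy 3-d vectors)
    "doc-1": np.array([0.9, 0.1, 0.0]),
    "doc-2": np.array([0.1, 0.8, 0.3]),
}
query_embedding = np.array([0.8, 0.2, 0.1])   # embedding of the natural-language request

ranked = sorted(corpus_embeddings.items(),
                key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                reverse=True)
print(ranked[0][0])   # the most semantically relevant document
```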
Another important consideration for your custom corpus is security. Access to documents should
honor access restrictions on the original documents. It would be unfortunate if an
intern could gain access to the CFO's financial results that have not been released to Wall Street yet.
Within your vector database, you should set up authorization to match the access levels of the
original content. This can be done by integrating your vector database with your organization's
identity and access management solution. At their core, vector databases store unstructured data.
Therefore, they should use your data lake as their storage solution.
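One way to picture the authorization pattern described above is sketched below; it is a self-contained toy in which access-control metadata and keyword matching stand in for what a real vector database would do with metadata filters, embedding similarity, and groups sourced from your IAM system.

```python
# A minimal sketch: store access-control metadata alongside each chunk and filter
# at query time so results honor the original document's permissions.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    allowed_groups: set[str]   # copied from the source document's ACL

index = [
    Chunk("hr-handbook", "Vacation policy details ...", {"all-employees"}),
    Chunk("q3-results", "Unreleased Q3 revenue figures ...", {"finance", "executives"}),
]

def search(query: str, user_groups: set[str], top_k: int = 5) -> list[Chunk]:
    # 1. Authorization filter: drop chunks the caller is not allowed to see.
    visible = [c for c in index if c.allowed_groups & user_groups]
    # 2. Relevance ranking would normally be an embedding similarity search;
    #    a keyword match keeps this sketch self-contained.
    return [c for c in visible if query.lower() in c.text.lower()][:top_k]

print(search("revenue", user_groups={"all-employees"}))   # [] - the intern sees nothing
print(search("revenue", user_groups={"finance"}))          # finds the Q3 chunk
```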
Building a document pipeline. Unfortunately, most organizations do not have a single repository with
clean and accurate documents. Rather, documents are spread across the organization in various
team portals in many formats. Consequently, the first step in building a custom corpus is to
build a pipeline that takes only documents that have been approved for use with generative AI
and places them in your vector database. For large global organizations, this could potentially be
the hardest task of a generative AI solution. It is common for teams to have documentation
in draft format in their portals. There may also be documents that are
random musings about what could be. These documents should not become a part of a custom corpus as
they do not accurately represent the business. Unfortunately, filtering these documents will
be a manual effort. A document pipeline should also convert the documents to text. Fortunately,
a few open-source libraries can do this for many of the common document formats.
Additionally, a document pipeline must break documents into small segments before saving
them in the vector database. This is due to limitations on prompt size when these documents
are used for retrieval augmented generation, which will be discussed in a later section.
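A minimal sketch of the chunking step is shown below, assuming the documents have already been converted to plain text; the segment and overlap sizes are illustrative, and real pipelines often chunk by tokens rather than characters and attach source metadata (document ID, ACLs) to each segment before writing it to the vector database.

```python
# A minimal sketch of breaking a document into small, overlapping segments.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping segments small enough to fit in an LLM prompt."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap    # overlap preserves context across segment boundaries
    return chunks

long_text = "..."                # text extracted from an approved document
segments = chunk_text(long_text)
# Each segment would then be embedded and saved to the vector database.
```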
Fine-tuning large language models. When we fine-tune a large language model, we train
it a little more with information in the custom corpus. This could be a good way to get a domain
specific LLM. While this option does require compute to perform the fine-tuning against your
custom corpus, it is not as intensive as training a model from scratch and can be completed in a
modest time frame. If your domain includes terms not
found in everyday usage, fine-tuning may improve the quality of the LLM's responses.
For example, projects that use documents from medical research, environmental research,
and anything related to the natural sciences may benefit from fine-tuning.
Fine-tuning takes the highly specific vernacular found in your documents and bakes it into the
parametric parameters of the model. The advantages and disadvantages of fine-tuning should be
understood before deciding on this approach. Disadvantages: fine-tuning will require compute
resources; explainability is not possible; you will periodically need to fine-tune again with new
data as your corpus evolves; hallucinations are a concern; and document-level security is impossible.
Advantages: the LLM has knowledge from your custom corpus via fine-tuning, and the inference flow is less complicated than RAG.
While fine-tuning is a good way to teach an LLM about the language of your business,
it dilutes the data since most LLMs contain billions
of parameters, and your data will be spread across all these parameters. The biggest disadvantage of
fine-tuning is that document-level authorization is impossible. Once a document is used for fine-tuning,
its information becomes a part of the model. It is not possible to restrict this information
based on the user's authorization levels.
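For a rough sense of the mechanics, here is a minimal fine-tuning sketch assuming the Hugging Face transformers and datasets libraries, which this post does not prescribe; the base model, hyperparameters, and corpus texts are placeholders.

```python
# A minimal fine-tuning sketch: continue training a small causal LLM on corpus
# text so the domain vernacular is baked into its parameters. "gpt2" is purely a
# stand-in; corpus texts would come from your approved document pipeline.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

corpus_texts = ["Domain-specific text from your custom corpus ...",
                "More vernacular-heavy text ..."]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": corpus_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()   # the corpus vernacular is now part of the model's parameters
```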
Let's look at a technique that combines your custom data and parametric data at inference time: retrieval augmented generation (RAG). RAG is a technique that
starts with the question being asked, uses a vector database to marry the question with
additional data, and then passes the question and data to an LLM for content creation.
With RAG, no training is needed because we educate the LLM by sending it relevant text snippets from
our corpus of quality documents. It works like this, using a question-answering task as an example: a user asks
a question in your application's user interface. Your application will take the question, specifically
the words in it, and, using a vector database, search your corpus of quality documents for text snippets that are contextually
relevant. These snippets and the original question get sent to the LLM. This entire
package, the question plus the snippets (the context), is known as a prompt. The LLM will use this information
to generate your answer. This may seem like a silly thing to do.
If you already know the answer (the snippets), why bother with the LLM? Remember, this is happening
in real-time and the goal is to generate text, something you can copy and paste into your
research. You need the LLM to create the text that incorporates the information from your custom
corpus. This is more complicated than fine-tuning. However,
user authorization can be implemented since the documents, or document snippets, are selected
from the vector database at inference time. The information in the documents never becomes a part
of the model's parametric parameters. The advantages and disadvantages of RAG are listed below.
Disadvantages: the inference flow is more complicated.
Advantages: the LLM has direct knowledge from your custom corpus;
explainability is possible; no fine-tuning is needed; hallucinations are significantly
reduced and can be controlled by examining the results from the vector database queries; and authorization can be implemented.
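A minimal sketch of this inference flow is below; the embedding function, vector database client, and LLM client are passed in as stand-ins, since this post does not prescribe any particular implementations.

```python
# A minimal sketch of the RAG flow: embed the question, retrieve relevant
# snippets (honoring authorization), assemble a prompt, and generate an answer.
# `embed`, `vector_db`, and `llm` are hypothetical collaborators passed in by the caller.
def answer_question(question: str, user_groups: set, embed, vector_db, llm) -> str:
    # 1. Turn the question's words into a vector.
    query_vector = embed(question)

    # 2. Retrieve contextually relevant snippets from the custom corpus,
    #    filtering by the caller's authorization at inference time.
    snippets = vector_db.search(vector=query_vector, top_k=5,
                                allowed_groups=list(user_groups))

    # 3. Assemble the prompt: the question plus the retrieved snippets (context).
    context = "\n\n".join(snippets)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # 4. Send the prompt to the LLM for content creation; no training is required.
    return llm.generate(prompt)
```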
Machine learning operations, MLOps. To better
understand the importance of MLOps, it is helpful to compare model creation to conventional
application development. Conventional application development, like implementing a new microservice
that adds a new feature to an application, starts with reviewing a specification. Any new data
structures or any changes to existing data
structures are designed first. The design of the data should not change once coding begins.
The service is then implemented and coding is the main activity in this process.
Unit tests and end-to-end tests are also coded. These tests prove that the code is not faulty and
correctly implements the specification. They can be run automatically by a CI/CD pipeline before deploying the entire application. Creating a model and training it is
different. An understanding of the raw data and the needed prediction is the first step.
ML engineers do have to write some code to implement their neural networks or set up an
algorithm, but coding is not the dominant activity. Repeated experimentation is the main activity. During experimentation, the design of the data, the design of the model,
and the parameters used will all change. After every experiment, metrics are created that show
how the model performed as it was trained. Metrics are also generated for model performance against a
validation set and a test set. These metrics are used to prove the
quality of the model. Once a model is ready to be incorporated into an application, it needs to be
packaged and deployed. MLOps, short for machine learning operations, is a set of practices and
tools aimed at addressing these differences. Experiment tracking and collaboration are the
features most associated with MLOps, but the more modern MLOps
tools in the industry today can do much more. For example, they can provide a runtime environment
for your experiments, and they can package and deploy models once they are ready to be integrated
into an application. Below is a superset of features found in MLOPs tools today.
This list also includes other things to consider, such as support and
data integration. 1. Support from a major player. MLOps techniques and features are constantly
evolving. You want a tool that is backed by a major player, ensuring that the tool is under
constant development and improvement. 2. Modern data lake integration. Experiments
generate a lot of structured and
generate a lot of structured and
unstructured data. Ideally, this could be stored in the data warehouse and the data lake.
However, many MLOPs tools were around before the open table formats that gave rise to the
modern data lake, so most will have a separate solution for their structured data.
3. Experiment tracking. Keep track of each experiment's datasets,
models, hyperparameters, and metrics. Experiment tracking should also facilitate repeatability (a minimal tracking sketch appears after this list).
4. Facilitate collaboration. Allow team members to view the results of all experiments run by
all ML engineers. 5. Model packaging. Package the model such that it is accessible from
other programming environments. 6. Model serving. Deploying models to an organization's
formal environments. You will not need this if you have found a way to incorporate your models into
an existing CI/CD pipeline. 7. Model registry. Maintain all versions of all models. 8.
Serverless functions. Some tools provide features that allow code to be annotated in such a way that
a function or model can be deployed as a containerized service for running experiments
in a cluster. 9. Data pipeline capabilities. Some MLOps tools aim to provide
complete end-to-end
capabilities and have features that allow you to build pipelines for retrieving and
storing your raw data. You will not need this if you already have a data pipeline.
10. Training pipeline capabilities. The ability to orchestrate
your serverless functions into a directed acyclic graph, and to allow for the scheduling and running of training pipelines.
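As a concrete example of experiment tracking (item 3 above), here is a minimal sketch using MLflow, one widely used open-source option; the experiment name and metric values are purely illustrative.

```python
# A minimal experiment-tracking sketch with MLflow. By default, runs are logged
# to a local ./mlruns store; a shared tracking server would be used by a team.
import mlflow

mlflow.set_experiment("recommendation-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_params({"learning_rate": 1e-3, "batch_size": 64, "epochs": 10})

    # ... the training loop goes here ...

    mlflow.log_metric("validation_accuracy", 0.91)   # illustrative values
    mlflow.log_metric("test_accuracy", 0.89)
    # mlflow.log_artifact("confusion_matrix.png")    # attach plots, configs, model files
```

Logging every run this way gives teammates a shared record of which datasets, hyperparameters, and metrics produced each model, which is what makes experiments repeatable.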
The impact of GPUs on your AI data infrastructure. A chain is as strong as its weakest link,
and your AI/ML infrastructure is only as fast as your slowest component. If you train machine
learning models with GPUs, then your weak link may be your storage solution. The result is what we
call the starving GPU problem. The starving GPU problem occurs when your network or your storage
solution cannot serve training data to your training logic fast enough to fully utilize
your GPUs. The symptoms are fairly obvious. If you monitor your GPUs, you will notice that they
never get close to being fully utilized. If you have
instrumented your training code, then you will notice that total training time is dominated by
I/O.
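A minimal sketch of that kind of instrumentation is below; the data loader and training step are passed in as stand-ins for your own code.

```python
# A minimal sketch: split each epoch's wall-clock time into data loading (I/O)
# versus compute, to see whether training is starved by storage and network.
import time

def profile_epoch(dataloader, train_step):
    """Report how much of an epoch was spent waiting on data versus computing."""
    io_seconds, compute_seconds = 0.0, 0.0
    fetch_start = time.perf_counter()
    for batch in dataloader:                       # time spent here is data loading / I/O
        io_seconds += time.perf_counter() - fetch_start
        compute_start = time.perf_counter()
        train_step(batch)                          # forward/backward pass on the GPU
        compute_seconds += time.perf_counter() - compute_start
        fetch_start = time.perf_counter()
    total = max(io_seconds + compute_seconds, 1e-9)
    print(f"I/O: {io_seconds:.1f}s ({100 * io_seconds / total:.0f}% of epoch), "
          f"compute: {compute_seconds:.1f}s")
```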
Unfortunately, there is bad news for those who are wrestling with this issue: GPUs are getting
faster. Let's look at the current state of GPUs and some advances being made with them to understand
how this problem will only get worse in the coming years. The current state of GPUs. GPUs are getting faster. Not only is raw performance getting better,
but memory and bandwidth are also increasing. Let's take a look at these three characteristics
of NVIDIA's most recent GPUs: the A100, the H100, and the H200.

GPU   | Performance   | Memory  | Memory Bandwidth
A100  | 624 TFLOPS    | 40 GB   | 1,555 GB/s
H100  | 1,979 TFLOPS  | 80 GB   | 3.35 TB/s
H200  | 1,979 TFLOPS  | 141 GB  | 4.8 TB/s

Note: the table above uses the statistics that align with a PCIe (Peripheral Component Interconnect Express) socket solution for the A100 and the SXM (Server PCI Express Module) socket solution for the H100 and the H200; SXM statistics do not exist for the A100. With respect to performance, the floating-point 16
tensor core statistic is used for the comparison. A few comparative observations on the statistics
above are worth calling out. First, the H100 and the H200 have the same performance,
1979 teraflops, which is 3.17 times greater than the A100. The H100 has twice as much memory as
the A100, and the memory bandwidth increased by a similar amount, which makes sense; otherwise,
the GPU would starve itself. The H200 can handle a whopping 141 gigabytes of memory and its memory
bandwidth also increased proportionally with respect to the other GPUs. Let's look at
each of these statistics in more detail and discuss what it means to machine learning.
Performance. A teraflop, tflop, is 1 trillion, 10 to the power of 12, floating point operations per
second. That is a 1 with 12 zeros after it, 1 trillion. It is hard to equate TFLOPs to IO
demand in gigabytes as the floating point
operations that occur during model training involve simple tensor math as well as first
derivatives against the loss function, A.K.A. gradients. However, relative comparisons are
possible. Looking at the statistics above, we see that the H100 and the H200, which both perform at 1,979 teraflops, are three times faster,
potentially consuming data three times faster if everything else can keep up.
GPU memory, also known as video RAM or graphics RAM. The GPU memory is separate from the system's
main memory, RAM, and is specifically designed to handle the intensive graphical processing tasks performed
by the graphics card. GPU memory dictates batch size when training models. In the past,
batch size decreased when training logic moved from a CPU to a GPU. However, as GPU memory
catches up with CPU memory in terms of capacity, the batch size used for GPU training will increase. When performance
and memory capacity increase at the same time, the result is larger requests where each gigabyte of
training data is getting processed faster. Memory bandwidth. Think of GPU memory bandwidth as the
highway that connects the memory and computation cores. It determines how much data can be
transferred per unit of time. Just like a
wider highway allows more cars to pass in a given amount of time, a higher memory bandwidth allows
more data to be moved between memory and the GPU. As you can see, the designers of these GPUs
increased the memory bandwidth for each new version proportional to memory. Therefore,
the internal data bus of the chip will not be the bottleneck.
Supercharge object storage for model training. If you are experiencing the starving GPU problem,
then consider using a 100 GB network and NVMe drives. A recent benchmark using MinIO with such a configuration achieved 325 GB per second on GETs and 165 GB per second on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.
As the computing world has evolved and the price of DRAM has plummeted, we find that server
configurations often come with 500GB or more of DRAM. When you are dealing with larger deployments,
even those with ultra-dense NVMe drives, the number of servers multiplied by the DRAM on
those servers can quickly add up, often
to many TBs per instance. That DRAM pool can be configured as a distributed shared pool of memory
and is ideal for workloads that demand massive IOPS and throughput performance. As a result,
we built MinIO Cache to enable our Enterprise and Enterprise Lite customers to configure their
infrastructure to take advantage of this shared memory pool to further improve performance for core AI workloads, like GPU training, while simultaneously retaining
full persistence. A tale of two organizations that take very different approaches on their AI/ML journey. Organization number one has a culture
of iterative improvements. They believe that all big initiatives can be broken down into smaller,
more manageable projects. These smaller projects are then scheduled in such a way that each one
builds on the results of the previous project to solve problems of greater and greater complexity.
They also like these small projects organized in such a way that each one
delivers value to the business. They have found that projects that are purely about improving
infrastructure or modernizing software without any new features for the business are not very
popular with the executives in control of budgets. Consequently, they have learned that requesting
fancy storage appliances and compute clusters for a generative AI proof of concept is not the
best way to orchestrate infrastructure improvements and new software capabilities.
Rather, they will start small with infrastructure products that can scale as they grow,
and they will start with simple AI models so they can get their MLOPs tooling in place and
figure out how to work with existing DevOps teams and CI/CD pipelines.
Organization number two has a shiny objects culture. When the
newest idea enters the industry, it first tackles the highest profile challenge to demonstrate its
technical might. They have found these projects are highly visible both internally and externally.
If something breaks, then smart people can always fix it. Organization number one structured its
first project by building out a
portion of its AI data infrastructure while working on a recommendation model for its main commerce
site. The recommendation model was relatively simple to train. It is a discriminative model
that uses datasets that already exist on a file share. However, at the end of this project, the
team had also built out a small but scalable modern data lake, implemented
MLOps tooling, and had some best practices in place for training and deploying models.
Even though the model is not complicated, it still added a lot of efficiencies to their site.
They used these positive results to get funding for their next project, which will be a generative
AI solution. Organization number two built a chatbot for their e-commerce
site that answered customer questions about products. Large language models are fairly
complicated. The team was not familiar with fine-tuning or retrieval augmented generation,
so all engineer cycles for this project were focused on moving quickly over a steep learning
curve. When the model was complete, it produced okay results, nothing spectacular. Unfortunately,
it had to be manually side-loaded into the pre-production and production environments
because there was no MLOps tooling in place to deploy it. This caused a little friction with the
DevOps team. The model itself also had a few stability issues in production. The cluster it
was running in did not have enough compute for a generative AI workload.
There were a few severity 1 calls, which resulted in an emergency enhancement to the cluster
so the LLM would not fail under heavy traffic conditions.
After the project, a retrospective determined that they needed to augment their infrastructure
if they were going to be successful with AI.
A plan for building your AI/ML data infrastructure. The short story above is a simple narrative of
two extreme circumstances. Building AI models, both discriminative and generative, is significantly
different from conventional software development. This should be taken into account when queuing up
an AI/ML effort. The graphic below is a visual depiction of the story told in the previous
section. It is a side-by-side
comparison of the AI data infrastructure first approach versus the model first approach. As the story above
showed, each of the bricks below for the infrastructure first approach does not have
to be a standalone project. Organizations should look for creative ways to deliver on AI while
their infrastructure is being built out. This can be done by understanding all the possibilities with AI, starting simple, and then picking AI projects of increasing complexity.
Conclusion. This post outlines our experience in
working with enterprises to construct a modern data lake reference architecture for AI/ML.
It identifies the core components, the key building blocks and the trade-offs of different
AI approaches. The foundational element is a modern data lake built on top of an object store.
The object store must be capable of delivering performance at scale,
where scale is hundreds of petabytes and often exabytes.
By following this reference architecture, we anticipate the user will be able to build a
flexible, extensible data infrastructure which, while targeted at AI and ML,
will be equally
performant on all OLAP workloads. To get specific recommendations on the component parts, please
don't hesitate to reach out to me at keith@min.io. Thank you for listening to this HackerNoon
story, read by Artificial Intelligence. Visit HackerNoon.com to read, write, learn and publish.
