The Good Tech Companies - An Architect's Guide to Machine Learning Operations and Required Data Infrastructure
Episode Date: September 5, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/an-architects-guide-to-machine-learning-operations-and-required-data-infrastructure. MLOps is a set of practices and tools aimed at addressing the specific needs of engineers building models and moving them into production. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #mlops, #machine-learning-operations, #machine-learning, #data-engineering, #data-lake, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
An Architect's Guide to Machine Learning Operations and Required Data Infrastructure,
by MinIO. MLOps, short for machine learning operations, is a set of practices and tools
aimed at addressing the specific needs of engineers building models and moving them
into production. Some organizations start off with a few homegrown tools that version
datasets after each experiment
and checkpoint models after every epoch of training. On the other hand, many organizations
have chosen to adopt a formal tool that has experiment tracking, collaboration features,
model-serving capabilities, and even pipeline features for processing data and training models.
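To illustrate what such homegrown tooling often amounts to, here is a minimal sketch: hash a dataset file so each experiment can be tied to exact bytes, and save model state after each epoch under a predictable name. All names, paths, and the JSON state format are illustrative assumptions, not any real tool's interface.

```python
import hashlib
import json
from pathlib import Path

def version_dataset(data_path: str, manifest_path: str = "versions.json") -> str:
    """Record a content hash for a dataset file so an experiment can be
    traced back to the exact data it trained on. Returns the version id."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()[:12]
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    manifest[digest] = {"path": data_path}
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return digest

def checkpoint_model(state: dict, epoch: int, out_dir: str = "checkpoints") -> Path:
    """Save model state after an epoch under a predictable, sortable name."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    target = out / f"model-epoch-{epoch:04d}.json"
    target.write_text(json.dumps(state))
    return target
```

Because the hash is content-derived, re-running `version_dataset` on unchanged data yields the same version id, which is what makes experiments comparable across time.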
To make the best choice for your organization, you should understand all the capabilities available from the leading MLOps tools in the industry. If you go the homegrown route, you should understand the capabilities you are giving up. A homegrown approach is fine for small teams that need to move quickly and may not have time to evaluate a new tool. If you choose
to implement a third-party tool, then you will need to pick the tool that best matches your organization's engineering workflow. This could be tricky because the top tools today
vary significantly in their approach and capabilities. Regardless of your choice,
you will need data infrastructure that can handle large volumes of data and serve training sets in
a performant manner. Checkpointing models and versioning large datasets require scalable capacity, and if you are using expensive GPUs, you will need performant infrastructure to get the most out of your investment. In this post, I will present a feature list that architects should
consider regardless of the approach or tooling they choose. This feature list comes from research and experiments with three of the top MLOps tools: Kubeflow, MLflow, and MLRun. For organizations that choose to start off with a homegrown solution, I will present a data infrastructure that can scale and perform. Spoiler alert: all you need here is MinIO. When it comes to third-party tools, I have noticed a pattern with the vendors I have researched. For organizations that choose to adopt MLOps tooling, I will present this pattern and tie it back to our modern data lake reference architecture.
Before diving into features and infrastructure requirements, let's better understand the importance of MLOps. To do this, it is helpful to compare model creation to conventional application development. Conventional application development, like implementing a new microservice that adds a new feature to an application, starts with reviewing a specification. New data structures, or changes to existing data structures, are designed first. The design of the data should not change once coding begins. The service is then implemented, and coding is the main activity in this process.
Unit tests and end-to-end
tests are also coded. These tests prove that the code is not faulty and correctly implements the
specification. They can be run automatically by a CI/CD pipeline before deploying the entire
application. Creating a model and training it is different. The first step is understanding the raw
data and the needed prediction. ML engineers do have to write some code to implement their neural networks or set up an algorithm, but coding is not the dominant activity.
The main activity is repeated experimentation.
During experimentation, the design of the data, the design of the model, and the parameters used will all change.
After every experiment, metrics are created that show how the model performed as
it was trained. Metrics are also generated to determine model performance against a validation
set and a test set. These metrics are used to prove the quality of the model. You should save
the model after every experiment, and every time you change your datasets, you should save them as
well. Once a model is ready to be incorporated into an application, it must be packaged and deployed. To summarize, MLOps is to machine learning what DevOps is to
traditional software development. Both are a set of practices and principles aimed at improving
collaboration between engineering teams (the dev or ML side) and IT operations (ops) teams. The goal is
to streamline the development lifecycle, from planning and
development to deployment and operations, using automation. One of the primary benefits of these
approaches is continuous improvement. Let's go a little deeper into MLOps and look at specific features to consider.
10 MLOps Features to Consider
Experiment tracking and collaboration are the features most associated with MLOps, but today's more modern MLOps tools can do much more. For example, some can provide a runtime environment for your experiments. Others can package and deploy models once they are ready to be integrated into an application. Below is a superset of features found in MLOps tools today.
This list also includes other things to consider, such as support and
data integration.
1. Support from a major player. MLOps techniques and features are constantly evolving. You want a tool that is backed by a major player. Google, Databricks, and McKinsey and Company back Kubeflow, MLflow, and MLRun, respectively, ensuring constant development and improvement. As a concrete example, many popular tools today were created before large language models (LLMs); consequently, many are adding new features to support generative AI.
2. Modern data lake integration. Experiments generate a lot of structured and unstructured data. An MLOps tool that is fully integrated with the modern data lake, or data lakehouse, would store unstructured data in the data lake (this is MinIO directly), and structured data would go into the data warehouse. Unfortunately, many MLOps tools were around before the open table formats that gave rise to the modern data lake, so most will have a separate solution for their structured data. This is typically an open-source relational database that your data infrastructure will need to support. With respect to unstructured data (datasets and model checkpoints), all the major tools in the industry use MinIO, since we have been around since 2014.
3. Experiment tracking. Probably the most important feature of an MLOps tool is keeping track of datasets, models, hyperparameters, and metrics for each experiment. Experiment tracking should also facilitate repeatability. If you got a desirable result five experiments ago, and the experiments afterward degraded the performance of your model, then you should be able to use your MLOps tool to go back and get the exact hyperparameters and dataset features that produced the desirable result.
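To make that concrete, the bookkeeping an experiment tracker does can be sketched in a few lines. This is a toy illustration, not any particular tool's API; the class and method names are invented for the example.

```python
from datetime import datetime, timezone

class ExperimentTracker:
    """Toy experiment tracker: records hyperparameters, dataset version,
    and metrics per run so any past run can be found and reproduced."""

    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, dataset_version: str, metrics: dict) -> int:
        run_id = len(self.runs)
        self.runs.append({
            "id": run_id,
            "time": datetime.now(timezone.utc).isoformat(),
            "params": params,
            "dataset_version": dataset_version,
            "metrics": metrics,
        })
        return run_id

    def best_run(self, metric: str, maximize: bool = True) -> dict:
        # Retrieve the run that produced the best value for a metric --
        # exactly what you need when later experiments degrade the model.
        return (max if maximize else min)(
            self.runs, key=lambda r: r["metrics"][metric]
        )
```

With this in place, recovering the hyperparameters and dataset version from the best of five runs is a single `best_run("val_acc")` call, rather than a hunt through notebooks.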
4. Facilitate collaboration. An important component of an MLOps tool is the portal or UI used to present the results of each experiment. This portal should be accessible to all team members so that they can see each other's experiments and make recommendations. Some MLOps tools have graphical features that allow custom charts to be created comparing results from experiments.
5. Model packaging. This capability packages a model such that it is accessible from other programming environments, typically as a microservice. This is a nice feature to have. A trained model is nothing more than a serialized object. Many organizations may have this figured out already.
6. Model serving. Once a model is packaged as a service, this feature will allow for the automated deployment of the service containing the model to the organization's formal environments. You will not need this feature if you have a mature CI/CD pipeline capable of managing all software assets across environments.
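Since a trained model is just a serialized object, the core of packaging can be as small as the sketch below. This is an illustration under simplifying assumptions: `LinearModel` is a stand-in for a real trained model, and a real serving layer would wrap `load_model` in an HTTP service rather than calling it directly.

```python
import pickle

class LinearModel:
    """Stand-in for a trained model: a serialized object with a predict()."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

def package_model(model, path: str) -> None:
    # "Packaging" here is just serialization; a serving process
    # would load this file at startup and expose predict() over HTTP.
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path: str):
    with open(path, "rb") as f:
        return pickle.load(f)
```

In practice, formats like ONNX or a framework's own checkpoint format replace raw pickle, but the lifecycle is the same: serialize at the end of training, deserialize inside the serving environment.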
7. Model registry. A model registry provides a view of all the models currently under management by your MLOps tool. After all, the creation of production-grade models is the goal of all MLOps. This view should show models that got deployed to production as well as models that never made it into production. Models that made it into production should be tagged in such a way that you can also determine the version of the application or service that they were deployed into.
8. Serverless functions. Some tools provide features that allow code to be annotated so that a function or module can be deployed as a containerized service for running experiments in a cluster. If you decide to use this feature, then make sure all your engineers are comfortable with this technique. It can be a bit of a learning curve; engineers with a DevOps background will have an easier time, while engineers who previously studied machine learning with little coding experience will struggle.
9. Data pipeline capabilities. Some MLOps tools aim to
provide complete end-to-end capabilities and have features specific to building data pipelines for retrieving raw data, processing it, and storing clean data. Pipelines are usually specified as directed acyclic graphs (DAGs). Some tools also have scheduling capabilities. When used in conjunction with serverless functions, this can be a powerful low-code solution for developing and running data pipelines. You will not need this if you are already using a pipeline or workflow tool.
10. Training pipeline capabilities.
This is similar to data pipelines, but a training pipeline picks up where data pipelines leave off.
A training pipeline allows you to call your data access code, send data to your training logic,
and annotate data artifacts and models so that they are automatically saved.
Similar to data pipelines, this feature can be used in conjunction with serverless functions
to create DAGs and schedule experiments. If you are already using a distributed training tool, then you may not need this feature.
It is possible to start distributed training from a training pipeline, but this could be too complex.
MLOps and Storage
After looking at the differences between
traditional application development and machine learning, it should be clear that to be successful
with machine learning, you need some form of MLOps and a data infrastructure that offers performance and scalable capacity. Homegrown
solutions are fine if you need to start a project quickly and do not have time to evaluate a formal
MLOps tool. If you take this approach, the good news is that all you need for your data infrastructure
is MinIO. MinIO is S3-compatible, so if you started with another tool and used an S3 interface to access your datasets, then your code will just work. If you are starting out, then you can use our Python SDK, which is also S3-compatible. Consider using the enterprise version of MinIO, which has caching capabilities that can greatly speed up data access for training
sets. Check out The Real Reasons Why AI Is Built on Object Storage, where we dive into how and why MinIO is used to support MLOps. Organizations that choose a homegrown solution should still familiarize themselves with the 10 features described above. You may eventually outgrow your homegrown solution, and the most efficient way
forward is to adopt an MLOps tool. Adopting a third-party MLOps tool is the best way to go for large organizations with several AI/ML teams creating models of different types. The MLOps tool with the most features is not necessarily the best tool. Look at the features above and make note of the features that you need, the features you currently have as part of your existing CI/CD pipeline, and, finally, the features you do not want. This will help you find the best fit.
MLOps tools have a voracious appetite for object storage, often petabytes of it. Many of them automatically version your datasets with each experiment and automatically checkpoint your models after each epoch. Here again, MinIO can help, since capacity is not a problem.
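To make that storage appetite concrete, here is a sketch of the per-experiment, per-epoch object layout such a tool might write. The key scheme is hypothetical, and the local directory below stands in for an S3/MinIO bucket so the example is self-contained.

```python
from pathlib import Path

def checkpoint_key(experiment: str, epoch: int) -> str:
    """Object key for a per-epoch model checkpoint in an S3-style bucket."""
    return f"experiments/{experiment}/checkpoints/epoch-{epoch:04d}.pt"

def dataset_key(experiment: str, version: str) -> str:
    """Object key for a versioned dataset snapshot tied to an experiment."""
    return f"experiments/{experiment}/datasets/{version}.parquet"

def put_object(root: Path, key: str, payload: bytes) -> Path:
    # Local stand-in for an object-store put; with the MinIO/S3 API this
    # would be a single put_object call against a bucket.
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```

One checkpoint per epoch times hundreds of experiments is exactly how these tools reach petabyte scale, which is why capacity, not just throughput, matters.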
Similar to the homegrown solution, consider using the enterprise edition of MinIO. The caching features work automatically once configured for a bucket, so even if the MLOps tool does not request the use of the cache, MinIO will automatically cache frequently accessed objects, like a training set.
A Wishlist for the Future
Many of the MLOps tools on the market today use an open-source relational database to store the structured data generated during model training, which is usually metrics and hyperparameters.
Unfortunately, this will be a new database that needs to be supported by your organization.
Additionally, if an organization is moving toward a modern data lake or data lakehouse,
then an additional relational database is not needed. It would be nice for major MLOps vendors to consider using an OTF-based data warehouse to store their structured data.
All the major MLOps vendors use MinIO under the hood to store unstructured data. Unfortunately, this is generally deployed as a separate small instance that is installed as part of the overall larger installation of the MLOps tool. Additionally, it is usually an older version of MinIO, which goes against our ethos of always running the latest and greatest. For existing MinIO customers, it would be nice to allow the MLOps tool to use a bucket within an existing installation. For customers new to MinIO, the MLOps tool should support the latest version of MinIO. Once installed, MinIO can also be used for purposes beyond MLOps within your organization, namely anywhere the strengths of object storage are required.
Conclusion
In this post, I presented an architect's guide to MLOps by investigating both MLOps features and the data infrastructure needed to support these features. At a high level, organizations can build a
homegrown solution, or they can deploy a third-party solution. Regardless of the direction chosen,
it is important to understand all the features available in the industry today.
Homegrown solutions allow you to start a project quickly, but you may soon outgrow your solution.
It is also important to understand your specific needs and how MLOps will work with an existing CI/CD pipeline. Many MLOps tools are feature-rich and contain features that you may never use or that you already have as part of your CI/CD pipeline. To successfully implement MLOps,
you need a data infrastructure that can support it.
In this post, I presented a simple solution for those who chose a homegrown solution and
described what to expect from third-party tools and the resources they require.
I concluded with a wish list for further development of MLOPs tools that would help
them to better integrate with the modern data lake. For more information on using the modern data lake to support AI/ML workloads, check out AI/ML Within a Modern Data Lake. If you have any questions, be sure to reach out to us on Slack. Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
