The Good Tech Companies - The Real Reasons Why AI is Built on Object Storage
Episode Date: August 29, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/the-real-reasons-why-ai-is-built-on-object-storage. From no limits on unstructured data to having greater control over serving models, here are some reasons why AI is built on object storage. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #artificial-intelligence, #minio, #minio-blog, #object-storage, #data-science, #s3, #aiml-workflows, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. MinIO Object Store is the de facto standard for massive unstructured data lakes. MinIO is compatible with all the modern machine learning frameworks. It is 100% S3 API-compatible, so you can perform ML workloads against your on-premise or on-device object store.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
The real reasons why AI is built on object storage, by MinIO.
1. No limits on unstructured data
In the current paradigm of machine learning, performance and capability scale with compute, which is really a proxy for dataset size and model size ("Scaling Laws for Neural Language Models," Kaplan et al.). Over the past few years, this has brought about sweeping changes to how machine learning and data infrastructure are built, namely, the separation of storage and compute,
the construction of massive cloud-native data lakes filled with unstructured data,
and specialized hardware that can do matrix multiplication really fast.
When a training dataset, or even an individual shard of a dataset,
requires more space than is available in system memory and/or local storage,
the importance of decoupling storage from compute becomes glaringly evident.
When training on data that resides on the MinIO object store, there are no limits to your training data size. Due to MinIO's focus on simplicity and I/O throughput, it is the network that becomes the sole limiting factor for training speed and GPU utilization. In addition to affording the best performance of any object store, MinIO is compatible with all the modern machine learning frameworks. The MinIO object store is also 100% S3 API-compatible, so you can perform ML workloads against your on-premise or on-device object store using familiar dataset utilities like the TorchData S3 datapipe. If file system-like capabilities are required by your consuming application, you can even use MinIO with object store file interfaces like Mountpoint for Amazon S3 or S3FS. In a future blog post, we will use the MinIO Python SDK in custom implementations of some
common PyTorch and FairSeq interfaces, like Dataset and Task, respectively, in order to
enable no-limits training data and high GPU utilization for model training.
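As a preview of that pattern, here is a minimal sketch of a map-style PyTorch Dataset backed by the MinIO Python SDK. The endpoint, credentials, bucket, and prefix are placeholders, and this is an illustrative sketch rather than the implementation that future post will present:

```python
from minio import Minio
from torch.utils.data import Dataset


class MinioObjectDataset(Dataset):
    """Map-style dataset that lazily reads raw samples from a MinIO bucket."""

    def __init__(self, endpoint: str, bucket: str, prefix: str):
        # Hypothetical endpoint and credentials for illustration.
        self.client = Minio(endpoint, access_key="...", secret_key="...")
        self.bucket = bucket
        # Enumerate object keys up front; object bytes are fetched per item,
        # so dataset size is bounded by the bucket, not by local storage.
        self.keys = [
            obj.object_name
            for obj in self.client.list_objects(bucket, prefix=prefix, recursive=True)
        ]

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int) -> bytes:
        response = self.client.get_object(self.bucket, self.keys[idx])
        try:
            return response.read()  # decode into tensors in a transform/collate step
        finally:
            response.close()
            response.release_conn()
```

Wrapped in a DataLoader, a dataset like this lets training iterate over a bucket of arbitrary size while holding only the current batch's objects in memory.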
Beyond performance and compatibility with the modern ML stack, the design choices of
object storage, namely (1) a flat namespace, (2) the encapsulation of the whole object and its metadata as the lowest logical entity, and (3) simple HTTP verb APIs, are what
have led to object storage becoming the de facto standard for massive unstructured data lakes.
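To make the simple-HTTP-verbs point concrete, here is a small sketch, assuming the MinIO Python SDK and the requests library, with a hypothetical bucket and object name: a presigned URL turns an object in the flat namespace into something any HTTP client can GET directly.

```python
from datetime import timedelta

import requests
from minio import Minio

# Hypothetical endpoint and credentials for illustration.
client = Minio("play.min.io", access_key="...", secret_key="...")

# Mint a time-limited URL for one object in the flat namespace.
url = client.presigned_get_object(
    "training-data", "corpus/shard-00042.tar", expires=timedelta(hours=1)
)

# Any HTTP client can now fetch the object with a plain GET.
blob = requests.get(url).content
```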
A look at the recent history of machine learning shows that training data, and in a sense, model architectures themselves, have become less structured and more general.
It used to be the case that models were predominantly trained on tabular data.
Nowadays, there is a much broader range, from paragraphs of plain text to hours of video.
As model architectures and ML applications evolve, object stores' stateless, schema-less, and consequently scalable nature only becomes more critical.
2. Rich metadata for models and datasets
Due to the design choices of the MinIO object store, every object can contain
schema-less metadata without sacrificing performance or requiring the use of a
dedicated metadata server. Imagination is really the only limit when it comes to what kind of metadata you want to add to your objects.
However, here are some ideas that could be particularly useful for ML-related objects.
For model checkpoints:
- loss function value
- time taken for training
- dataset used for training
For datasets:
- name of paired index files (if applicable)
- dataset category (train, validation, test)
- information about the dataset's format
Highly descriptive metadata like this can be
particularly powerful when paired with the ability to efficiently index and query this metadata,
even across billions of objects, something that the MinIO Enterprise Catalog affords.
For example, you could query for model checkpoints that are tagged as
tested, or checkpoints that have been trained on a particular dataset.
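As a concrete illustration, here is a minimal sketch of attaching that kind of metadata at upload time with the MinIO Python SDK's fput_object; the bucket, object name, and metadata values are hypothetical:

```python
from minio import Minio

# Hypothetical endpoint and credentials for illustration.
client = Minio("play.min.io", access_key="...", secret_key="...")

# Upload a checkpoint with schema-less, ML-specific user metadata attached
# directly to the object; no dedicated metadata server is involved.
client.fput_object(
    "model-checkpoints",
    "llm/checkpoint-epoch-12.pt",
    "./checkpoint-epoch-12.pt",
    metadata={
        "loss": "0.417",
        "training-time-hours": "6.5",
        "dataset": "wikitext-103-v1",
        "status": "tested",
    },
)
```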
3. Models and datasets are available, auditable, and versionable
As machine learning models and their datasets become increasingly critical assets, it has become just as important to store and manage these assets in a way that is fault-tolerant, auditable, and versionable.
Datasets and the models trained on them are valuable assets that are the hard-earned products of time, engineering effort, and money. Accordingly, they should be protected in a way that doesn't encumber access by applications. MinIO's inline operations like bitrot checking and erasure coding, along with features like multi-site, active-active replication, ensure the resilience of
these objects at scale. With generative AI in particular, knowing which version of which
dataset was used to train a particular model that is being served is helpful when debugging
hallucinations and other model misbehavior. If model checkpoints are properly versioned,
it becomes easier to trust a quick rollback to a previously served version of the checkpoint.
With the MinIO Object Store, you get these benefits for your objects right out of the box.
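For a sense of what this looks like in practice, here is a minimal sketch using the MinIO Python SDK; the bucket, object name, and version ID are hypothetical. With bucket versioning enabled, every overwrite of a checkpoint keeps prior versions addressable, which is what makes a trusted rollback possible.

```python
from minio import Minio
from minio.commonconfig import ENABLED
from minio.versioningconfig import VersioningConfig

# Hypothetical endpoint and credentials for illustration.
client = Minio("play.min.io", access_key="...", secret_key="...")

# Keep every version of every checkpoint written to this bucket.
client.set_bucket_versioning("model-checkpoints", VersioningConfig(ENABLED))

# Audit the history of a checkpoint: each entry carries its own version_id.
for obj in client.list_objects(
    "model-checkpoints", prefix="llm/checkpoint.pt", include_version=True
):
    print(obj.object_name, obj.version_id, obj.is_latest)

# Roll back by reading a specific, previously served version.
response = client.get_object(
    "model-checkpoints", "llm/checkpoint.pt", version_id="<known-good-version-id>"
)
```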
4. Owned serving infrastructure
The MinIO Object Store is, fundamentally, an object store that you, or your organization, controls. Whether the use case is for prototyping, security, regulatory, or economic purposes, control is the common thread.
Accordingly, if trained model checkpoints reside on the object store,
it affords you greater control over the task of serving models for inference or consumption.
In a previous post, we explored the benefits of storing model files on object store and how to serve them directly with the TorchServe inference framework from PyTorch.
However, this is an entirely model- and framework-agnostic strategy. But why does this matter? Network lag or outages on third-party
model repositories could make models slow to get served for inference or entirely unavailable.
Furthermore, in a production environment where inference servers are scaling and need to pull
model checkpoints routinely, this problem can be exacerbated.
In the most secure and or critical of circumstances, it's best to avoid third-party dependency
over the internet where you can.
With MinIO as a private or hybrid cloud object store, it is possible to avoid these problems
entirely.
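As an illustration of the pattern (not the TorchServe setup from the earlier post), here is a minimal sketch in which an inference server pulls a checkpoint from a private MinIO deployment instead of a third-party repository; the endpoint, bucket, and paths are hypothetical:

```python
import torch
from minio import Minio

# Hypothetical private deployment and credentials for illustration.
client = Minio("minio.internal:9000", access_key="...", secret_key="...", secure=False)

# Pull the checkpoint from infrastructure you control; no dependency on a
# third-party model hub over the public internet.
client.fget_object("model-checkpoints", "llm/checkpoint.pt", "/tmp/checkpoint.pt")

state_dict = torch.load("/tmp/checkpoint.pt", map_location="cpu")
# model.load_state_dict(state_dict)  # load into your model class before serving
```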
Closing thoughts
These four reasons are by no means an exhaustive list.
Developers and organizations use MinIO object storage for their AI workloads for a whole variety of reasons, ranging from ease of development to its super-light footprint. At the beginning of this post, we covered the driving forces behind
the adoption of high-performance object store for AI. Whether or not the scaling laws hold,
what's certainly going to be true is that organizations and their AI workloads will
always benefit from the best I/O throughput capability available. In addition to that,
we can be fairly confident that developers will never ask for APIs that are harder to use or for software that does not just work. In any future where these assumptions hold, high-performance
object store is the way. For any architects and engineering decision-makers reading this,
many of the best practices mentioned here can be automated to ensure object storage is leveraged
in a way that makes your AI/ML workflows simpler and more scalable. This can be done through the use of any of the modern MLOps toolsets. AI/ML SME Keith Pijanowski has explored many of these tools.
Search our blog site for Kubeflow, MLflow, and MLRun for more information on MLOps tooling. However, if these MLOps tools are not an option
for your organization and you need to get going quickly, then the techniques shown in this post
are the best way to get started managing your AI/ML workflows with MinIO. For developers,
or anybody who's curious 🙂, in a future blog post, we will do an end-to-end walkthrough of adapting an ML framework to leverage object store, with the goal of no-limits training data and proper GPU utilization. Thanks for reading,
I hope it was informative. As always, if you have any questions, join our Slack channel or drop us a note at hello@min.io.
Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn, and publish.
