The Good Tech Companies - The Real Reasons Why AI is Built on Object Storage

Episode Date: August 29, 2024

This story was originally published on HackerNoon at: https://hackernoon.com/the-real-reasons-why-ai-is-built-on-object-storage. From no limits on unstructured data to having greater control over serving models, here are some reasons why AI is built on object storage. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #artificial-intelligence, #minio, #minio-blog, #object-storage, #data-science, #s3, #aiml-workflows, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com. MinIO Object Store is the de facto standard for massive unstructured data lakes. MinIO is compatible with all the modern machine learning frameworks. It is 100% S3 API-compatible, so you can perform ML workloads against your on-premise or on-device object store.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. The real reasons why AI is built on object storage, by MinIO. Number 1. No limits on unstructured data. In the current paradigm of machine learning, performance and capability scale with compute, which is really a proxy for dataset size and model size (Scaling Laws for Neural Language Models, Kaplan et al.). Over the past few years, this has brought on sweeping changes to how machine learning and data infrastructure is built, namely: the separation of storage and compute, the construction of massive cloud-native data lakes filled with unstructured data, and specialized hardware that can do matrix multiplication really fast.
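The storage-compute decoupling described above can be sketched in plain Python. This is a minimal illustration, not MinIO's actual API: `FakeClient` is a hypothetical in-memory stand-in for a real S3-compatible client, and `ShardedObjectReader` shows the pattern of pulling one shard over the network at a time so the full dataset never has to fit in memory or on local disk.

```python
from typing import Iterator, List

class ShardedObjectReader:
    """Iterate over dataset shards stored as objects, fetching one shard
    at a time so the whole dataset never needs to fit in memory."""

    def __init__(self, client, bucket: str, shard_keys: List[str]):
        self.client = client          # anything exposing get_object(bucket, key) -> bytes
        self.bucket = bucket
        self.shard_keys = shard_keys

    def __iter__(self) -> Iterator[bytes]:
        for key in self.shard_keys:
            # Each shard is fetched on demand and can be garbage-collected
            # before the next one is pulled.
            yield self.client.get_object(self.bucket, key)

# Hypothetical in-memory stand-in for an S3-compatible client, for illustration only.
class FakeClient:
    def __init__(self, objects):
        self.objects = objects

    def get_object(self, bucket: str, key: str) -> bytes:
        return self.objects[(bucket, key)]

store = FakeClient({("training-data", f"shard-{i}"): f"records-{i}".encode()
                    for i in range(3)})
reader = ShardedObjectReader(store, "training-data",
                             [f"shard-{i}" for i in range(3)])
shards = list(reader)  # -> [b"records-0", b"records-1", b"records-2"]
```

In a real deployment, the fake client would be replaced by an actual SDK client pointed at the object store, but the one-shard-at-a-time iteration pattern stays the same.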
Starting point is 00:00:42 When a training dataset, or even an individual shard of a dataset, requires more space than is available in system memory and/or local storage, the importance of decoupling storage from compute becomes glaringly evident. When training on data that resides on the MinIO object store, there are no limits to your training data size. Due to MinIO's focus on simplicity and I/O throughput, it is the network that becomes the sole limiting factor for training speed and GPU utilization. In addition to affording the best performance of any object store, MinIO is compatible with all the modern machine learning frameworks. The MinIO object store is also 100% S3 API-compatible, so you can perform ML workloads
Starting point is 00:01:21 against your on-premise or on-device object store using familiar dataset utilities like the Torch Data S3 datapipe. In the event where file system-like capabilities are required by your consuming application, you can even use Minio with object store file interfaces like Mountpoint S3 or S3FS. In a future blog post, we will use the Minio Python SDK in custom implementations of some common PyTorch and FairSeq interfaces, like Dataset and Task, respectively, in order to enable no-limits training data and high GPU utilization for model training. Beyond performance and compatibility with the modern ML stack, the design choices of object storage, namely, 1. a flat namespace, 2. the encapsulation of the whole
Starting point is 00:02:06 object and its metadata as the lowest logical entity, and 3. simple HTTP verb APIs, are what have led to object storage becoming the de facto standard for massive unstructured data lakes. A look at the recent history of machine learning shows that training data, and in a sense model architectures themselves, have become less structured and more general. It used to be the case that models were predominantly trained on tabular data. Nowadays, there is a much broader range, from paragraphs of plain text to hours of video. As model architectures and ML applications evolve, object stores' stateless, schema-less, and, consequently,
Starting point is 00:02:45 scalable nature only becomes more critical. Number 2. Rich metadata for models and datasets. Due to the design choices of the MinIO object store, every object can contain schema-less metadata without sacrificing performance or requiring the use of a dedicated metadata server. Imagination is really the only limit when it comes to what kind of metadata you want to add to your objects. However, here are some ideas that could be particularly useful for ML-related objects. For model checkpoints: loss function value, time taken for training,
Starting point is 00:03:18 dataset used for training. For datasets, name of paired index files, if applicable, dataset category, train, validation, test, information about the dataset's format. Highly descriptive metadata like this can be particularly powerful when paired with the ability to efficiently index and query this metadata, even across billions of objects, something that the Minio Enterprise Catalog affords.
Starting point is 00:03:42 For example, you could query for model checkpoints that are tagged as tested or checkpoints that have been trained on a particular dataset. 3. Models and datasets are available, auditable, and versionable as machine learning models and their datasets become increasingly critical assets, it has become just as important to store and manage these assets in a way that is fault-tolerant, auditable, and versionable. Datasets and the models that train on them are valuable assets that are the hard-earned products of time, engineering effort, and money. Accordingly, they should be protected in a way that doesn't encumber access by applications. Minio's in-line operations like bitrot checking and erasure coding, along with features like multi-site, active-active replication ensure resilience of
Starting point is 00:04:25 these objects at scale. With generative AI in particular, knowing which version of which dataset was used to train a particular model that is being served is helpful when debugging hallucinations and other model misbehavior. If model checkpoints are properly versioned, it becomes easier to trust a quick rollback to a previously served version of the checkpoint. With the Minio Object Store, you get these benefits to a previously served version of the checkpoint. With the Minio Object Store, you get these benefits for your objects right out of the box. 4. Owned Serving Infrastructure The Minio Object Store is, fundamentally, an object store that you, or your organization, controls. Whether the use case is for prototyping, security, regulatory, or economic purposes, control is the common thread.
Starting point is 00:05:08 Accordingly, if trained model checkpoints reside on the object store, it affords you greater control over the task of serving models for inference or consumption. In a previous post, we explored the benefits of storing model files on the object store and how to serve them directly with the TorchServe inference framework from PyTorch. However, this is an entirely model- and framework-agnostic strategy. But why does this matter? Network lag or outages on third-party model repositories could make models slow to serve for inference, or entirely unavailable. Furthermore, in a production environment where inference servers are scaling and need to pull model checkpoints routinely, this problem can be exacerbated.
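One common way to soften the repeated-pull problem above is a local cache in front of the store. A minimal sketch, assuming a `fetch` callable that stands in for whatever client actually downloads the checkpoint bytes (the class and names here are hypothetical, not from any SDK):

```python
class CachingModelLoader:
    """Pull a checkpoint once from the object store, then serve later
    requests from a local cache so scaling inference replicas do not
    re-download the same bytes."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable (key) -> bytes, e.g. a store client
        self.cache = {}
        self.fetch_count = 0      # counts actual trips to the store

    def load(self, key: str) -> bytes:
        if key not in self.cache:
            self.fetch_count += 1
            self.cache[key] = self.fetch(key)
        return self.cache[key]

# Stand-in fetcher; in practice this would call the object store over the network.
loader = CachingModelLoader(lambda key: b"checkpoint-bytes")
first = loader.load("models/ckpt-003.pt")   # fetched from the store
second = loader.load("models/ckpt-003.pt")  # served from the local cache
```

The same idea applies whether the cache is in memory, on local disk, or a node-local sidecar; the store you control remains the single source of truth.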
Starting point is 00:05:45 In the most secure and/or critical of circumstances, it's best to avoid third-party dependencies over the internet where you can. With MinIO as a private or hybrid cloud object store, it is possible to avoid these problems entirely. Closing thoughts. These four reasons are by no means an exhaustive list. Developers and organizations use MinIO object storage for their AI workloads for a whole variety of reasons, ranging from ease of development to
Starting point is 00:06:10 its super-light footprint. In the beginning of this post, we covered the driving forces behind the adoption of high-performance object store for AI. Whether or not the scaling laws hold, what's certainly going to be true is that organizations and their AI workloads will always benefit from the best I.O. throughput capability available. In addition to that, we can be fairly confident that developers will never ask for APIs that are harder to use and software that does not just work. In any future where these assumptions hold, high-performance object store is the way. For any architects and engineering decision-makers reading this, many of the best practices mentioned here can be automated to ensure object storage is leveraged
Starting point is 00:06:49 in a way that makes your AI ML workflows simpler and more scalable. This can be done through the use of any of the modern MLOPs tool sets. AI MLSME Keith Pijanowski has explored many of these tools. Search our blog site for Kubeflow, MLflow, and MLrun for more information on MLOP's tooling. However, if these MLOP's tools are not an option for your organization and you need to get going quickly, then the techniques shown in this post are the best way to get started managing your AI, ML workflows with Minio. For developers, or anybody who's curious slightly smiling face, in a future blog post, we will don't end to end walkthrough of adapting AML framework to leverage object store
Starting point is 00:07:31 with the goal of, no limits, training data and proper GPU utilization. Thanks for reading, I hope it was informative. As always, if you have any questions join our slack channel or drop us a note at hello admin eo thank you for listening to this hackernoon story read by artificial intelligence visit hackernoon.com to read write learn and publish
