The Good Tech Companies - An Architect's Guide to Machine Learning Operations and Required Data Infrastructure
Episode Date: September 5, 2024. This story was originally published on HackerNoon at: https://hackernoon.com/an-architects-guide-to-machine-learning-operations-and-required-data-infrastructure. MLOps is a set of practices and tools aimed at addressing the specific needs of engineers building models and moving them into production. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #minio, #minio-blog, #mlops, #machine-learning-operations, #machine-learning, #data-engineering, #data-lake, #good-company, and more. This story was written by: @minio. Learn more about this writer by checking @minio's about page, and for more stories, please visit hackernoon.com.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
An Architect's Guide to Machine Learning Operations and Required Data Infrastructure,
by MinIO. MLOps, short for machine learning operations, is a set of practices and tools
aimed at addressing the specific needs of engineers building models and moving them
into production. Some organizations start off with a few homegrown tools that version
datasets after each experiment
and checkpoint models after every epoch of training. On the other hand, many organizations
have chosen to adopt a formal tool that has experiment tracking, collaboration features,
model-serving capabilities, and even pipeline features for processing data and training models.
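To illustrate what such homegrown tooling often amounts to, here is a minimal sketch: hash a dataset file so each experiment can be tied to exact bytes, and save model state after each epoch under a predictable name. All names, paths, and the JSON state format are illustrative assumptions, not any real tool's interface.

```python
import hashlib
import json
from pathlib import Path

def version_dataset(data_path: str, manifest_path: str = "versions.json") -> str:
    """Record a content hash for a dataset file so an experiment can be
    traced back to the exact data it trained on. Returns the version id."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()[:12]
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    manifest[digest] = {"path": data_path}
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return digest

def checkpoint_model(state: dict, epoch: int, out_dir: str = "checkpoints") -> Path:
    """Save model state after an epoch under a predictable, sortable name."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    target = out / f"model-epoch-{epoch:04d}.json"
    target.write_text(json.dumps(state))
    return target
```

Because the hash is content-derived, re-running `version_dataset` on unchanged data yields the same version id, which is what makes experiments comparable across time.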
To make the best choice for your organization, you should understand all the capabilities available from the leading MLOps tools in the industry. If you go the homegrown route, you should understand the capabilities you are giving up. A homegrown approach is fine for small teams that need to move quickly and may not have time to evaluate a new tool. If you choose
to implement a third-party tool, then you will need to pick the tool that best matches your organization's engineering workflow. This could be tricky because the top tools today
vary significantly in their approach and capabilities. Regardless of your choice,
you will need data infrastructure that can handle large volumes of data and serve training sets in
a performant manner. Checkpointing models and versioning large datasets require scalable capacity, and if you are using expensive GPUs, you will need performant infrastructure to get the most out of your investment. In this post, I will present a feature list that architects should
consider regardless of the approach or tooling they choose. This feature list comes from research and experiments with three of the top MLOps tools: Kubeflow, MLflow, and MLRun. For organizations that choose to start off with a homegrown solution, I will present a data infrastructure that can scale and perform. Spoiler alert: all you need here is MinIO. When it comes to third-party tools, I have noticed a pattern with the vendors I have researched. For organizations that choose to adopt MLOps tooling, I will present this pattern and tie it back to our modern data lake reference architecture.
Before diving into features and infrastructure requirements, let's better understand the importance of MLOps. To do this, it is helpful to compare model creation to conventional application development. Conventional application development, like implementing a new microservice that adds a new feature to an application, starts with reviewing a specification. New data structures, or changes to existing data structures, are designed first. The design of the data should not change once coding begins. The service is then implemented, and coding is the main activity in this process.
Unit tests and end-to-end
tests are also coded. These tests prove that the code is not faulty and correctly implements the
specification. They can be run automatically by a CI/CD pipeline before deploying the entire
application. Creating a model and training it is different. The first step is understanding the raw
data and the needed prediction. ML engineers do have to write some code to implement their neural networks or set up an algorithm, but coding is not the dominant activity.
The main activity is repeated experimentation.
During experimentation, the design of the data, the design of the model, and the parameters used will all change.
After every experiment, metrics are created that show how the model performed as
it was trained. Metrics are also generated to determine model performance against a validation
set and a test set. These metrics are used to prove the quality of the model. You should save
the model after every experiment, and every time you change your datasets, you should save them as
well. Once a model is ready to be incorporated into an application, it must be packaged and deployed. To summarize, MLOps is to machine learning what DevOps is to
traditional software development. Both are a set of practices and principles aimed at improving
collaboration between engineering teams (the dev or ML side) and IT operations (ops) teams. The goal is
to streamline the development lifecycle, from planning and
development to deployment and operations, using automation. One of the primary benefits of these
approaches is continuous improvement. Let's go a little deeper into MLOps and look at specific features to consider.
10 MLOps Features to Consider
Experiment tracking and collaboration are the features most associated with MLOps, but today's more modern MLOps tools can do much more. For example, some can provide a runtime environment for your experiments. Others can package and deploy models once they are ready to be integrated into an application. Below is a superset of features found in MLOps tools today.
This list also includes other things to consider, such as support and
data integration.
1. Support from a major player. MLOps techniques and features are constantly evolving. You want a tool that is backed by a major player. Google, Databricks, and McKinsey and Company back Kubeflow, MLflow, and MLRun, respectively, ensuring constant development and improvement. As a concrete example, many popular tools today were created before large language models (LLMs); consequently, many are adding new features to support generative AI.
2. Modern data lake integration. Experiments generate a lot of structured and unstructured data. An MLOps tool that is fully integrated with the modern data lake, or data lakehouse, would store unstructured data in the data lake (this is MinIO directly), and structured data would go into the data warehouse. Unfortunately, many MLOps tools were around before the open table formats that gave rise to the modern data lake, so most will have a separate solution for their structured data. This is typically an open-source relational database that your data infrastructure will need to support. With respect to unstructured data (datasets and model checkpoints), all the major tools in the industry use MinIO, since we have been around since 2014.
3. Experiment tracking. Probably the most important feature of an MLOps tool is keeping track of datasets, models, hyperparameters, and metrics for each experiment. Experiment tracking should also facilitate repeatability. If you got a desirable result five experiments ago, and the experiments afterward degraded the performance of your model, then you should be able to use your MLOps tool to go back and get the exact hyperparameters and dataset features that produced the desirable result.
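To make that concrete, the bookkeeping an experiment tracker does can be sketched in a few lines. This is a toy illustration, not any particular tool's API; the class and method names are invented for the example.

```python
from datetime import datetime, timezone

class ExperimentTracker:
    """Toy experiment tracker: records hyperparameters, dataset version,
    and metrics per run so any past run can be found and reproduced."""

    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, dataset_version: str, metrics: dict) -> int:
        run_id = len(self.runs)
        self.runs.append({
            "id": run_id,
            "time": datetime.now(timezone.utc).isoformat(),
            "params": params,
            "dataset_version": dataset_version,
            "metrics": metrics,
        })
        return run_id

    def best_run(self, metric: str, maximize: bool = True) -> dict:
        # Retrieve the run that produced the best value for a metric --
        # exactly what you need when later experiments degrade the model.
        return (max if maximize else min)(
            self.runs, key=lambda r: r["metrics"][metric]
        )
```

With this in place, recovering the hyperparameters and dataset version from the best of five runs is a single `best_run("val_acc")` call, rather than a hunt through notebooks.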
4. Facilitate collaboration. An important component of an MLOps tool is the portal or UI used to present the results of each experiment. This portal should be accessible to all team members so that they can see each other's experiments and make recommendations. Some MLOps tools have graphical features that allow custom charts to be created comparing results from experiments.
5. Model packaging. This capability packages a model such that it is accessible from other programming environments, typically as a microservice. This is a nice feature to have. A trained model is nothing more than a serialized object. Many organizations may have this figured out already.
6. Model serving. Once a model is packaged as a service, this feature will allow for the automated deployment of the service containing the model to the organization's formal environments. You will not need this feature if you have a mature CI/CD pipeline capable of managing all software assets across environments.
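Since a trained model is just a serialized object, the core of packaging can be as small as the sketch below. This is an illustration under simplifying assumptions: `LinearModel` is a stand-in for a real trained model, and a real serving layer would wrap `load_model` in an HTTP service rather than calling it directly.

```python
import pickle

class LinearModel:
    """Stand-in for a trained model: a serialized object with a predict()."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

def package_model(model, path: str) -> None:
    # "Packaging" here is just serialization; a serving process
    # would load this file at startup and expose predict() over HTTP.
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path: str):
    with open(path, "rb") as f:
        return pickle.load(f)
```

In practice, formats like ONNX or a framework's own checkpoint format replace raw pickle, but the lifecycle is the same: serialize at the end of training, deserialize inside the serving environment.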
7. Model registry. A model registry provides a view of all the models currently under management by your MLOps tool. After all, the creation of production-grade models is the goal of all MLOps. This view should show models that got deployed to production as well as models that never made it into production. Models that made it into production should be tagged in such a way that you can also determine the version of the application or service that they were deployed into.
8. Serverless functions. Some tools provide features that allow code to be annotated so that a function or module can be deployed as a containerized service for running experiments in a cluster. If you decide to use this feature, then make sure all your engineers are comfortable with this technique. It can be a bit of a learning curve; engineers with a DevOps background will have an easier time, while engineers who previously studied machine learning with little coding experience will struggle.
9. Data pipeline capabilities. Some MLOps tools aim to
provide complete end-to-end capabilities and have features specific to building data pipelines for retrieving raw data, processing it, and storing clean data. Pipelines are usually specified as directed acyclic graphs (DAGs). Some tools also have scheduling capabilities. When used in conjunction with serverless functions, this can be a powerful low-code solution for developing and running data pipelines. You will not need this if you are already using a pipeline or workflow tool.
10. Training pipeline capabilities.
This is similar to data pipelines, but a training pipeline picks up where data pipelines leave off.
A training pipeline allows you to call your data access code, send data to your training logic,
and annotate data artifacts and models so that they are automatically saved.
Similar to data pipelines, this feature can be used in conjunction with serverless functions
to create DAGs and schedule experiments. If you are already using a distributed training tool, then you may not need this feature.
It is possible to start distributed training from a training pipeline, but this could be too complex.
MLOps and Storage
After looking at the differences between
traditional application development and machine learning, it should be clear that to be successful
with machine learning, you need some form of MLOps and a data infrastructure that offers performance and scalable capacity. Homegrown
solutions are fine if you need to start a project quickly and do not have time to evaluate a formal
MLOps tool. If you take this approach, the good news is that all you need for your data infrastructure
is MinIO. MinIO is S3-compatible, so if you started with another tool and used an S3 interface to access your datasets, then your code will just work. If you are starting out, then you can use our Python SDK, which is also S3-compatible. Consider using the enterprise version of MinIO, which has caching capabilities that can greatly speed up data access for training
sets. Check out The Real Reasons Why AI Is Built on Object Storage, where we dive into how and why MinIO is used to support MLOps. Organizations that choose a homegrown solution should still familiarize themselves with the 10 features described above. You may eventually outgrow your homegrown solution, and the most efficient way
forward is to adopt an MLOps tool. Adopting a third-party MLOps tool is the best way to go for large organizations with several AI/ML teams creating models of different types. The MLOps tool with the most features is not necessarily the best tool. Look at the features above and make note of the features that you need, the features you currently have as part of your existing CI/CD pipeline, and, finally, the features you do not want. This will help you find the best fit.
MLOps tools have a voracious appetite for object storage, often petabytes of it. Many of them automatically version your datasets with each experiment and automatically checkpoint your models after each epoch. Here again, MinIO can help, since capacity is not a problem.
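To make that storage appetite concrete, here is a sketch of the per-experiment, per-epoch object layout such a tool might write. The key scheme is hypothetical, and the local directory below stands in for an S3/MinIO bucket so the example is self-contained.

```python
from pathlib import Path

def checkpoint_key(experiment: str, epoch: int) -> str:
    """Object key for a per-epoch model checkpoint in an S3-style bucket."""
    return f"experiments/{experiment}/checkpoints/epoch-{epoch:04d}.pt"

def dataset_key(experiment: str, version: str) -> str:
    """Object key for a versioned dataset snapshot tied to an experiment."""
    return f"experiments/{experiment}/datasets/{version}.parquet"

def put_object(root: Path, key: str, payload: bytes) -> Path:
    # Local stand-in for an object-store put; with the MinIO/S3 API this
    # would be a single put_object call against a bucket.
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```

One checkpoint per epoch times hundreds of experiments is exactly how these tools reach petabyte scale, which is why capacity, not just throughput, matters.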
Similar to the homegrown solution, consider using the enterprise edition of MinIO. The caching features work automatically once configured for a bucket, so even if the MLOps tool does not request the use of the cache, MinIO will automatically cache frequently accessed objects, like a training set.
A Wishlist for the Future
Many of the MLOps tools on the market today use an open-source relational database to store the structured data generated during model training, which is usually metrics and hyperparameters.
Unfortunately, this will be a new database that needs to be supported by your organization.
Additionally, if an organization is moving toward a modern data lake or data lakehouse,
then an additional relational database is not needed. It would be nice for major MLOps vendors to consider using an OTF-based data warehouse to store their structured data.
All the major MLOps vendors use MinIO under the hood to store unstructured data. Unfortunately, this is generally deployed as a separate small instance that is installed as part of the overall larger installation of the MLOps tool. Additionally, it is usually an older version of MinIO, which goes against our ethos of always running the latest and greatest. For existing MinIO customers, it would be nice to allow the MLOps tool to use a bucket within an existing installation. For customers new to MinIO, the MLOps tool should support the latest version of MinIO. Once installed, MinIO can also be used for purposes beyond MLOps within your organization, namely anywhere the strengths of object storage are required.
Conclusion
In this post, I presented an architect's guide to MLOps by investigating both MLOps features and the data infrastructure needed to support these features. At a high level, organizations can build a
homegrown solution, or they can deploy a third-party solution. Regardless of the direction chosen,
it is important to understand all the features available in the industry today.
Homegrown solutions allow you to start a project quickly, but you may soon outgrow your solution.
It is also important to understand your specific needs and how MLOps will work with an existing CI/CD pipeline. Many MLOps tools are feature-rich and contain features that you may never use or that you already have as part of your CI/CD pipeline. To successfully implement MLOps,
you need a data infrastructure that can support it.
In this post, I presented a simple solution for those who chose a homegrown solution and
described what to expect from third-party tools and the resources they require.
I concluded with a wish list for further development of MLOPs tools that would help
them to better integrate with the modern data lake. For more information on using the modern data lake to support AI/ML workloads, check out AI/ML Within a Modern Data Lake. If you have any questions, be sure to reach out to us on Slack. Thank you for listening to this HackerNoon story,
read by Artificial Intelligence. Visit hackernoon.com to read, write, learn and publish.
