Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x09: Focusing MLOps on the Data Scientist with Adam Probst of ZenML
Episode Date: November 2, 2021
Many data scientists and ML engineers have faced the challenge of putting AI models into production, and this is the core of MLOps. In this episode, Adam Probst, Co-Founder of ZenML, joins Frederic Van Haren and Stephen Foskett to discuss the challenges of putting ML models into production. Machine learning pipelines are inherently complex and fragile and require feedback and tuning, and this requires a new approach with continuous improvement and tight integration. Although reminiscent of DevOps, MLOps demands even more collaboration between IT operations, developers and data scientists, and lines of business. ZenML provides ready-to-use MLOps infrastructure to these groups so they can focus on the model rather than the platform.
Three Questions
Stephen: How big can ML models get? Will today's hundred-billion parameter model look small tomorrow, or have we reached the limit?
Frederic: Is MLOps a lasting trend or just a step on the way to ML and DevOps becoming normal?
Zach DeMeyer: What's the most innovative use of AI you've seen in the real world?
Guests and Hosts
Adam Probst, Co-Founder, ZenML. Connect with ZenML on GitHub, LinkedIn and on Twitter @zenml_io.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 11/02/2021
Tags: @zenml_io, @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
So over the last few months,
and actually the last years of this podcast,
we've talked quite a lot about AI ops,
which is the use of AI to support enterprise operations in IT. We've also talked a lot about
MLOps, which is essentially about improving the structure and operations
around machine learning. Frederic, I think these two terms can be a little confusing.
And I think that some people,
especially on the IT side of things,
don't really understand what MLOps is.
Right, a lot of work goes into building models,
and the real challenge is to go from prototype
and experimenting to production in a structured fashion.
And so MLOps is really a mechanism to help provide a repeatable and
structured way to build models, which is really key, right? It doesn't make a lot of sense if
you can build a model, but you don't have the ability to rebuild that model over and over and
over at will. And so I think today would be good to hear from ZenML
how this can be done in a repeatable
and controlled manner.
And luckily they actually use open source.
So I would really be interested to hear more
about their framework.
Absolutely.
And so as Frederic mentioned, we're joined here
from ZenML by Adam Probst.
Adam, why don't you go ahead and introduce yourself
a little bit and then we'll dive in
and talk a little bit more about MLOps. Yes, for sure. Hi, Stephen. Hi, Frederic. I'm
Adam. I'm a co-creator of ZenML, and we were facing exactly these problems in our earlier
startups. We were using predictive maintenance for vehicle and maintenance optimization in big
commercial vehicle fleets, and then figured out
that the much bigger problem we were solving was not preventing the trucks from breaking down,
but bringing the machine learning models we built into production. And we didn't do that
just once or twice, but a hundred times, and this is when we saw that there
is a much bigger thing out there that needs to be solved.
And not just for predictive maintenance, but so many other AI use cases.
And this is how we dove into MLOps.
Yeah, it does seem like, well, there's almost a stigma attached to machine learning and
data science that it's sort of a game for academics or, I don't know,
an experiment, not a real production application for business. And my understanding is that you
sort of went through that learning process as well, and you realized that making this part of
the business, making this a real application to do real work, was the real work
that needed to be done. Is that right? Yes, exactly. Definitely.
First of all, you have to understand how the whole MLOps world functions, what roles
participate, what the titles are. It's really fragmented and just forming right now.
So you have several types of people.
We can talk about them in depth right now,
depending on where you want to dive in.
But you have ML engineers, you have data scientists,
you have data engineers, you have the ops guy,
and you have researchers somehow included.
And right now, the whole industry is moving in the direction of putting that into a
production scenario you can compare to a factory.
But this is not how data science should get into production, because data science has very different characteristics
than the normal waterfall production scenario we used to have in software engineering.
Yeah, so when we talk a little bit about MLOps, we always talk about pipelines. And like you
mentioned, there are different roles during each stage of the pipeline.
So how do you integrate all these different roles in the different stages of a pipeline?
I mean, all pipelines are not equal, right?
So there are simple pipelines and a lot more complex pipelines.
But how do you kind of correlate the roles with the framework, so to speak, you're presenting?
So for that, we need to understand what roles are in there.
So let's start, for example, with the data engineer.
The data engineer is somebody who's taking care of the data,
is filling some nulls, shaping it up for the next station.
And then, as I say, at some point you have to hand it over
to the next person at the next station. That would be the ML researcher, also
called the data scientist. This is just for you to see the whole process and the steps in between. So
the data engineer gives it to the ML researcher,
and the ML researcher then uses different tools, PyTorch or TensorFlow, to train the model.
And then, again, there is a station, giving it over either to an ML engineer in between,
depending on how the company is set up, or directly to an ops guy
who's bringing it into production on Kubernetes or wherever. So you have different stages, you have
different phases of ownership for the machine learning pipeline, and then in the end you
hopefully have it in production. But if it then breaks, nobody knows who owns the whole thing.
So this is why it's very, very hard to set it up like a production line in a factory,
like I just imagined it. But it's another layer on top, because in data science so many things are
changing in between. The data is changing constantly,
so you can't walk through the same process every time.
The accuracy or the results of the models are changing,
so you have to have a loop back,
and sometimes the loop is going to the very beginning,
sometimes the loop is just going one step before
and redoing the last step.
So it's a very fragile setting
and you can't compare it to a production line
like the one Henry Ford invented.
But this is exactly where we would like to get the data scientist
in the center and give him or her full ownership over
the whole production line. And we don't call it a production line anymore; we call it a machine learning pipeline. But let's dive into that. I'm super curious where you want to take that, Frederic.
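To make the station-to-station handoff concrete, here is a minimal Python sketch of a pipeline in which each role owns one step and hands its artifact to the next station. The step names and the tiny runner are hypothetical illustrations of the pattern Adam describes, not ZenML's actual API.

```python
# Hypothetical sketch: each role's work is one step, and the artifact is
# handed from station to station, like the handoffs described above.

def ingest_and_clean(raw: list) -> list:
    """Data engineer's station: fill nulls and shape the data."""
    return [{**row, "value": row.get("value") or 0.0} for row in raw]

def train_model(rows: list) -> dict:
    """ML researcher's station: train a toy model (here, a mean predictor)."""
    values = [r["value"] for r in rows]
    return {"prediction": sum(values) / len(values)}

def deploy(model: dict) -> str:
    """Ops station: package the model for serving (stubbed out here)."""
    return f"deployed model predicting {model['prediction']:.2f}"

def run_pipeline(raw: list, steps: list) -> object:
    """Run each station in order, handing its output to the next one."""
    artifact: object = raw
    for step in steps:
        artifact = step(artifact)
    return artifact

print(run_pipeline([{"value": 1.0}, {"value": None}],
                   [ingest_and_clean, train_model, deploy]))
```

The point of the sketch is the single chain of ownership: when one runner owns every stage, it is obvious where a break happened, which is exactly what gets lost when the handoffs happen between teams instead of between steps.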
Right. I think, you know, in the AI world, a lot of things are changing, right? The frameworks are changing,
the tools are changing, the hardware is getting faster and faster, there's specialized hardware.
And so there's continuous change. I would even say that it's extremely difficult to keep up with all the changes, right? And certainly when you read a little bit about MLOps, you will see that there are more recent views on
MLOps with newer tools. How do you integrate that all into the framework? How do you keep
the framework moving with all the changes? And maybe that's a better way to frame it.
Yes, that's very interesting. So in software engineering, everything is about technical debt, right? And machine learning in particular: I read a quote which says machine learning is the high-interest credit card of technical debt. Another point
I would like to bring in is that
we are bringing together
the whole fragmented space
of machine learning.
And as you said,
models are changing,
tools are popping up every week,
which are really great,
but somehow focus on only one vertical
or one step in the machine learning pipeline.
And it's very interesting to bring these together in another layer of abstraction. And this is where
we think machine learning pipelines should be owned, on another abstraction layer.
And you don't need to know every detailed step, so you don't need to use
Kubernetes directly or know in detail what feature stores are doing.
It's like a pilot who doesn't know how the plane was built, or the runway, or
the airport, but uses them and owns the whole process, because they are the captain
of the plane. That's very similar to how we see the data scientists.
They are using the infrastructure, which is already there,
but on a different abstraction layer.
So that's how we bring that together.
Right. I mean, it's a question we get a lot when we talk to organizations
that don't have anything from an AI perspective, but they want to get started.
The fact that everything changes all the time is overwhelming, right?
Because they feel that by the time they get started on AI and have finished all their meetings, the world has moved on and they're working in the past. So when you talk to customers, how do you help them
get started? What is the best way for a new company that wants to do AI,
has a great idea around AI, and maybe has some data engineers and so on? Because a lot of people just don't know how to get started.
Yes, definitely. That's a big problem. And
sometimes they are just blocked by their own legacy.
This is also a fear we would like to take away, because as you
mentioned, we are an open source tool and we integrate into existing systems,
so we also integrate with your legacy.
If you are forced by your corporate IT department to use a particular
cloud provider, AWS or GCP or whatever,
or you are forced to use some other tools
which are already out there, we could integrate them
and bring them into a common framework
where you can start using them right away.
So you don't need to change or rewrite your whole code,
you can still use your old tools,
but you have the possibility to scale. For example, if you did everything locally,
with the flip of a switch you can then deploy it on a cloud. This is very interesting for users
and customers, as they are very often blocked by existing solutions in their legacy systems.
Yeah, I think one of the weaknesses of MLOps is it's not something you can buy, right?
You can buy hardware, you can buy tools, but processes like MLOps, you know, you can't
really buy that.
Was one of the ideas behind ZenML really to kind of guide users through the whole MLOps process? I mean,
you can read books about it, but it's like anything else. You need some experience,
you need to have made some failures in the past because you learn more from failures than from
successes. Is that where you help the most, on the MLOps part?
Or what would you say you bring to the table with the framework?
Yes, exactly.
So the problem we were facing was that we were using many tools which were out there,
but the glue code, the artifact tracking, the metadata tracking in between was done manually,
or we had to write
some glue code ourselves. And that was the big challenge for us, because it was super manual;
you couldn't automate it very well, because every tool has different outputs
and different inputs that the next tool would need.
These were exactly the challenges
we had before.
And this is why we were abstracting with ZenML.
By the way, I don't want to promote ZenML
too much right now, because I'm really hoping
to dive into the problems and share our understanding of
how we were thinking and what problems we had. So this is why we thought we would need something
on a higher level, which will take these problems apart and solve them individually.
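As a rough illustration of that manual glue code, here is a sketch of a decorator that records each step's parameters and a fingerprint of its output in a central store, the kind of artifact and metadata tracking that otherwise has to be hand-written between every pair of tools. The decorator and record format are assumptions for illustration, not ZenML internals.

```python
# Hypothetical sketch: automate the metadata/artifact tracking that would
# otherwise be manual glue code between pipeline tools.
import functools
import hashlib
import json
import time

METADATA_STORE: list = []  # stand-in for a real metadata database

def tracked(step_fn):
    """Record each step's name, parameters, and output fingerprint."""
    @functools.wraps(step_fn)
    def wrapper(*args, **kwargs):
        result = step_fn(*args, **kwargs)
        METADATA_STORE.append({
            "step": step_fn.__name__,
            "timestamp": time.time(),
            "params": kwargs,
            # Fingerprint the output so later runs can be compared.
            "output_sha256": hashlib.sha256(
                json.dumps(result, sort_keys=True, default=str).encode()
            ).hexdigest(),
        })
        return result
    return wrapper

@tracked
def normalize(rows, scale=1.0):
    return [r * scale for r in rows]

normalize([1, 2, 3], scale=0.5)
print(json.dumps(METADATA_STORE, indent=2))
```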
And you mentioned one problem in the beginning,
the reproducibility. If you cannot reproduce an experiment, you cannot improve it, because
if the data is changing, or whatever else is changing, like hyperparameters, you don't know whether you
now performed better or whether it was just luck. And this control process was something
we definitely needed for our predictive maintenance models, for example.
And in the future, it will be super interesting for corporates or for bigger companies who are forced by law to do audits. Already from the get-go, we are enabling this auditability: you can go back in time and see how your algorithm
or your model was fed, by which data, by which hyperparameters.
You have maybe a YAML file, which writes down
all the relevant characteristics of the experiment.
And then you can go back in time and see when your model drifted away,
who is responsible, and how you can improve the process in the
future.
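As a sketch of what such a per-run record might look like, the snippet below writes one out as YAML and reads it back. Every field name and path here is made up for illustration; it is not a ZenML schema.

```python
# Hypothetical experiment record: enough to replay which data and
# hyperparameters fed a given model, for reproducibility and audits.
import yaml  # pip install pyyaml

experiment = {
    "pipeline": "predictive_maintenance",  # hypothetical pipeline name
    "run_id": "2021-11-02T10-15-00",
    "data": {
        "source": "s3://example-bucket/fleet-telemetry/",  # hypothetical path
        "snapshot_sha256": "d2a84f4b8b650937",  # fingerprint of the input data
    },
    "hyperparameters": {"learning_rate": 0.001, "epochs": 20},
    "metrics": {"accuracy": 0.93},
}

with open("run_record.yaml", "w") as f:
    yaml.safe_dump(experiment, f, sort_keys=False)

# Later, an auditor can reload the record and see exactly what fed the model.
with open("run_record.yaml") as f:
    print(yaml.safe_load(f)["hyperparameters"])
```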
Right. I mean, I think there are a couple of things that are important.
The first is the fact that you can repeat and create kind of a baseline to improve on.
The second one,
and what I hear a lot, is people that have relative success experimenting and prototyping
but fail on the production side. Bringing a model from prototype to production
is a lot more difficult than people expect.
When you talk about your framework, does your framework then go full cycle,
experimenting, prototyping to production and then feeding data back?
Is it like a full circle process?
Yes, definitely.
It's not an end-to-end platform because these tend to be very opinionated.
What we call ourselves is a framework, an MLOps framework, going from data sourcing to
deployment. And so this is exactly where we see ourselves. And yes, we are covering the whole
process. Yeah. It seems like you're bringing a
level of maturity to MLOps that sometimes we don't even see in DevOps. I mean, it does seem
analogous to DevOps, but it seems like what you're talking about really is more mature, I guess is
the right word for it, business processes. Do you see yourselves in some ways,
I don't want to say in competition, but competing for mindshare with the DevOps trend,
with IT ops and application development all focused on that, and that getting a lot of the
press? I would say we learn from them. So what DevOps was 20 years ago, or how it developed over the last 20 years, is what will now happen to MLOps.
Back then it was super fragmented, and tools, like Terraform or whatever, were able to bring everything that is needed together quite well and make it accessible for everyone. And this is what we imagine doing with ZenML,
to really have this abstraction layer that everyone can
understand, the data scientist in particular.
And with that, you are also able to dive into the DevOps world.
So we don't see them as competition at all.
We would like to integrate and give the data scientists the possibility now to use Kubernetes,
for example, with their current skill set.
So that's why everything is connected and hopefully will benefit from each other.
Another area where I see MLOps being sort of trapped between a rock and a hard place is that in many
ways, data scientists and, you know, people trying to roll out ML models are stuck between operations
and the lines of business at a company. So you have the demands, as you mentioned, for example,
you know, mobility company or utility or, you know, whatever you are, whatever the business
really is, making demands on the machine learning model. And then you also have IT operations trying
to translate that into production. And in many ways, I feel like MLOps gets stuck in the middle
and has to translate between these two people who frankly, really don't understand each other.
Do you experience that? And how does having a mature framework help to alleviate that problem?
Yes, definitely.
We can see that problem.
So what we saw is that data scientists wanted to bring their machine learning models into
production and gave it over to the ops team.
And the ops team was then not able to translate everything that was done by the data scientist,
who has a bit of domain knowledge and knows the business case a bit better. The ops guys were
not really able to translate every detail into code. So they had to make it production ready, which is a loss of information
or a loss of quality of the model in the end.
Just a simple fact,
if they need to transfer it from Python to C++,
for example,
you cannot translate every bit of creativity
the data scientist had,
with the business case in mind,
into the production scenario.
And this was super, super frustrating
for the data scientist, who would basically
like to own the whole pipeline,
because they created its core
in the experimenting phase.
So this is how we saw the problems coming up until now
and that they don't understand each other
and don't speak the same language.
And this is why we saw some potential
for another tool out there,
which is trying to bring them to the same level.
Yeah, so with the framework,
you kind of create an abstraction layer
where one person can have a complete overview
of the pipeline without really having
to understand all the different components.
And I presume that it's also a lot easier then to pick pieces of the pipeline, and you
can optimize portions of the pipeline as you go.
Was that part of the strategy too?
I guess with the AI market moving so fast that you want to be able to
switch out, you know, Docker or Docker Compose with Kubernetes or whatever comes after, I
presume, right?
Is the framework as, how should I say, as modular as it looks?
Yes, that was very important from the beginning
of development: we are just a bit opinionated,
so that you have a default.
You can fall back on it if you don't care how it's going to be deployed;
we can decide for you what deployment tool
to use.
But if you want to swap it out and bring in your own infrastructure,
that's completely possible.
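Here is a small sketch of that "flip of a switch": pipeline steps are written once, and the orchestrator backend is chosen by a config key. The class names and backend registry are hypothetical, illustrating only the pluggable-backend pattern, not ZenML's real stack mechanism.

```python
# Hypothetical sketch: swap where the pipeline runs without touching the steps.
from abc import ABC, abstractmethod

class Orchestrator(ABC):
    @abstractmethod
    def run(self, steps):
        ...

class LocalOrchestrator(Orchestrator):
    def run(self, steps):
        artifact = None
        for step in steps:
            artifact = step(artifact)
        return artifact

class KubernetesOrchestrator(Orchestrator):
    def run(self, steps):
        # A real backend would build images and submit jobs; stubbed here
        # to keep the sketch self-contained.
        raise NotImplementedError("would submit each step as a k8s job")

BACKENDS = {"local": LocalOrchestrator, "kubernetes": KubernetesOrchestrator}

def get_orchestrator(name: str) -> Orchestrator:
    """Pick the backend from config; the pipeline code stays unchanged."""
    return BACKENDS[name]()

pipeline = [lambda _: [1.0, 2.0, 3.0], lambda xs: sum(xs) / len(xs)]
print(get_orchestrator("local").run(pipeline))  # change "local" to swap backends
```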
And in these terms, why data science is so different is that you learn on the go as well.
Every time you run a model, the results are fed back and you learn again. In a traditional DevOps scenario,
you put in more power,
you make the roles more narrow,
and the gain is higher productivity.
That kind of output productivity is not so important
for machine learning pipelines,
because the ultimate goal is to have
a better model, and this is something you can only find out if you are experimenting a little bit,
and not just in the experimenting phase, but also in the deployment phase, in the conversion
towards production. So this is why you need somebody, and we think it's the data scientist, who can step back and interact with every part of the machine learning pipeline.
And just with that overview, they can optimize not just hyperparameters in the training phase, but the orchestra of all the pipeline steps which are playing
together. And this is the big difference from traditional software engineering, where
everything has to be more productive, to a new kind of research, where we now have a better output of better experiments.
So this is just the difference in how we see it.
And this is why we would like to set somebody in the position of having a big overview over
the whole process.
It's an interesting approach. I do like it a lot and
it's also important to note that you provide this as open source. So can you talk a little
bit about that? You know, why open source? Why not closed source, and so on?
Sure. So the analogy to DevOps from 20 years ago also applies to the business model.
I would say, or we think, that what SaaS was 10 or 15 years ago, open source will be from
now on, in the future.
Open source itself is not a business model.
It's more a mindset or a funnel, let's say.
You get a lot of credibility
because everyone knows your code.
You have more trust because, again,
everyone can see your code
and can tune it a little bit and change it.
But the outreach is way better:
people trust these models, or the whole
framework, let's say ZenML right now, way better. And with that, we will for sure have another
business model behind it which will sustain us financially. But the idea of creating an open source framework is to reach way more people.
And we also know that 99% of our users won't ever pay for our product,
but that's fine.
Other open source companies have shown that you have
to change the world first, and then you can monetize just a fraction of it. This is
exactly what we imagine, and we also think that open source is a really fair and innovative
way of collaborating with the community. Yeah, I think that's important. I mean,
if you can get a lot of feedback from your own users, it kind of fits the AI principle, right?
You use the data from your own users to improve your own products.
So that's kind of a nice analogy.
But do you expect then the development to continue through the community
or do you expect to do the development yourself
with the help of the community?
There are two extremes.
One is that you develop what is expected of you.
For example, if you have big customers
but you're still developing open source,
you will get driven in one direction:
you might tailor towards that one customer,
but not for everyone.
On the other side, if you're completely listening to the community,
it might be very noisy.
So many requests will be coming in.
So what we try to find,
and are currently doing,
which you can see in the roadmap
in our GitHub repository,
is a mix, but weighted a bit more towards the community.
So we don't care about the monetization
and the corporates right now,
but everyone who is in the community
is somehow affiliated anyhow with the corporate
because how many hobby projects have you done
when you were bringing machine learning into production?
So in the end, everyone is also associated with a company
with big data in the background. But that's our path of our business model or outreach.
Yeah, it does take some care when you're developing open source projects like that,
because of course, if you listen to the users, like you said, you may get a vocal minority that wants to take something in one direction, whereas you may see maybe a bigger picture use case that is more focused in this
direction. Of course, you also have to be careful because you don't want to not listen to the open
source users and go in that direction when they really need you to go there. So I think it does
take a very strong leader and strong leadership on the part of the company in order to focus an open source project
and incorporate the lessons of the users without ignoring them. I was curious to ask you:
within the companies that are using a ZenML-based solution, who's driving it? Is it IT operations? Is it data science and machine
learning? Or is it the lines of business at this point? That's very interesting. So currently,
many companies are still in the research phase, and the business case behind it is not so defined yet.
Some of them are really bringing it into production and earning
money with it.
But what we saw so far is that the main drivers are either the machine learning engineers
or sometimes MLOps teams that are out there. And
normally you would think that a DevOps engineer would be the one who is
helped the most, because they no longer have to bring into production
whatever was thrown over the
fence by the data scientist; now they can relax, just plug their infrastructure into
the framework, and chill. These guys are super happy, but they are not the drivers.
So the drivers are really the data scientists or the machine learning engineers, who are now motivated to bring their use case into the production scenario as close to the experiment as possible.
So they don't have any loss of information until they really bring it into production.
Because there is no DevOps engineer who is shredding the code just to make it production ready.
And these are the drivers.
And most of the time, the data scientists are also the ones who have the business case in mind,
like the product owners or whoever, because they know the domain. They have a PhD in physics and
know way better what you can do with it than an engineer who is just bringing it into production.
So this is why we are also putting the data scientists in the center and making them fully
the owner of the whole machine learning pipeline, and this is why they are super happy with it: their work gets translated into production as well as possible. Yeah, we talked a
little bit about MLOps and that things are changing. I mean, for you personally, where do
you think MLOps should improve moving forward? So what we can see right now: it's a very, very noisy field. Tools are popping up every week,
which is good because the best tools will win, but there won't be one winner. So if you see an
AWS SageMaker or a big player in the game, maybe from a cloud provider, it is not a winner-takes-all market.
But it doesn't matter whether it's fragmenting more
or consolidating more.
We can see both directions already.
For example, Feast was bought by Tecton.
It's a feature store.
One is open source, the other is closed source.
So they are consolidating on one side.
On the other side, tools are popping up,
fragmenting more,
but what is always needed is a tool
which brings them together to avoid the glue code,
a framework, like ZenML.
And this is why, no matter how this whole landscape
will change, there will be a need for our
tool or similar tools.
We don't claim to be the best, even though we are.
No, but this is the idea behind it: the MLOps market is super, super vivid right now. Yeah. And I think it's great to hear the kind of approach that
you've got with looking at the community, looking at open source, not worrying so much about
monetization, hoping that you can build the best tool for the job and then that the job will adopt
the tool down the road. And also, I love the fact that you both come
from a very practical background
of trying to actually develop an ML application,
not just coming to it,
sort of hoping to build a tool for people,
you know, abstract people.
You're kind of building something for yourselves in a way,
which I really, really appreciate.
So, well, thank you so much for this discussion.
It's really been interesting.
But the time has come for us to transition into stage two of the Utilizing AI podcast. As you have been warned, every episode this season we ask our guests three questions. And note to listeners, our guest has not been prepped on these questions ahead of time. So
this is going to be really off the cuff and hopefully a little bit of fun. This season,
we're also changing things up. I'm going to ask a question, as is Frederic, but we're also going
to have a question from a special guest. So to start things off, Frederic, do you want to ask
yours first? Sure. So is MLOps a lasting trend or just a step on the way
for ML and DevOps to become normal?
Very nice question.
So MLOps will be needed just as DevOps has been
needed for 20 years.
So it's going to be the underlying necessity
for every machine learning development,
because otherwise it doesn't scale.
Thank you for that.
My question, and again, this is one of those that we've asked a few times,
how big do you see ML models getting?
Today we have hundred-billion-parameter models.
Is that going to look small in the future, or have we reached some kind of limit?
Also a nice question.
No, it will explode.
It will keep on exploding.
Like, just check out the GPT development
from one to two to three.
Four will be a magnitude higher,
and the data that was collected in the last two years is more
than in all of history before. So this is why it will keep on, and everything has to keep up as well, including the MLOps.
Well, thanks for that. I think that's what we've heard a couple of times here, and not just from the companies making the chips, I might add. Finally, as promised, we're going to have a question from outside the podcast. We are bringing in the editor for Gestalt IT, Zach DeMeyer, with a question.
Hi, Utilizing AI. I'm Zach DeMeyer, writer here at Gestalt IT.
And I have a question for you.
What's the most innovative use of AI
you've seen in the real world?
Currently, I think it's autonomous driving.
The models are continuously in production,
shadowing themselves in cars, and being swapped out.
It's incredible what, for example,
Andrej Karpathy is doing:
he was doing that in research,
and now he's doing it in real life.
So this is, I think, the most impressive use of AI currently
because it's a ton of data.
Thousands, tens of thousands of cars
are sending high-quality videos to the cloud, and they are
continuously training four or five models in parallel. And that just needs to scale. And
this is what's impressive, and it also inspired us to build a framework doing similar things, but at a
different scale. I'm glad that you were able to come up with something off the
cuff here, on the fly. We look forward also to hearing what question you might have for a future guest.
So afterwards, we'll record one if you have one. And if you, the listener, want to join this,
you can. Just send us an email at host@utilizing-ai.com, and we'll record your question
for a future guest. So Adam, thank you for joining us today.
Where can people connect with you and follow your thoughts on enterprise AI
applications and other topics?
Sure. So we would love to create a big Slack community,
so please join our Slack channel. You can find everything,
including the invitation link, on our GitHub repo. It's zenml-io/zenml on GitHub. And please find me on LinkedIn, connect with me, and we are super
happy to get in touch and build that framework together. Great. And we'll include that link in
the show notes so folks can just click right on through. How about you, Frederic? What's going on
in your life? Well, I'm still doing consulting and services in the HPC and AI market, but currently I'm
working on a design for a large-scale GPU cluster for a customer, so that's keeping me busy.
You can find me on LinkedIn and on Twitter as @FredericVHaren. Excellent. And as for me,
you can find me on most social media networks at S Foskett.
I will point out that this week is our Cloud Field Day event.
So if you go to techfieldday.com, you'll be able to see a little bit of me talking to
some of the leading companies that are deploying cloud technologies in the enterprise.
So thank you for listening to the Utilizing AI podcast.
If you enjoyed this discussion, please do subscribe in your favorite podcast application. You can also find us on YouTube. You can review the show on iTunes as well. That does really help. And please do share the show in your favorite MLOps community or with your friends. This podcast is brought to you by gestaltit.com, your home for IT coverage from
across the enterprise. But for show notes and more episodes, you can go to utilizing-ai.com,
or you can find us on Twitter at utilizing underscore AI. Thanks for listening, and we'll
see you next week.