Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x09: Focusing MLOps on the Data Scientist with Adam Probst of ZenML
Episode Date: November 2, 2021
Many data scientists and ML engineers have faced the challenge of putting AI models into production, and this is the core of MLOps. In this episode, Adam Probst, Co-Founder of ZenML, joins Frederic Van Haren and Stephen Foskett to discuss the challenges of putting ML models into production. Machine learning pipelines are inherently complex and fragile and require feedback and tuning, and this requires a new approach with continuous improvement and tight integration. Although reminiscent of DevOps, MLOps demands even more collaboration between IT operations, developers and data scientists, and lines of business. ZenML provides ready-to-use MLOps infrastructure to these groups so they can focus on the model rather than the platform.
Three Questions
Stephen: How big can ML models get? Will today's hundred-billion parameter model look small tomorrow, or have we reached the limit?
Frederic: Is MLOps a lasting trend or just a step on the way to ML and DevOps becoming normal?
Zach DeMeyer: What's the most innovative use of AI you've seen in the real world?
Guests and Hosts
Adam Probst, Co-Founder, ZenML. Connect with ZenML on GitHub, LinkedIn and on Twitter @zenml_io.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.
Date: 11/02/2021
Tags: @zenml_io, @SFoskett, @FredericVHaren
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics.
So over the last few months,
and actually the last years of this podcast,
we've talked quite a lot about AI ops,
which is the use of AI to support enterprise operations in IT. We've also talked a lot about
MLOps, which is essentially about improving the structure and operations
around machine learning. Frederic, I think these two terms can be a little confusing.
And I think that some people,
especially on the IT side of things,
don't really understand what MLOps is.
Right, a lot of work goes into building models,
and the real challenge is to go from prototype
and experimenting to production in a structured fashion.
And so MLOps is really a mechanism to help provide a repeatable and
structured way to build models, which is really key, right? It doesn't make a lot of sense if
you can build a model, but you don't have the ability to rebuild that model over and over and
over at will. And so I think today would be good to hear from ZenML
how this can be done in a repeatable
and controlled manner.
And luckily they actually use open source.
So I would really be interested to hear more
about their framework.
Absolutely.
And so as Frederic mentioned, we're joined here
from ZenML by Adam Probst.
Adam, why don't you go ahead and introduce yourself
a little bit and then we'll dive in
and talk a little bit more about MLOps. Yes, for sure. Hi, Stephen. Hi, Frederic. I'm
Adam. I'm a co-creator of ZenML, and we were facing exactly these problems in our earlier
startups. We were using predictive maintenance for vehicle and maintenance optimization in big
commercial vehicle fleets, and then figured out
that the much bigger problem we were solving was not preventing the trucks from breaking down,
but bringing the machine learning models we built into production. And we didn't do that
just once or twice, but a hundred times, and this is when we saw that there
is a much bigger thing out there that needs to be solved.
And not just for predictive maintenance, but so many other AI use cases.
And this is how we dove into MLOps.
Yeah, it does seem like, well, there's almost a stigma attached to machine learning and
data science that it's sort of a game for academics or, I don't know,
an experiment, not a real production application for business. And my understanding is that you
sort of went through that learning process as well, and you realized that making this part of
the business, making this a real application to do real work, was the real work
that needed to be done. Is that right? Yes, exactly. Definitely.
First of all, you have to understand how the whole MLOps world functions, what roles
participate, what the titles are. It's really fragmented and just forming right now.
So you have several types of people.
We can talk about them in depth right now,
depending on where you want to dive in.
But you have ML engineers, you have data scientists,
you have data engineers, you have the ops guy,
and you have researchers somehow included.
And right now, the whole industry is moving in the direction of putting that into a
production scenario you can compare to a factory.
But this is not how data science should get into production, because data science has very different characteristics
than the normal waterfall production scenario we used to have in software engineering.
Yeah, so when we talk a little bit about MLOps, we always talk about pipelines. And like you
mentioned, there are different roles during each stage of the pipeline.
So how do you integrate all these different roles in the different stages of a pipeline?
I mean, all pipelines are not equal, right?
So there are simple pipelines and a lot more complex pipelines.
But how do you kind of correlate the roles with the framework, so to speak, you're presenting?
So for that, we need to understand what roles are in there.
So let's start, for example, with the data engineer.
The data engineer is somebody who's taking care of the data,
is filling some nulls, shaping it up for the next station.
And then, as I say, at some point you have to hand it over
to the next person at the next station. That would be the ML researcher, also
called the data scientist. This is just for you to see the whole process and the steps in between. So
the data engineer gives it to the ML researcher,
and the ML researcher then uses different tools, PyTorch or TensorFlow, to train the model.
And then, again, there is a station, giving it over either to an ML engineer in between,
depending on how the company is set up, or directly to an ops guy
who's bringing it into production on Kubernetes or wherever. So you have different stages, you have
different phases of ownership for the machine learning pipeline, and then in the end you
hopefully have it in production. But if it then breaks, nobody knows who owns the whole thing.
So this is why it's very, very hard to set it up like a production line in a factory,
like I just imagined it. But it's another layer on top, because in data science so many things are
changing in between. The data is changing constantly,
so you can't walk through the same process every time.
The accuracy or the results of the models are changing,
so you have to have a loop back,
and sometimes the loop is going to the very beginning,
sometimes the loop is just going one step before
and redoing the last step.
So it's a very fragile setting
and you can't compare it to a production line
like the one Henry Ford invented.
But this is exactly where we would like to get the data scientist
in the center and give him or her full ownership over
the whole production line. And we don't call it a production line anymore; we call it a machine learning pipeline. But let's dive into that. I'm super curious where you want to take that, Frederic.
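To make the station-to-station handoff concrete, here is a minimal Python sketch of a pipeline in which each role owns one step and hands its artifact to the next station. The step names and the tiny runner are hypothetical illustrations of the pattern Adam describes, not ZenML's actual API.

```python
# Hypothetical sketch: each role's work is one step, and the artifact is
# handed from station to station, like the handoffs described above.

def ingest_and_clean(raw: list) -> list:
    """Data engineer's station: fill nulls and shape the data."""
    return [{**row, "value": row.get("value") or 0.0} for row in raw]

def train_model(rows: list) -> dict:
    """ML researcher's station: train a toy model (here, a mean predictor)."""
    values = [r["value"] for r in rows]
    return {"prediction": sum(values) / len(values)}

def deploy(model: dict) -> str:
    """Ops station: package the model for serving (stubbed out here)."""
    return f"deployed model predicting {model['prediction']:.2f}"

def run_pipeline(raw: list, steps: list) -> object:
    """Run each station in order, handing its output to the next one."""
    artifact: object = raw
    for step in steps:
        artifact = step(artifact)
    return artifact

print(run_pipeline([{"value": 1.0}, {"value": None}],
                   [ingest_and_clean, train_model, deploy]))
```

The point of the sketch is the single chain of ownership: when one runner owns every stage, it is obvious where a break happened, which is exactly what gets lost when the handoffs happen between teams instead of between steps.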
Right. I think, you know, in the AI world, a lot of things are changing, right? The frameworks are changing,
the tools are changing, the hardware is getting faster and faster, there's specialized hardware.
And so there's continuous change. I would even say that it's extremely difficult to keep up with all the changes, right? And certainly when you read a little bit about MLOps, you will see that there are more recent views on
MLOps with newer tools. How do you integrate that all into the framework? How do you keep
the framework moving with all the changes? And maybe that's a better way to frame it.
Yes, that's very interesting. So in software engineering, everything is about technical debt, right? And machine learning in particular: I read a quote which says machine learning is the high-interest credit card of technical debt. Another point
I would like to bring in is that
we are bringing together
the whole fragmented space
of machine learning.
And as you said,
models are changing,
tools are popping up every week,
which are really great,
but somehow focus on only one vertical
or one step in the machine learning pipeline.
And it's very interesting to bring these together in another layer of abstraction. And this is where
we think machine learning pipelines should be owned, on another abstraction layer.
And you don't need to know every detailed step, so you don't need to use
Kubernetes directly or know in detail what feature stores are doing.
It's like a pilot who doesn't know how the plane was built, or the runway, or
the airport, but uses them and owns the whole process, because they are the captain
of the plane. That's very similar to how we see the data scientists.
They are using the infrastructure, which is already there,
but on a different abstraction layer.
So that's how we bring that together.
Right. I mean, it's a question we get a lot when we talk to organizations
that don't have anything from an AI perspective, but they want to get started.
The fact that everything changes all the time is overwhelming, right?
Because they feel that by the time they get started on AI and have finished all their meetings, the world has moved on and they're working in the past. So when you talk to customers, how do you help them
get started? What is the best way for a new company that wants to do AI,
has a great idea around AI, and maybe has some data engineers and so on? Because a lot of people just don't know how to get started.
Yes, definitely. That's a big problem. And
sometimes they are just blocked by their own legacy.
This is also a fear we would like to take away, because as you
mentioned, we are an open source tool and we integrate into existing systems,
so we also integrate with your legacy.
If you are forced by your corporate IT department to use a particular
cloud provider, AWS or GCP or whatever,
or you are forced to use some other tools
which are already out there, we could integrate them
and bring them into a common framework
where you can start using them right away.
So you don't need to change or rewrite your whole code,
you can still use your old tools,
but you have the possibility to scale. For example, if you did everything locally,
with the flip of a switch you can then deploy it on a cloud. This is very interesting for users
and customers, as they are very often blocked by existing solutions in their legacy systems.
Yeah, I think one of the weaknesses of MLOps is it's not something you can buy, right?
You can buy hardware, you can buy tools, but processes like MLOps, you know, you can't
really buy that.
Was one of the ideas behind ZenML really to kind of guide users through the whole MLOps process? I mean,
you can read books about it, but it's like anything else. You need some experience,
you need to have made some failures in the past because you learn more from failures than from
successes. Is that where you help the most, on the MLOps part?
Or what would you say you bring to the table with the framework?
Yes, exactly.
So the problem we were facing was that we were using many tools which were out there,
but the glue code, the artifact tracking, the metadata tracking in between was done manually,
or we had to write
some glue code ourselves. And that was the big challenge for us, because it was super manual;
you couldn't automate it very well, because every tool has different outputs
and different inputs that the next tool would need.
These were exactly the challenges
we had before.
And this is why we were abstracting with ZenML.
By the way, I don't want to promote ZenML
too much right now, because I'm really hoping
to dive into the problems and share our understanding of
how we were thinking and what problems we had. So this is why we thought we would need something
on a higher level, which will take these problems apart and solve them individually.
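As a rough illustration of that manual glue code, here is a sketch of a decorator that records each step's parameters and a fingerprint of its output in a central store, the kind of artifact and metadata tracking that otherwise has to be hand-written between every pair of tools. The decorator and record format are assumptions for illustration, not ZenML internals.

```python
# Hypothetical sketch: automate the metadata/artifact tracking that would
# otherwise be manual glue code between pipeline tools.
import functools
import hashlib
import json
import time

METADATA_STORE: list = []  # stand-in for a real metadata database

def tracked(step_fn):
    """Record each step's name, parameters, and output fingerprint."""
    @functools.wraps(step_fn)
    def wrapper(*args, **kwargs):
        result = step_fn(*args, **kwargs)
        METADATA_STORE.append({
            "step": step_fn.__name__,
            "timestamp": time.time(),
            "params": kwargs,
            # Fingerprint the output so later runs can be compared.
            "output_sha256": hashlib.sha256(
                json.dumps(result, sort_keys=True, default=str).encode()
            ).hexdigest(),
        })
        return result
    return wrapper

@tracked
def normalize(rows, scale=1.0):
    return [r * scale for r in rows]

normalize([1, 2, 3], scale=0.5)
print(json.dumps(METADATA_STORE, indent=2))
```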
And you mentioned one problem in the beginning,
the reproducibility. If you cannot reproduce an experiment, you cannot improve it, because
if the data is changing, or whatever else is changing, like hyperparameters, you don't know whether you
now performed better or whether it was just luck. And this control process was something
we definitely needed for our predictive maintenance models, for example.
And in the future, it will be super interesting for corporates or for bigger companies who are forced by law to do audits. Already from the get-go, we are enabling this auditability: you can go back in time and see how your algorithm
or your model was fed, by which data, by which hyperparameters.
You have maybe a YAML file, which writes down
all the relevant characteristics of the experiment.
And then you can go back in time and see when your model drifted away,
who is responsible, and how you can improve the process in the
future.
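As a sketch of what such a per-run record might look like, the snippet below writes one out as YAML and reads it back. Every field name and path here is made up for illustration; it is not a ZenML schema.

```python
# Hypothetical experiment record: enough to replay which data and
# hyperparameters fed a given model, for reproducibility and audits.
import yaml  # pip install pyyaml

experiment = {
    "pipeline": "predictive_maintenance",  # hypothetical pipeline name
    "run_id": "2021-11-02T10-15-00",
    "data": {
        "source": "s3://example-bucket/fleet-telemetry/",  # hypothetical path
        "snapshot_sha256": "d2a84f4b8b650937",  # fingerprint of the input data
    },
    "hyperparameters": {"learning_rate": 0.001, "epochs": 20},
    "metrics": {"accuracy": 0.93},
}

with open("run_record.yaml", "w") as f:
    yaml.safe_dump(experiment, f, sort_keys=False)

# Later, an auditor can reload the record and see exactly what fed the model.
with open("run_record.yaml") as f:
    print(yaml.safe_load(f)["hyperparameters"])
```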
Right. I mean, I think there are a couple of things that are important.
The first is the fact that you can repeat and create kind of a baseline to improve on.
The second one,
and what I hear a lot, is people that have relative success experimenting and prototyping
but fail on the production side. Bringing a model from prototype to production
is a lot more difficult than people expect.
When you talk about your framework, does your framework then go full cycle,
experimenting, prototyping to production and then feeding data back?
Is it like a full circle process?
Yes, definitely.
It's not an end-to-end platform because these tend to be very opinionated.
What we call ourselves is a framework, an MLOps framework, going from data sourcing to
deployment. And so this is exactly where we see ourselves. And yes, we are covering the whole
process. Yeah. It seems like you're bringing a
level of maturity to MLOps that sometimes we don't even see in DevOps. I mean, it does seem
analogous to DevOps, but it seems like what you're talking about really is more mature, I guess is
the right word for it, business processes. Do you see yourselves in some ways,
I don't want to say in competition, but competing for mindshare with the DevOps trend,
with IT ops and application development all focused on that, and that getting a lot of the
press? I would say we learn from them. So what DevOps was 20 years ago, or how it developed over the last 20 years, is what will now happen to MLOps.
Back then it was super fragmented, and tools, like Terraform or whatever, were able to bring everything that is needed together quite well and make it accessible for everyone. And this is what we imagine doing with ZenML,
to really have this abstraction layer that everyone can
understand, the data scientist in particular.
And with that, you are also able to dive into the DevOps world.
So we don't see them as competition at all.
We would like to integrate and give the data scientists the possibility now to use Kubernetes,
for example, with their current skill set.
So that's why everything is connected and hopefully will benefit from each other.
Another area where I see MLOps being sort of trapped between a rock and a hard place is that in many
ways, data scientists and, you know, people trying to roll out ML models are stuck between operations
and the lines of business at a company. So you have the demands, as you mentioned, for example,
you know, mobility company or utility or, you know, whatever you are, whatever the business
really is, making demands on the machine learning model. And then you also have IT operations trying
to translate that into production. And in many ways, I feel like MLOps gets stuck in the middle
and has to translate between these two people who frankly, really don't understand each other.
Do you experience that? And how does having a mature framework help to alleviate that problem?
Yes, definitely.
We can see that problem.
So what we saw is that data scientists wanted to bring their machine learning models into
production and gave it over to the ops team.
And the ops team was then not able to translate everything that was done by the data scientist,
who has a bit of domain knowledge and knows the business case a bit better. The ops guys were
not really able to translate every detail into code. So they had to make it production ready, which is a loss of information
or a loss of quality of the model in the end.
Just a simple fact,
if they need to transfer it from Python to C++,
for example,
you cannot translate every bit of creativity
the data scientist had,
with the business case in mind,
into the production scenario.
And this was super, super frustrating
for the data scientist, who would basically
like to own the whole pipeline,
because they created its core
in the experimenting phase.
So this is how we saw the problems coming up until now
and that they don't understand each other
and don't speak the same language.
And this is why we saw some potential
for another tool out there,
which is trying to bring them to the same level.
Yeah, so with the framework,
you kind of create an abstraction layer
where one person can have a complete overview
of the pipeline without really having
to understand all the different components.
And I presume that it's also a lot easier then to pick pieces of the pipeline, and you
can optimize portions of the pipeline as you go.
Was that part of the strategy too?
I guess with the AI market moving so fast that you want to be able to
switch out, you know, Docker or Docker Compose with Kubernetes or whatever comes after, I
presume, right?
Is the framework as, how should I say, as modular as it looks?
Yes, that was very important from the beginning
of development: we are just a bit opinionated,
so that you have a default.
You can fall back on it if you don't care how it's going to be deployed;
we can decide for you what deployment tool
to use.
But if you want to swap it out and bring in your own infrastructure,
that's completely possible.
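Here is a small sketch of that "flip of a switch": pipeline steps are written once, and the orchestrator backend is chosen by a config key. The class names and backend registry are hypothetical, illustrating only the pluggable-backend pattern, not ZenML's real stack mechanism.

```python
# Hypothetical sketch: swap where the pipeline runs without touching the steps.
from abc import ABC, abstractmethod

class Orchestrator(ABC):
    @abstractmethod
    def run(self, steps):
        ...

class LocalOrchestrator(Orchestrator):
    def run(self, steps):
        artifact = None
        for step in steps:
            artifact = step(artifact)
        return artifact

class KubernetesOrchestrator(Orchestrator):
    def run(self, steps):
        # A real backend would build images and submit jobs; stubbed here
        # to keep the sketch self-contained.
        raise NotImplementedError("would submit each step as a k8s job")

BACKENDS = {"local": LocalOrchestrator, "kubernetes": KubernetesOrchestrator}

def get_orchestrator(name: str) -> Orchestrator:
    """Pick the backend from config; the pipeline code stays unchanged."""
    return BACKENDS[name]()

pipeline = [lambda _: [1.0, 2.0, 3.0], lambda xs: sum(xs) / len(xs)]
print(get_orchestrator("local").run(pipeline))  # change "local" to swap backends
```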
And in these terms, why data science is so different is that you learn on the go as well.
Every time you run a model, the results are fed back and you learn again. In a traditional DevOps scenario,
you put in more power,
you make the roles more narrow,
and the gain is higher productivity.
That kind of output productivity is not so important
for machine learning pipelines,
because the ultimate goal is to have
a better model, and this is something you can only find out if you are experimenting a little bit,
and not just in the experimenting phase, but also in the deployment phase, in the conversion
towards production. So this is why you need somebody, and we think it's the data scientist, who can step back and interact with every part of the machine learning pipeline.
And just with that overview, they can optimize not just hyperparameters in the training phase, but the orchestra of all the pipeline steps which are playing
together. And this is the big difference from traditional software engineering, where
everything has to be more productive, to a new kind of research, where we now have a better output of better experiments.
So this is just the difference in how we see it.
And this is why we would like to set somebody in the position of having a big overview over
the whole process.
It's an interesting approach. I do like it a lot and
it's also important to note that you provide this as open source. So can you talk a little
bit about that? You know, why open source? Why not closed source, and so on?
Sure. So the analogy to DevOps from 20 years ago also applies to the business model.
I would say, or we think, that what SaaS was 10 or 15 years ago, open source will be from
now on, in the future.
Open source itself is not a business model.
It's more a mindset or a funnel, let's say.
You get a lot of credibility
because everyone knows your code.
You have more trust because, again,
everyone can see your code
and can tune it a little bit and change it.
But the outreach is way better:
people trust these models, or the whole
framework, let's say ZenML right now, way better. And with that, we will for sure have another
business model behind it which will sustain us financially. But the idea of creating an open source framework is to reach way more people.
And we also know that 99% of our users won't ever pay for our product,
but that's fine.
Other open source companies have shown that you have
to change the world first, and then you can monetize just a fraction of it. This is
exactly what we imagine, and we also think that open source is a really fair and innovative
way of collaborating with the community. Yeah, I think that's important. I mean,
if you can get a lot of feedback from your own users, it kind of fits the AI principle, right?
You use the data from your own users to improve your own products.
So that's kind of a nice analogy.
But do you expect then the development to continue through the community
or do you expect to do the development yourself
with the help of the community?
There are two extremes.
One is that you develop what is expected of you.
For example, if you have big customers
but you're still developing open source,
you will get driven in one direction:
you might tailor towards that one customer,
but not for everyone.
On the other side, if you're completely listening to the community,
it might be very noisy.
So many requests will be coming in.
So what we try to find,
and are currently doing,
which you can see in the roadmap
in our GitHub repository,
is a mix, but weighted a bit more towards the community.
So we don't care about the monetization
and the corporates right now,
but everyone who is in the community
is somehow affiliated anyhow with the corporate
because how many hobby projects have you done
when you were bringing machine learning into production?
So in the end, everyone is also associated with a company
with big data in the background. But that's our path of our business model or outreach.
Yeah, it does take some care when you're developing open source projects like that,
because of course, if you listen to the users, like you said, you may get a vocal minority that wants to take something in one direction, whereas you may see maybe a bigger picture use case that is more focused in this
direction. Of course, you also have to be careful because you don't want to not listen to the open
source users and go in that direction when they really need you to go there. So I think it does
take a very strong leader and strong leadership on the part of the company in order to focus an open source project
and incorporate the lessons of the users without ignoring them. I was curious to ask you:
within the companies that are using a ZenML-based solution, who's driving it? Is it IT operations? Is it data science and machine
learning? Or is it the lines of business at this point? That's very interesting. So currently,
many companies are still in the research phase, and the business case behind it is not so defined yet.
Some of them are really bringing it into production and earning
money with it.
But what we saw so far is that the main drivers are either the machine learning engineers
or sometimes MLOps teams that are out there. And
normally you would think that a DevOps engineer would be the one who is
helped the most, because they no longer have to bring into production
whatever was thrown over the
fence by the data scientist; now they can relax, just plug their infrastructure into
the framework, and chill. These guys are super happy, but they are not the drivers.
So the drivers are really the data scientists or the machine learning engineers, who are now motivated to bring their use case into the production scenario as close to the experiment as possible.
So they don't have any loss of information until they really bring it into production.
Because there is no DevOps engineer who is shredding the code just to make it production ready.
And these are the drivers.
And most of the time, the data scientists are also the ones who have the business case in mind,
like the product owners or whoever, because they know the domain. They have a PhD in physics and
know way better what you can do with it than an engineer who is just bringing it into production.
So this is why we are also putting the data scientists in the center and making them fully
the owner of the whole machine learning pipeline, and this is why they are super happy with it: their work gets translated into production as well as possible. Yeah, we talked a
little bit about MLOps and that things are changing. I mean, for you personally, where do
you think MLOps should improve moving forward? So what we can see right now: it's a very, very noisy field. Tools are popping up every week,
which is good because the best tools will win, but there won't be one winner. So if you see an
AWS SageMaker or a big player in the game, maybe from a cloud provider, it is not a winner-takes-all market.
But it doesn't matter whether it's fragmenting more
or consolidating more.
We can see both directions already.
For example, Feast was bought by Tecton.
It's a feature store.
One is open source, the other is closed source.
So they are consolidating on one side.
On the other side, tools are popping up,
fragmenting more,
but what is always needed is a tool
which brings them together to avoid the glue code,
a framework, like ZenML.
And this is why, no matter how this whole landscape
will change, there will be a need for our
tool or similar tools.
We don't claim to be the best, even though we are.
No, but this is the idea behind it: the MLOps market is super, super vivid right now. Yeah. And I think it's great to hear the kind of approach that
you've got with looking at the community, looking at open source, not worrying so much about
monetization, hoping that you can build the best tool for the job and then that the job will adopt
the tool down the road. And also, I love the fact that you both come
from a very practical background
of trying to actually develop an ML application,
not just coming to it,
sort of hoping to build a tool for people,
you know, abstract people.
You're kind of building something for yourselves in a way,
which I really, really appreciate.
So, well, thank you so much for this discussion.
It's really been interesting.
But the time has come for us to transition into stage two of the Utilizing AI podcast. As you have been warned, every episode this season we ask our guests three questions. And note to listeners, our guest has not been prepped on these questions ahead of time. So
this is going to be really off the cuff and hopefully a little bit of fun. This season,
we're also changing things up. I'm going to ask a question, as is Frederic, but we're also going
to have a question from a special guest. So to start things off, Frederic, do you want to ask
yours first? Sure. So is MLOps a lasting trend or just a step on the way
for ML and DevOps to become normal?
Very nice question.
So MLOps will be needed just as DevOps has been
needed for 20 years.
So it's going to be the underlying necessity
for every machine learning development,
because otherwise it doesn't scale.
Thank you for that.
My question, and again, this is one of those that we've asked a few times,
how big do you see ML models getting?
Today we have hundred-billion-parameter models.
Is that going to look small in the future, or have we reached some kind of limit?
Also a nice question.
No, it will explode.
It will keep on exploding.
Like, just check out the GPT development
from one to two to three.
Four will be a magnitude higher,
and the data that was collected in the last two years is more
than in all of history before. So this is why it will keep on, and everything has to keep up as well, including the MLOps.
Well, thanks for that. I think that's what we've heard a couple of times here, and not just from the companies making the chips, I might add. Finally, as promised, we're going to have a question from outside the podcast. We are bringing in the editor for Gestalt IT, Zach DeMeyer, with a question.
Hi, Utilizing AI. I'm Zach DeMeyer, writer here at Gestalt IT.
And I have a question for you.
What's the most innovative use of AI
you've seen in the real world?
Currently, I think it's autonomous driving.
The models are continuously in production,
shadowing themselves in cars, and being swapped out.
It's incredible what, for example,
Andrej Karpathy is doing:
he was doing that in research,
and now he's doing it in real life.
So this is, I think, the most impressive use of AI currently
because it's a ton of data.
Thousands, tens of thousands of cars
are sending high-quality videos to the cloud, and they are
continuously training four or five models in parallel. And that just needs to scale. And
this is what's impressive, and it also inspired us to build a framework doing similar things, but at a
different scale. I'm glad that you were able to come up with something off the
cuff here, on the fly. We look forward also to hearing what question you might have for a future guest.
So afterwards, we'll record one if you have one. And if you, the listener, want to join this,
you can. Just send us an email at host@utilizing-ai.com, and we'll record your question
for a future guest. So Adam, thank you for joining us today.
Where can people connect with you and follow your thoughts on enterprise AI
applications and other topics?
Sure. So we would love to create a big Slack community,
so please join our Slack channel. You can find everything,
including the invitation link, on our GitHub repo. It's zenml-io/zenml on GitHub. And please find me on LinkedIn, connect with me, and we are super
happy to get in touch and build that framework together. Great. And we'll include that link in
the show notes so folks can just click right on through. How about you, Frederic? What's going on
in your life? Well, I'm still doing consulting and services in the HPC and AI market, but currently I'm
working on a design for a large-scale GPU cluster for a customer, so that's keeping me busy.
You can find me on LinkedIn and on Twitter as @FredericVHaren. Excellent. And as for me,
you can find me on most social media networks at S Foskett.
I will point out that this week is our Cloud Field Day event.
So if you go to techfieldday.com, you'll be able to see a little bit of me talking to
some of the leading companies that are deploying cloud technologies in the enterprise.
So thank you for listening to the Utilizing AI podcast.
If you enjoyed this discussion, please do subscribe in your favorite podcast application. You can also find us on YouTube. You can review the show on iTunes as well. That does really help. And please do share the show in your favorite MLOps community or with your friends. This podcast is brought to you by gestaltit.com, your home for IT coverage from
across the enterprise. But for show notes and more episodes, you can go to utilizing-ai.com,
or you can find us on Twitter at utilizing underscore AI. Thanks for listening, and we'll
see you next week.