Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 10: Bringing DevOps Principles to MLOps with @GaetCast
Episode Date: October 27, 2020
Stephen Foskett and Andy Thurai discuss the parallels between DevOps and MLOps with Gaetan Castelein of Tecton. We are in the middle of a shift in analytics and software engineering, with DevOps and continuous deployment, and this is colliding with the development of data analytics and big data. Machine learning allows organizations to handle this explosion of data, build new applications, and automate new business processes, but MLOps must be converged with big data and DevOps tooling to make this a reality. One key enabler of this transformation is the creation of an ML feature store, which stores curated features for machine learning pipelines. Feature stores typically enable users to build features, have standardized feature definitions, run models using these curated features, and manage MLOps. This episode features: Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Andy Thurai, technology influencer and thought leader. Find Andy's content at theFieldCTO.com and on Twitter at @AndyThurai. Gaetan Castelein, VP of Marketing at Tecton (@TectonAI). Find Gaetan on Twitter at @GaetCast. Date: 10/27/2020 Tags: @SFoskett, @AndyThurai, @GaetCast, @TectonAI
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise
infrastructure together to discuss applications of AI in today's data center. Today, we're
discussing the application of DevOps principles to machine learning and MLOps. I'm Stephen Foskett,
organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter at
sfoskett. Now let's meet my co-host, Andy Thurai. Hello, I am Andy Thurai, founder and principal at
thefieldcto.com. You can find me on Twitter at @AndyThurai. All right. Hi. So I'm Gaetan Castelein, also known as GC. I'm head of marketing at Tecton.
I've been at the company now for about a year. At Tecton, we're building a data platform or a
feature store for machine learning. So we're very excited to be talking about that topic with you
today. And you can find us at tecton.ai. Okay, so today's conversation is an interesting one. So we had an enterprise IT
where we had all of the mature components, from mainframe to three-tier and all that messaging
service middleware, the whole nine yards. And then we moved from there into more of a cloud-based
economy, right? And everything was about infrastructure, everything about code, everything about automation. But now we are kind of in the nascent stages of moving into the
data economy. But the problem with that is that there's way too much data we are getting
in and we are not using it enough to get proper insights from it. One issue could be that
we don't have a good set of tools. Second issue could be that we don't know how to engage in the data economy. GC, what do you think?
What are the issues you're seeing? Yeah, I think it's a great question. So
generally speaking, we're seeing kind of a major shift in analytics and software engineering in
the sense that over the past decade or so,
we've made a lot of investments and progress
in software engineering.
We have all this DevOps tooling.
We can now deploy applications on a daily basis,
release code much faster than in the past.
And in parallel to that, there's been a lot of innovation
in big data and analytics.
It started with like this Hadoop transformation,
but there's Snowflake, the cloud data warehouse,
there's Spark, new processing engines.
So we've really kind of beefed up our ability
to manage and analyze vast amounts of data.
The problem that we see is these two stacks
have really evolved in different silos.
And now we're getting to the point where the volume, the speed of data, just becomes unmanageable with the human in the loop.
And that's really where machine learning comes in.
Machine learning allows you to generate insights from data at a scale where an individual human couldn't keep up.
But to really benefit from machine learning, we also need to get into a mode where we can
bring machine learning to production.
We really need to be able to take that human out of the loop and build new customer-facing
applications, automate new business processes using machine learning.
But that also requires a huge change. It requires us to take that analytics stack and kind of converge it with that
software engineering stack so that we're able to bring analytics to production.
And that's a big change.
And I think what's happening now,
what we're seeing is all this DevOps tooling that we've built for software
engineering, we don't have that for
machine learning and for analytics. And all of that tooling needs to really appear on the scene.
Otherwise, it is going to be increasingly difficult for us to really get machine learning
to production. And that's where this notion of bringing DevOps to machine learning and machine
learning data really comes in. So there are a couple of points in there that you talked about that I want to double-click on.
Let me ask you the first one: you talked about having a humongous amount of data, right?
That's a debatable term because, you know, some people will have megabytes of data and they say,
oh my God, I've got too much data. And then a fully automated digital economy,
for example, I could throw out names like Uber
is one example.
Not only do they have a ton of data,
even the number of models they cycle through, people don't realize that.
If I request, you know, an Uber service,
the number of models they run to predict,
to connect me with the driver,
the pricing models that they run through, how often, the number of models itself could be hundreds of models that they'll run.
So talk about those two. What do you mean by a large amount of data in the data economy? What do you think is a large amount for an enterprise for them to start thinking about? And second, how often do models get updated, and those kinds of things as well?
Yeah, a great question.
You know, I think large is obviously a very relative term,
but I think the point is that like the volume of data
is constantly increasing and increasing pretty fast.
And we now have, for example, streaming data
where data comes much faster
than we used to have it. And so the point where we talk about big data is really, I think, when
it becomes very difficult to generate all of the insight, squeeze all the value you can
from your data with human processes, right? And so that's where your automation comes into play.
On your question on Uber, yeah, it's fascinating.
So Tecton was founded by the original creators
of the Uber Michelangelo platform.
And five years ago, super difficult for Uber
to get even one model to production.
And then they decided to invest in this infrastructure for machine
learning, like an MLOps platform that took care of both models and
data.
And that allowed them to now deploy tens of thousands of models in
production and covering a very broad range of use cases like ETA
forecasting, like surge pricing, like fraud detection.
And it's kind of been a fascinating transformation.
And a lot of that was enabled by Michelangelo
because what Michelangelo did was bring those DevOps processes
to this process of building machine learning models
and enable data scientists to build new models
and get them to production quickly and reliably,
just like we do with software
today in most organizations.
So you talked about one specific element of the streaming data coming in at a speed, the
velocity that we have never seen before.
Do the digital enterprises have a problem because of too much of a velocity of the streaming
data that's coming in or because they have to figure out a way to integrate that with
the existing data and the models?
And also because obviously when the streaming data comes in, you do mostly inference with
that.
But then also you've got to figure out a way to get that into your data store.
So your feature store and the features and the models will be updated.
So that whole process, walk me through that
and how generally a good enterprise should do it,
good data enterprise should do it.
Yeah, yeah, yeah.
And streaming data is a super interesting change.
I was at Confluent for a few years before joining Tecton.
So very, very familiar with what's going on with Kafka.
And I think a lot of the use cases initially for Kafka were really about application integration.
It's an application team that needs to have better access to data sitting in a database
and wants to build an event-driven application.
And then eventually it makes its way into big data and analytics.
And I think specifically in this world of analytics, what streaming data means is that you get your data much faster.
Like it's no longer just available on a daily basis after the batch job.
It is now coming in continuously with maybe a few seconds delay, but it's coming in very fast.
And so I think the impact there is that with the speed of data coming in, it's even more
difficult for humans to keep up, right? So in the past, we used to have the daily report with like,
here's the batch data from yesterday, and like, here's how inventory is evolving, or here's how
demand is evolving, and a human can make that decision.
Now with streaming data, you have to make these decisions
and adjustments on a continual basis,
and it's just not possible for humans to keep up.
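To make that concrete, here is a minimal sketch of the kind of continuously fresh value GC is describing: a feature that updates on every incoming event instead of once per daily batch. The event shape, window size, and function names here are illustrative assumptions for the example, not any product's API.

```python
from collections import deque


def rolling_demand(events, window=3):
    """Recompute a rolling average after every event - a feature that
    stays fresh continuously, unlike a once-a-day batch report."""
    recent = deque(maxlen=window)  # keep only the last `window` events
    out = []
    for count in events:
        recent.append(count)
        out.append(round(sum(recent) / len(recent), 2))
    return out


# e.g. orders per minute arriving on a stream
print(rolling_demand([10, 14, 12, 30]))  # [10.0, 12.0, 12.0, 18.67]
```

With a daily batch, a human would see one stale number the next morning; with the streaming version, the value is up to date after every event, which is exactly why an automated model, rather than a person, has to consume it.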
And so in terms of how enterprises should manage this data,
I mean, I think we also see many stages of how refined the data could be.
So data will typically come in very raw, maybe into a data lake of some sort,
and then gets refined many times and gets refined into, for example, a data warehouse.
And that's great for BI.
But then you also need to have access to this refined data
for machine learning, especially if you
want to automate these decisions
and feed analytics into a production
application. Now you've got to automate that decision
making process, and machine
learning ends up being the most efficient
way to do that.
Just like we have clean data
for BI with a data warehouse,
we think that we need that clean data for machine learning.
And that's really where the feature store comes in.
This is a topic actually that came up when we were talking with Karen Lopez,
Datachick.
She was talking about the fact that you need to be careful about what kind of
data you've got in your system and basically the whole process of managing
data.
And I think that's something maybe that people might overlook if they're thinking about bringing
these things in, because I guess in a way people are used to, I think they think that
it's more intelligent than it is and it's more autonomous than it is.
And yet, you know, the whole process of data management becomes even more important when
you've got a computer system in there.
Yes, absolutely.
And I think a lot of the times we see like data management in and of itself is already complicated enough, right?
Like all these batch pipelines, super difficult to manage,
how do you make sure that the data is clean, that there's no data drift,
that you do great data validation at ingest.
And then all of that becomes even much more complicated
once you move these pipelines to production.
And so building these production pipelines,
which is really what enables us to refine this data
for machine learning,
is actually a super complicated process.
And, you know, we're talking about how difficult it is to bring
machine learning to production. I think a lot of time that's being lost today in production
machine learning projects is actually being spent on that process of building production pipelines
to serve clean, highly refined data to machine learning models.
There's a point you mentioned in there which is very key. You said the data scientists, you know, have too much
data. And from what I've seen before, look, at the end of the day, data
scientists are a new crop of engineers or scientists, however you want to call them,
that are coming on board.
And given that the data economy is not that old at all, you don't have that many data engineers. Hence,
you know, finding qualified, experienced, good ones is very expensive. And, you know,
companies land up hiring, paying a ton of money for them. And then what I hear commonly is that
about 80% of their time is spent in, you know, data management,
right? Data cleansing, data management, data wrangling, you know, finding features, all of
the above, right? They spend only 20% of the time producing the actual models, which is their job,
okay? And that's an issue in its own right, number one, because they are only 20% efficient. And then
another common pattern I'm seeing, which is even more problematic, is that even though they create amazing models,
the engineering team or the DevOps team or the infrastructure team couldn't figure out how to get the model into production, or how to productionize a proper model. And then half of the models are thrown away.
Which means effectively you're doing less than 10% of the models
originally thought out or ideated.
Why is that?
And how can it be fixed?
Yeah.
Yeah.
10% is definitely not a great metric, not a great place to be at.
You know, I think it goes back to the challenge that we were talking about at the
beginning, which is we just don't have the tooling to get machine learning to production today. I
mean, it's two things. It's a combination of processes and capabilities, like what does the
organization look like? What are people able to do? And then on the other hand, it's like tooling.
And really this notion of bringing machine learning
to production is a completely new endeavor for most organizations and so they don't have
the right processes and all the right tooling in place and it's very complicated because
with a traditional application, all we need to get to production is code. And code is something that we own, we manage, we build it, so it's got relatively
low entropy, and we can control it pretty easily. Models and data are a little bit
different. Models are mostly like code: mostly stateless and, you know, controllable, manageable.
Data is very, very different,
but you now have to treat data like code
because data is going to define your application
when it comes to machine learning,
in the sense that if you train your model
with a different data set,
the application is going to behave differently.
And so you need to be able to manage that data
just with the same reliability and efficiency
as you've been managing code in the past.
And it's way more complicated
because we're not building data.
We source data from a number of places.
Typically, your data is imperfect.
It's not always a totally known quantity,
like it may drift or change over time.
And yet, we have to manage this refined data
we feed into our models
with the same efficiency as code.
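One minimal way to picture "treating data like code" is to version a training set the way source code is versioned, with a deterministic content fingerprint. This is just an illustrative sketch under my own assumptions (rows as JSON-serializable dicts), not any particular tool's API.

```python
import hashlib
import json


def dataset_version(rows):
    """Deterministic fingerprint of a training set - analogous to a git
    commit for code: training on a changed dataset yields a different
    version, and therefore a differently behaving application."""
    # sort_keys makes the hash independent of dict key order
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


v1 = dataset_version([{"x": 1, "y": 0}])
v2 = dataset_version([{"x": 1, "y": 1}])
print(v1 != v2)  # True - one changed label is a different dataset version
```

Even a toy fingerprint like this gives you what code already has: a way to say exactly which data produced which model, and to detect when the data has drifted out from under you.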
I think that's what creates a lot of these challenges is we're not set up to manage that well today.
And so if we look at these two new artifacts, models and data for machine learning,
we think there's actually quite a bit of innovation in terms of getting models to production.
There's platforms like Kubeflow, SageMaker, and others in that MLOps space, which are really designed to get models to production
quickly and reliably. And that space is not perfect today, but there's definitely innovation,
and it's getting better. What is still highly problematic is tooling for data. And as long as
we don't have tooling for data, it is going to be very difficult
to get machine learning to production.
And so what's going on there is
you have a data scientist
who's not a software engineer,
not a data engineer.
Like they do exploration
and they use notebooks and Python
and like code that's mostly optimized
for data exploration and experimentation.
They figure out the features, they train the models,
and then they pass the features to an engineering team
who can re-implement that pipeline for production.
But that process of handing off things to a different team
for re-implementation is very inefficient
and really goes against the principles of DevOps.
And so what we think we need is processes and tooling to also manage machine learning data with the same efficiency and bring DevOps to machine learning data.
There's definitely a process component where we need to empower the data scientists to control the features all the way from development to production.
But there's also a big tooling gap.
Like to enable that, we need the right tools underneath.
And that's really where Tecton is investing in bringing DevOps to machine
learning data.
And in support of that transition, there's like this nascent product
category that's called a feature store, which is really what Tecton does.
And it seems to be emerging.
Like we're seeing a lot of
inbound inquiries and a lot of interest around
feature stores, and
that to me is tooling
that is an essential component
of that DevOps stack for
machine learning, for getting operational
machine learning to production.
That notion of feature
store.
Okay, I get that. I mean, based on that explanation,
almost every digital company or data enterprise company
or data economy company should have the problem.
But from what I'm seeing, Uber had the problem
and they put a ton of money, what, two, three years ago
into Michelangelo platform to build that,
which came out really great.
And then Lyft is doing some of
that too. And then Gojek is doing that. So there are only very few companies doing this to solve
this issue. Is that because other digital enterprises don't have the problem or don't
have enough data or they don't know that there is a problem, such problem exists? They're doing it
the old fashioned way without knowing there is a problem.
Yeah, yeah, great question.
You know, I think it really depends
what companies are at in the operational
machine learning journey.
So if they're just, you know, doing batch machine learning
or like batch inference,
this problem may not be as prominent.
If they just have a couple of models in production,
those models are not using streaming data and don't need very fresh features,
maybe this is not as much of an issue.
But definitely, as companies begin to use very fresh data,
like streaming data sources or real-time data sources,
as they begin to expand into not deploying
just one model in production, but tens of models,
as they expand their data science teams, this problem becomes very evident over time.
But there's also not really been any commercial or open source offerings in the past.
And so companies had a choice to either build their own infrastructure, very much like Uber did with Michelangelo.
And this is something we have seen a number of companies actually building
their own feature stores in-house
or they can do things manually
like they've been doing in the past.
But then building a feature store in-house
is not a small investment.
It's a complicated piece of technology.
And it's almost like asking,
if you're building a BI stack,
do you want to be building your own data warehouse?
Like, is that the best investment dollar that you can spend?
Whereas if you could buy a great data warehouse off the shelf, wouldn't that be an easier way to get there?
I think those are some of the discussions that we're having with customers is like, is it the right thing to be investing on building your own feature store?
Or can you just buy one, get up to speed much faster and not have to invest those engineering dollars?
And it seems like that's a difference between the various products in this category.
Do you want to talk to that a little bit?
Yeah, I think it's fascinating.
You know, it's a feature store really is like a nascent product category.
Like two, three years ago, there was no commercial offering available.
Now there's a few companies,
Tecton being one of those.
There's also some interesting open source projects appearing like Feast, for example,
which is an open source project out of Gojek.
But what's also obvious
is that the definition of a feature store
is very different across companies.
And I think as an industry,
we kind of need to converge on a common definition.
And looking at the Tecton definition, for example, we think a feature store should cover
like the build, run, manage spectrum in the sense that it should enable data scientists
to build features collaboratively, have standardized feature definitions, and then apply those
to a feature store, which then allows you to run your models by automatically processing the feature values, curating those
feature values, and serving them for online inference.
And then all of that should be manageable with like a manager where you can do things
like discover features, track data lineage, monitor data for things like data drift and service levels
for online serving.
And so for us, that's like the full spectrum.
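The build/run/manage spectrum GC describes can be sketched with a toy in-memory store. All class, method, and feature names here are illustrative assumptions for the example; this is not Tecton's actual SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToyFeatureStore:
    """Toy in-memory feature store covering build, run, and (partly) manage."""
    definitions: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)
    values: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        # "Build": standardized feature definitions, shared across teams.
        self.definitions[name] = fn

    def materialize(self, entity_id: str, raw: dict) -> None:
        # "Run": compute and curate feature values from raw data.
        self.values[entity_id] = {n: fn(raw) for n, fn in self.definitions.items()}

    def get_online(self, entity_id: str) -> Dict[str, Any]:
        # Serve curated values at low latency for online inference.
        return self.values[entity_id]


store = ToyFeatureStore()
store.register("trip_count_7d", lambda raw: len(raw["trips_last_7d"]))
store.register("avg_fare_7d",
               lambda raw: sum(raw["trips_last_7d"]) / len(raw["trips_last_7d"]))
store.materialize("user_42", {"trips_last_7d": [12.0, 8.5, 9.5]})
print(store.get_online("user_42"))  # {'trip_count_7d': 3, 'avg_fare_7d': 10.0}
```

The point of the sketch is the separation of concerns: definitions are registered once as a single source of truth, values are computed and curated automatically, and serving is a simple low-latency lookup rather than an ad hoc pipeline rebuilt per model.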
But then definitely depending on where you're at as an organization.
So for example, some organizations already have production pipelines.
And so they don't necessarily need a feature store to build new pipelines and new features. They really need a way to store the values, having a single source of truth of features, of data,
and then being able to serve those both for online and training.
And so those are two different things.
So some of the feature stores don't need to cover that build-run-manage spectrum.
Some of them actually focus more on the run aspect.
I think Feast would be a good example there.
The idea with Feast is that you're coming in with existing pipelines, production pipelines,
and Feast is going to focus on curating the data, providing a single source of truth and serving the data.
So yeah, so there's definitely many different definitions and classes of feature stores.
And I think as the product category matures, we are going to see kind of a more common
definition over time.
But definitely very interesting to see that dynamic of this new product category taking
shape.
I know that it's an existing problem with the digital enterprises.
And you know that, but a lot of companies don't know that.
So meaning if you have a ton of data,
if you figure out certain features,
because obviously the feature is the basis for the model.
So when you have an idea, you have a business idea,
you have a problem, figure out a problem, find out the feature set, and then create a model and try to productionize
that. The critical component is identifying the feature. Because let's say if you have like 100
data scientists across the board, whether within your organization or even partner organizations,
if they were going to be looking to solve a similar problem,
there is no place they can go and take a look at saying that,
okay, has this been done before?
That's a major problem that some of those feature stores,
including yours, solve.
If they're not using you, how are they solving the problem today?
Are they even looking at the problem?
So I think the question is like,
how are they doing feature extraction
and feature engineering today, right?
Yeah, yeah.
Yeah.
Feature stores, feature engineering.
Yeah, yeah.
You know, good question.
I think there is tooling for like data scientists
to do data exploration and they typically use,
you know, there's data preparation tools like
Alteryx, there's data engineering tools like Spark and Databricks and the notebooks.
And I think all that stuff is great for like doing data exploration and experimentation
on features.
But what those tools don't do is get your features to production.
And so that's where you're left with that gap of like,
I do have tooling as a data scientist to do feature engineering,
but I'm not empowered to get that data,
those features all the way to production.
And that's where the process breaks down because that's where you have to throw your features over the wall
to the separate data engineering team.
And then you're dealing with like months of delays
and a lot of like coordination between data scientists and data engineers.
And that's why we think the process needs to become a lot simpler.
We need to empower data scientists to not just like engineer features, but really get
them to production.
Well, that explains why more than 50% of the models fail to make it into production,
right?
Yeah, yeah, yeah, for sure.
Like, I don't know what the exact number is,
but we do see it a lot.
And like, even for the ones that do make it to production,
it oftentimes just takes way too long, right?
Like if it takes a year to get a model to production,
by the time it's in production,
your data scientists have like 10 new ideas
of things that they could make better,
like that they want to iterate on,
but they just can't implement them.
So I think it's frustrating for many teams
to be stuck in that situation.
So sort of to wrap up the conversation then,
I think one of the things that happened
when I was talking to you, GC, last time on our briefing
is when I came to understand
sort of the fundamental analogy of what you're
describing to sort of what we already do in enterprise tech.
The fact that a feature is really sort of the AI equivalent of a data point or a file,
an image, all these other things that we're already used to storing. The idea that we
need a specialized storage platform for features in AI suddenly just, yes, yes, we do. Of course
we need that. It made a lot of sense to me. And I think that after this discussion, I think a lot
of the listeners may be saying the same thing. Andy, do you want to sum up a little bit here of sort of what is this and how
does it affect people and how will it come into the enterprise? Right now, most of the feature store work,
getting the features and the model to production, is done very manually. It's a very cumbersome process, very manual,
very inefficient, as GC was saying. It's not uncommon to see some of the models,
even if it's a kick-ass model, if it is produced within a matter of weeks, by the time when the
DevOps teams and the infrastructure teams figure out how to get the model into production, how to keep it up to
date, it could take months, if not close to a year. By then, probably that's not even a viable
model. Your business model has changed. Your business problem has changed. Maybe you solved
the problem, or maybe you're not even in business given the current economy, right? So not only
should creating a model work faster, getting insights from
the data, but also getting it into production should be the fastest and most efficient. And that's where
companies like Tecton would help. All right. Well, thank you so much. Yeah. So, I mean,
I think your observation was very accurate. But, you know, I think you compared it to storage.
I think I would compare it more to like a mix between a database and a data warehouse
because ultimately what we do is we curate data, we provide highly refined data,
but we also serve that data online for models, right?
So it's like this highly refined analytical data that gets served online at very low latency.
And so from that standpoint, it's kind of like a hybrid, I'd say, between a database and a data warehouse.
But yeah, there's no question. I mean, I think you're going to see a lot of investment in the coming few years on platforms for operational machine learning,
MLOps platforms, data platforms for ML, because there's such a big gap today, it has to be the future.
You know, we've talked about software is eating the world,
and indeed it has had a huge impact over the past decade.
We also talk about data is the new oil, which is also very accurate,
but somehow these two things need to come together, right?
Like analytics and the world of software and production software need to come together.
And the path to get there is via machine learning
and getting machine learning to production.
And the path to get there is really by having better tooling
and better infrastructure and investment
in these MLOps or DevOps platforms for machine learning.
So, you know, strong fan of what's happening there.
Very excited.
Would be super interesting to see how this space evolves.
Well, thank you very much for joining us today.
GC, where can people connect with you
and follow your thoughts on enterprise AI and other topics?
Yeah, for sure.
So, tecton.ai and our blog there is a great place to go.
On Twitter, @TectonAI. Those are the two best
places to follow us. Great. And Andy? You can find me on Twitter at @AndyThurai, or you can find
me on my website. That's thefieldcto.com. Again, that's thefieldcto.com.
Thanks a lot. And you can find me on Twitter at @SFoskett, and you'll find my writing at
gestaltit.com. Thank you for listening to the Utilizing AI podcast. If you enjoyed the discussion,
please remember to subscribe, rate, and review the show in iTunes, since that really does help
our visibility. And please share this show with your friends. This podcast was brought to you by gestaltit.com,
your home for IT coverage across the enterprise,
and thefieldcto.com.
For show notes and more episodes,
go to utilizing-ai.com
or find us on Twitter at utilizing underscore AI.
Thanks, and we'll see you next time.