Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 10: Bringing DevOps Principles to MLOps with @GaetCast

Episode Date: October 27, 2020

Stephen Foskett and Andy Thurai discuss the parallels between DevOps and MLOps with Gaetan Castelein of Tecton. We are in the middle of a shift in analytics and software engineering, with DevOps and continuous deployment, and this is colliding with the development of data analytics and big data. Machine learning allows organizations to handle this explosion of data, build new applications, and automate new business processes, but MLOps must be converged with big data and DevOps tooling to make this a reality. One key enabler of this transformation is the creation of an ML feature store, which stores curated features for machine learning pipelines. Feature stores typically enable users to build features, have standardized feature definitions, run models using these curated features, and manage MLOps.

This episode features:
Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett
Andy Thurai, technology influencer and thought leader. Find Andy's content at theFieldCTO.com and on Twitter at @AndyThurai
Gaetan Castelein, VP of Marketing at Tecton (@TectonAI). Find Gaetan on Twitter at @GaetCast

Date: 10/27/2020
Tags: @SFoskett, @AndyThurai, @GaetCast, @TectonAI

Transcript
Starting point is 00:00:00 Welcome to Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise infrastructure together to discuss applications of AI in today's data center. Today, we're discussing the application of DevOps principles to machine learning and MLOps. I'm Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. You can find me on Twitter at sfoskett. Now let's meet my co-host, Andy Thurai. Hello, I am Andy Thurai, founder and principal at thefieldcto.com. You can find me on Twitter at Andy Thurai. All right. Hi. So I'm Gaetan Castelein, also known as GC. I'm head of marketing at Tecton. I've been at the company now for about a year. At Tecton, we're building a data platform or a
Starting point is 00:00:54 feature store for machine learning. So we're very excited to be talking about that topic with you today. And you can find us at tecton.ai. Okay, so today's conversation is an interesting one. So we had an enterprise IT era where we had all of the mature components, from mainframe to three-tier, and all that messaging service middleware, the whole nine yards. And then we moved from there into more of a cloud-based economy, right? And everything was about infrastructure, everything about code, everything about automation. But now we are kind of in the nascent stages of moving into the data economy. But the problem with that is that there's way too much data we are getting in and we are not using it enough to get proper insights from it. One issue could be because we don't have a good set of tools.
Starting point is 00:01:46 Second issue could be that we don't know how to engage in data economy. JC, what do you think? What are the issues you're seeing? Yeah, I think it's a great question. So generally speaking, we're seeing kind of a major shift in analytics and software engineering in the sense that over the past decade or so, we've made a lot of investments and progress in software engineering. We have all this DevOps tooling. We can now deploy applications on a daily basis,
Starting point is 00:02:15 release code much faster than in the past. And in parallel to that, there's been a lot of innovation in big data and analytics. It started with like this Hadoop transformation, but there's Snowflake, the cloud data warehouse, there's Spark, new processing engines. So we've really kind of beefed up our ability to manage and analyze vast amounts of data.
Starting point is 00:02:40 The problem that we see is these two stacks have really evolved in different silos. And now we're getting to the point where the volume, the speed of data, just becomes unmanageable with the human in the loop. And that's really where machine learning comes in. Machine learning allows you to generate inside influence on data at a scale where an individual human couldn't keep up. But to really benefit from machine learning, we also need to get into a mode where we can bring machine learning to production. We really need to be able to take that human out of the loop and build new customer-facing
Starting point is 00:03:18 applications, automate new business processes using machine learning. But that also requires a huge change. It requires us to take that analytics stack and kind of converge it with that software engineering stack so that we're able to bring analytics to production. And that's a big change. And I think what's happening now, what we're seeing is all this DevOps tooling that we've built for software engineering, we don't have that for machine learning and for analytics. And all of that tooling needs to really appear on the scene.
Starting point is 00:03:51 Otherwise, it is going to be increasingly difficult for us to really get machine learning to production. And that's where this notion of bringing DevOps to machine learning and machine learning data really comes in. So there's a question, there are a couple of points in there that you talked about, I want to double click. The particular ones, let me ask you the first one, you talked about having a humongous amount of data, right? That's a debatable term because, you know, some people will have megabytes of data and they say, oh my God, I got too much of data. And then a fully automated digital economy, for example, I could throw on names like Uber is one example.
Starting point is 00:04:29 They not only they have a ton of data, even the number of models they get, that they cycle through, people don't realize that if I request, you know, over service, the amount of models they do to predict, to connect me with the driver. Anupam Sharma, MD, The pricing models that they run through that often how it is the number of models itself could be hundreds of models that they'll do so. Anupam Sharma, MD, So talk about those two. What do you mean by large amount of data on the data economy. What, what do you think is large amount for enterprise for them to start thinking about and second how often models get updated and those kinds of things as well?
Starting point is 00:05:07 Yeah, a great question. You know, I think large is obviously a very relative term, but I think the point is that like the volume of data is constantly increasing and increasing pretty fast. And we now have, for example, streaming data where data comes much faster than we used to have it. And so the point where we talk about big data is really, I think, when it becomes very difficult to generate all of the inside, squeeze all the value you can
Starting point is 00:05:38 from your data with human processes, right? And so that's where your automation comes into play. On your question on Uber, yeah, it's fascinating. So Tecton was founded by the original creators of the Uber Michelangelo platform. And five years ago, super difficult for Uber to get even one model to production. And then they decided to invest in this infrastructure for machine loading, like an MNOps platform that took care of both of models and
Starting point is 00:06:10 data. And that allowed them to now deploy tens of thousands of models in production and covering a very broad range of use cases like ETA forecasting, like surge pricing, like fraud detection. And it's kind of been a fascinating transformation. And a lot of that was enabled by Michelangelo because what Michelangelo did was bring those DevOps processes to this process of building machine learning models
Starting point is 00:06:36 and enable data scientists to build new models and get them to production quickly and reliably, just like we do with software today in most organizations. So you talked about one specific element of the streaming data coming in at a speed, the velocity that we have never seen before. Do the digital enterprises have a problem because of too much of a velocity of the streaming data that's coming in or because they have to figure out a way to integrate that with
Starting point is 00:07:09 the existing data and the models? And also because obviously when the streaming data comes in, you do mostly inference with that. But then also you've got to figure out a way to get that into your data store. So your future store and the future and the models will be updated. So that whole process, walk me through that and how generally a good enterprise should do it, good data enterprise should do it.
Starting point is 00:07:33 Yeah, yeah, yeah. And streaming data is a super interesting change. I was at Confluent for a few years before joining Tekton. So very, very familiar with what's going on with Kafka. And I think a lot of the use case was initially for Kafka really about application integration. It's an application team that needs to have better access to data sitting in a database and wants to build an event-driven application. And then eventually it makes its way into big data and analytics.
Starting point is 00:08:06 And I think specifically in this world of analytics, what streaming data means is that you get your data much faster. Like it's no longer just available on a daily basis after the batch job. It is now coming in continuously with maybe a few seconds delay, but it's coming in very fast. And so I think the impact there is that with the speed of data coming in, it's even more difficult for humans to keep up, right? So in the past, we used to have the daily report with like, here's the batch data from yesterday, and like, here's how inventory is evolving, or here's how demand is evolving, and a human can make that decision. Now with streaming data, you have to make these decisions
Starting point is 00:08:49 and adjustments on a continual basis, and it's just not possible for humans to keep up. And so in terms of how enterprises should manage this data, I mean, I think we also see many stages of how refined the data could be. So data will typically come in very raw, maybe into a data lake of some sort, and then gets refined many times and gets refined into, for example, a data warehouse. And that's great for BI. But then you also need to have access to this refining data
Starting point is 00:09:26 for machine learning, especially if you want to automate these decisions and feed analytics into a production application. Now you've got to automate that decision making process, and machine learning ends up being the most efficient way to do that. Just like we have clean data
Starting point is 00:09:41 for BI with a data warehouse, we think that we need that clean data for machine learning. And that's really where the feature store comes in. This is a topic actually that came up when we were talking with Karen Lopez, data check. She was talking about the fact that you need to be careful about what kind of data you've got in your system and basically the whole process of managing data.
Starting point is 00:10:05 And I think that's something maybe that people might overlook if they're thinking about bringing these things in, because I guess in a way people are used to, I think they think that it's more intelligent than it is and it's more autonomous than it is. And yet, you know, the whole process of data management becomes even more important when you've got a computer system in there. Yes, absolutely. the whole process of data management becomes even more important when you've got a computer system in there. Yes, absolutely. And I think a lot of the times we see like data management in and of itself is already complicated enough, right?
Starting point is 00:10:33 Like all these batch pipelines, super difficult to manage, how do you make sure that the data is clean, that there's no data drift, that you do great data validation at ingest. And then all of that becomes even much more complicated once you move these pipelines to production. And so building these production pipelines, which is really what enables us to refine this data for machine learning,
Starting point is 00:10:58 is actually a super complicated process. And, you know, we're talking about how difficult it is to bring machine learning to production. I think a lot of time that's being lost today in production machine learning projects is actually being spent on that process of building production pipelines to serve clean, highly refined data to machine learning models. There's a point you mentioned on there, which is very key. You said the data scientists, you know, have this too much of data and then from what I've seen before, look at the end of the day, data scientists is a new crop of engineers or scientists, however you want to call it,
Starting point is 00:11:42 that they are coming in on board. And given that the data economy is not at all, you don't have that many data engineers. Hence, you know, finding a qualified experience, good ones is very expensive. And you know, companies land up hiring, paying a ton of money for them. And then what I hear commonly is that about 80% of their time is spent in, you know, data management, right? Data cleansing, data management, data wrangling, you know, finding futures, all of the above, right? They spend only 20% of the time in producing the actual models, which is their job, okay? And that's in its own position, number one because they are only 20% efficient right it and then
Starting point is 00:12:28 Another common pattern. I'm seeing which is even more problematic is that even though they create amazing models. The engineering team or the DevOps team or the infrastructure team couldn't figure out how to get the model into production or productionizing a proper model. And then half of the models are thrown away. So which means effectively you're doing less than 10% of the models originally thought out or ideated. Why is that? And how can it be fixed? Yeah. Yeah.
Starting point is 00:12:56 10% is definitely not a great, great metric, great place to be at. You know, I think it goes back to the challenge that we were talking about at the beginning, which is we just don't have the tooling to get machine learning to production today. I mean, it's two things. It's a combination of processes and capabilities, like what does the organization look like? What are people able to do? And then on the other hand, it's like tooling. And really this notion of bringing machine learning to production is a completely new endeavor for most organizations and so they don't have the right processes and all the right tooling in place and it's very complicated because
Starting point is 00:13:35 with a traditional application all we need to get to production is code and code is something that we own we manage we build it so it's got relatively low entropy and and we kind of we can control it pretty easily models and data are a little bit different like models mostly like code mostly stateless and and you know controllable manageable data is very very different data, but you now have to treat data like code because data is going to define your application when it comes to machine learning, in the sense that if you train your model
Starting point is 00:14:14 with a different data set, the application is going to behave differently. And so you need to be able to manage that data just with the same reliability and efficiency as you've been managing code in the past. And it's way more complicated because we're not building data. We source data from a number of places.
Starting point is 00:14:30 Typically, your data is imperfect. It's not always a totally known quantity, like it may drift or change over time. And yet, we have to manage this refined data we feed into our models with the same efficiency as code. I think that's what creates a lot of these challenges is we're not set up to manage that well today. And so if we look at these two new artifacts, models and data for machine learning,
Starting point is 00:14:57 we think there's actually quite a bit of innovation in terms of getting models to production. There's platforms like Kubeflow, SageMaker, and others in that MNOP space, which are really designed to get models to production quickly and reliably. And that space is not perfect today, but there's definitely innovation, and it's getting better. What is still highly problematic is tooling for data. And as long as we don't have tooling for data, it is going to be very difficult to get machine learning to production. And so what's going on there is you have a data scientist
Starting point is 00:15:30 who's not a software engineer, not a data engineer. Like they do exploration and they use notebooks and Python and like code that's mostly optimized for data exploration and experimentation. They figure out the features, they train the models, and then they pass the features to an engineering team
Starting point is 00:15:49 who can re-implement that pipeline for production. But that process of handing off things to a different team for re-implementation is very inefficient and really goes against the principles of DevOps. And so what we think we need is processes and tooling to also manage machine learning data with the same efficiency and bring DevOps to machine learning data. There's definitely a process component where we need to empower the data scientists to control the features all the way from development to production. But there's also a big tooling gap. Like to enable that, we need the right tools underneath.
Starting point is 00:16:27 And that's really where Tecton is investing in bringing DevOps to machine learning data. And in support of that transition, there's like this nascent product category that's called a feature store, which is really what Tecton does. And it seems to be emerging. Like we're seeing a lot of inbound inquiries and a lot of interest around feature stores, and
Starting point is 00:16:49 that to me is tooling that is an essential component of that DevOps stack for machine learning, for getting operational machine learning to production. That notion of feature store. Okay, I get that. I mean, based on that explanation,
Starting point is 00:17:06 almost every digital company or data enterprise company or data economy company should have the problem. But from what I'm seeing, Uber had the problem and they put a ton of money, what, two, three years ago into Michelangelo platform to build that, which came out really great. And then Lyft is doing some of that too. And then GoJack is doing that. So there are only very few companies doing this to solve
Starting point is 00:17:31 this issue. Is that because other digital enterprises don't have the problem or don't have enough data or they don't know that there is a problem, such problem exists? They're doing it the old fashioned way without knowing there is a problem. Yeah, yeah, great question. You know, I think it really depends what companies are at in the operational machine learning journey. So if they're just, you know, doing batch machine learning
Starting point is 00:17:57 or like batch influence, this problem may not be as prominent. If they just have a couple of models in production, those models are not using streaming data may not be as prominent. If they just have a couple of models in production, those models are not using streaming data and don't need very fresh features, maybe this is not as much of an issue. But definitely, as companies begin to use very fresh data, like streaming data sources or real-time data sources,
Starting point is 00:18:19 as they begin to expand into not deploying just one model in production, but tens of models, as they expand their data science teams, this problem becomes very evident over time. But there's also not really been any commercial or open source offerings in the past. And so companies had a choice to either build their own infrastructure, very much like Uber did with Michelangelo. And this is something we have seen a number of companies actually building their own feature stores in-house or they can do things manually
Starting point is 00:18:52 like they've been doing in the past. But then building a feature store in-house is not a small investment. It's a complicated piece of technology. And it's almost like asking, if you're building a BI stack, do you want to be building your own data warehouse? Like, is that the best investment dollar that you can spend?
Starting point is 00:19:09 Whereas if you could buy a great data warehouse off the shelf, wouldn't that be an easier way to get there? I think those are some of the discussions that we're having with customers is like, is it the right thing to be investing on building your own feature store? Or can you just buy one, get up to speed much faster and not have to invest those engineering dollars? And it seems like that's a nas difference between the various products in this category. Do you want to talk to that a little bit? Yeah, I think it's fascinating. You know, it's a feature store really is like a nascent product category. Like two, three years ago, there was no commercial offering available.
Starting point is 00:20:02 Now there's a few companies, Tecton being one of those. There's also some interesting open source projects appearing like Feast, for example, which is an open source project out of Gojek. But what's also obvious is that the definition of a feature store is very different across companies. And I think as an industry,
Starting point is 00:20:23 we kind of need to converge on a common definition. And looking at the Tecton definition, for example, we think a feature store should cover like the build, run, manage spectrum in the sense that it should enable data scientists to build features collaboratively, have standardized feature definitions, and then apply those to a feature store, which then allows you to run your models by automatically processing the feature values, curating those feature values, and serving them for online influence. And then all of that should be manageable with like a manager where you can do things like discover features, track data lineage, monitor data for things like data drift and service levels
Starting point is 00:21:07 for online serving. And so for us, that's like the full spectrum. But then definitely depending on where you're at as an organization. So for example, some organizations already have production pipelines. And so they don't necessarily need a feature store to build new pipelines and new features. They really need a way to store the values, having a single source of truth of features, of data, and then being able to serve those both for online and training. And so those are two different things. So some of the feature stores don't need to cover that build-run-manage spectrum.
Starting point is 00:21:47 Some of them actually focus more on the run aspect. I think Feast would be a good example there. The idea with Feast is that you're coming in with existing pipelines, production pipelines, and Feast is going to focus on curating the data, providing a single source of truth and serving the data. So yeah, so there's definitely many different definitions and classes of feature stores. And I think as the product category matures, we are going to see kind of a more common definition over time. But definitely very interesting to see that dynamic of this new product category taking
Starting point is 00:22:23 shape. I know that it's an existing problem with the digital enterprises. And you know that, but a lot of companies don't know that. So meaning if you have a ton of data, if you figure out certain futures, because obviously future is the basis for model. So when you have an idea, you have a business idea, you have a problem, figure out a problem, find out the future set, and then create a model and try to productionize
Starting point is 00:22:51 that. The critical component is identifying the future. Because let's say if you have like 100 data scientists across the board, whether within your organization or even partner organizations, if they were going to be looking to solve a similar problem, there is no place they can go and take a look at saying that, okay, has this been done before? That's a major problem that some of those feature stores, including yours, solve. If they're not using you, how are they solving the problem today?
Starting point is 00:23:21 Are they even looking at the problem? So I think the question is like, how are they doing feature extraction and future engineering today, right? Yeah, yeah. Yeah. Feature source, feature engineering. Yeah, yeah.
Starting point is 00:23:35 You know, good question. I think there is tooling for like data scientists to do data exploration and they typically use, you know, there's data preparation tools like Alteryx, there's data engineering tools like Spark and Databricks and the notebooks. And I think all that stuff is great for like doing data exploration and experimentation on features. But what those tools don't do is get your features to production.
Starting point is 00:24:06 And so that's where you're left with that gap of like, I do have tooling as a data scientist to do feature engineering, but I'm not empowered to get that data, those features all the way to production. And that's where the process breaks down because that's where you have to throw your features over the wall to the separate data engineering team. And then you're dealing with like months of the days and a lot of like coordination between data scientists and data engineers.
Starting point is 00:24:26 And that's, that's what we think the process needs to become a lot simpler. We need to empower data scientists to not just like engineer features, but really get them to production. Well, that explains why more than 50% of the models fail not to make it into the production, right? Yeah, yeah, yeah, for sure. Like, I don't know what the exact number is, but we do see it a lot.
Starting point is 00:24:48 And like, even for the ones that do make it to production, it oftentimes just takes way too long, right? Like if it takes a year to get a model to production, by the time it's in production, your data scientists have like 10 new ideas of things that they could make better, like that they want to iterate on, but they just can't implement them.
Starting point is 00:25:04 So I think it's frustrating for many teams to be stuck in that situation. So sort of to wrap up the conversation then, I think one of the things that happened when I was talking to you, GC, last time on our briefing is when I came to understand sort of the fundamental analogy of what you're describing to sort of what we already do in enterprise tech.
Starting point is 00:25:31 The fact that a feature is really sort of the AI equivalent of a data point or a file, an image, all these other things that we're already used to storing. The idea that we need a specialized storage platform for features in AI suddenly just, yes, yes, we do. Of course we need that. It made a lot of sense to me. And I think that after this discussion, I think a lot of the listeners may be saying the same thing. Andy, do you want to sum up a little bit here of sort of what is this and how does it affect people and how will it come into the enterprise? Right now, most of the future store getting the features on the model to production is done very manually. It's a very cumbersome process, very manually, very inefficient, as Jisi was saying. It's not uncommon to see some of the models,
Starting point is 00:26:31 even if it's a kick-ass model, if it is produced within a matter of weeks, by the time when the DevOps teams and the infrastructure teams figure out how to get the model into production, how to keep it up to date, it could take months, if not close to a year. By then, probably that's not even a viable model. Your business model has changed. Your business problem has changed. Maybe you solved the problem, or maybe you're not even in business given the current economy, right? So not only creating a model faster should work faster, getting insights from the data, but also getting into production should be the fastest, most efficient. And that's where companies like TechTown would help. All right. Well, thank you so much. Yeah. So, I mean,
Starting point is 00:27:20 I think your observation was very accurate. Like we have, you know, but I think you compare it to storage. I think I would compare it more to like a mix between a database and a data warehouse because ultimately what we do is we curate data, we provide highly refined data, but we also serve that data online for models, right? So it's like this highly refined analytical data that gets served online at very low latency. And so from that standpoint, it's kind of like a hybrid, I'd say, between a database and a data warehouse. It's highly refined analytical data that gets served online at very low latency. And so from that standpoint, it's kind of like a hybrid, I'd say, between a database and a data warehouse. But yeah, there's no question. I mean, I think you're going to see a lot of investment in the coming few years on platforms for operational machine learning,
Starting point is 00:28:00 MLOps platforms, data platforms for ML, because there's such a big gap today, it has to be the future. You know, we've talked about software is eating the world, and indeed it has had a huge impact over the past decade. We also talk about data is the new oil, which is also very accurate, but somehow these two things need to come together, right? Like analytics and the world of software and production software need to come together. And the path to get there is via machine learning and getting machine learning to production.
Starting point is 00:28:32 And the path to get there is really by having better tooling and better infrastructure and investment in these MNOps or DevOps platforms for machine learning. So, you know, strong fan of what's happening there. Very excited. Would be super interesting to see how this space evolves. Well, thank you very much for joining us today. GC, where can people connect with you
Starting point is 00:28:53 and follow your thoughts on enterprise AI and other topics? Yeah, for sure. So, tecton.ai, in our blog there is a great place to go. On Twitter, at tectonownAI. Those are the two best places to follow us. Great. And Andy? You can find me on Twitter at Andy Thurai or you can find on me on my website. That's thefieldcto.com. Again, that's thefieldcto.com. Thanks a lot. And you can find me on Twitter at S Foskett, and you'll find my writing at gestaltit.com. Thank you for listening to the Utilizing AI podcast. If you enjoyed the discussion,
Starting point is 00:29:37 please remember to subscribe, rate, and review the show in iTunes, since that really does help our visibility. And please share this show with your friends. This podcast was brought to you by gestaltit.com, your home for IT coverage across the enterprise, and thefieldcto.com. For show notes and more episodes, go to utilizing-ai.com or find us on Twitter at utilizing underscore AI. Thanks, and we'll see you next time.
