The Data Stack Show - Data Council Week (Ep 5) - The Difference Between Data Platforms and ML Platforms with Michael Del Balso of Tecton

Episode Date: April 26, 2023

Highlights from this week's conversation include:
- Michael's journey to co-founding Tecton (0:22)
- The evolution of MLOps and platform teams (3:50)
- Understanding boundaries between the data platform and MLOps (8:42)
- Differences in machine learning vs. data pipelines (16:58)
- The systems needed to handle all these types of data (22:22)
- Developer experience in Tecton (25:15)
- Automating challenges in ML development (32:30)
- The most difficult part of the life cycle of prediction (37:24)
- Exciting new developments at Tecton (39:27)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. All right, we're back in person at Data Council Austin, and we have got Mike Del Balso. He's the co-founder and CEO of Tecton, and super excited to chat with him. I'm Brooks. Again, if you've been following along, you've heard. Eric did not make it to the
Starting point is 00:00:38 conference, so I'm filling in. You're stuck with me, but we have Costas here as well. He's extremely excited to chat with Mike. So, Mike, welcome to the show. Thanks for having me. Yeah. First, could you just tell us about your background and kind of what led to you starting Tecton? For sure. So, yeah, I've been in machine learning for a while now.
Starting point is 00:00:59 I actually got into it kind of randomly. I worked at Google, and I began working on the machine learning teams that power the ads auction. So I was a product manager for, I would say at that time, probably the best productionized ML system in the world, the one that drove all of the ads revenue there. And that was really cool. I did that for a bit and then moved on from that team. Actually, that's the team that published the pretty famous foundational MLOps paper called Machine Learning: The High Interest Credit Card of Technical Debt, or something like that. I always get the words in the wrong order there, but it's a pretty often cited MLOps paper. But then I joined Uber, and I joined Uber at a time
Starting point is 00:01:52 when there was not really a lot of ML happening at Uber. And I was kind of tasked with helping start the ML team and bring things from zero to one. And so we actually created the ML infrastructure platform at Uber called Michelangelo. I joined in 2015, and over a couple of years, two and a half years let's say, we just had a really good run at it. We built some really good platforms. We went from a handful of models in production to literally tens of thousands of models in production making real-time predictions.
Starting point is 00:02:32 It's a huge scale, right? Millions of predictions a second. That powers all the ETA and fraud detection and pricing stuff that happens at Uber. And in going through that and building that stack out, we came up with something called the feature store. I published a blog post about Michelangelo a while back, and a lot of people said, hey, we're trying to solve a lot of these problems that you guys are solving in Michelangelo, and one of the most interesting parts of it, and the hardest part for us, is all of the data side.
Starting point is 00:03:06 The data pipelines that generate the features that power models, all of the real-time serving, all of the monitoring for the data, the storage, the online and offline consistency, stuff like that. And so we recognized there's an industry need for this architecture. And that became the beginning of what we call a feature platform today. And so myself and the engineering lead on Michelangelo, we left to start Tecton. And we've been doing that for
Starting point is 00:03:37 the past couple of years. And so at Tecton, we think of ourselves as kind of the company for feature platforms. We see ourselves as that enterprise feature platform company. We sell to Fortune 500s and top tech companies, folks that aren't necessarily a Google or a Facebook who are for sure going to build it themselves, but everyone else who's trying to do real machine learning in production. We hope we can help them out and provide them the best feature engineering experience in the world. Cool. Before we hit record here, you were talking about platform teams
Starting point is 00:04:15 and kind of how companies handle build versus buy. Do you want to just speak a little more on that? Yeah, it's been interesting seeing the evolution over the past couple of years, because when we started Michelangelo in 2015, there just wasn't a lot of good ML infrastructure or ML tooling. MLOps, I don't even think it was a term at that time, really. And honestly, the concept of buying just never came up for us. We were never like, oh, maybe there's a product. We just assumed that there was no product to do what we wanted to do. And so over time, the industry has grown up, effectively.
Starting point is 00:04:59 The offerings on the market have become more compelling. But in parallel, as the vendor solutions have become somewhat compelling, ML platform teams have grown internally in companies. The need for machine learning has grown, and so the willingness of a company to invest in machine learning and a machine learning platform has grown. And we see it as often a parallel thing to the data platform team. You've probably seen this in a bunch of companies, where the ML platform team is kind of a consumer, a customer, of the data platform team. So the data platform team will manage the data warehouse or the data lake, stuff like that. And then the ML platform team will be like a specialization of that and manage a lot of the ML infrastructure on top of the core data platform. And so these ML platform teams
Starting point is 00:06:07 in a lot of companies have grown quite a bit. They've been building all kinds of cool stuff: managed training, managed model serving, drift detection, feature pipeline management, stuff like that. And I think recently, especially with the economic situation, but even before that a little bit, these teams were ballooning, and sometimes you're getting to like 10, 20, 30 people on an ML platform team. That's expensive. And so we've begun seeing a lot of, hey, now that there are all these solutions we can buy, is it really strategic for us to build our own
Starting point is 00:06:50 training infrastructure? Should we just buy that? Well, it was strategic to build it before, when it didn't exist and you needed to have machine learning in production. And especially on a lot of the data side of things, the place where we play, it's the same kind of thing. And so we're seeing these ML platform teams have this interesting
Starting point is 00:07:10 identity crisis today, where they have to think, okay, well, I thought I came here to invent ML infrastructure. And now their role is a lot more tied to the use cases. It's a lot more tied to, why am I actually here? My team is building a recommender, or someone in my company ultimately needs to make recommendations or needs to detect fraud. And so now it's a lot less of a carte blanche, just build whatever cool tools you want, and a lot more being driven by what the actual need from the business use case is, and how do I help
Starting point is 00:07:51 that end use case team, that business team, map that back to tools, or whatever the right stack is. They have to be kind of the stewards of the right stack. They have to bring that. And I think we're seeing it be a difference in identity and charter for ML platform teams over the past few years. And obviously people are at different parts in that journey, but I think that's a general trend we're seeing as well. Yeah, and it seems like it's kind of just generally, not only ML but across the board with data teams, the business impact is now kind of number one, I would say. You know, machine learning stuff, I feel like it's been particularly this sexy thing where someone could be like, I can just go invent this new, completely greenfield stack; there are no best practices or established right way to do things. So I don't think this is always a
Starting point is 00:08:44 conscious thought for people working on this stuff, but you can see it in people's attitudes. Like, it's a cool place where I can just invent cool tech. And sometimes that was a little bit divorced from what the actual business need is, more so than I saw in, for example, just the normal data stack. So we've been talking about MLOps and platforms. And I'd like to begin the conversation by helping me understand the boundaries between the data platform and the MLOps or ML platform, right?
Starting point is 00:09:18 Yeah. Where each one starts, where it stops. And most importantly, where are the synergies? Because there are synergies, right? I can't imagine that you have an ML platform somewhere without also having at least some sort of pre-existing data platform, right? Right. Well, those are the trickiest situations. We love it when people have great underlying data platforms. And unfortunately, that's not always the case. And so I think this is a little bit of a strategy question for MLOps vendors. If you look at MLOps vendors, they'll typically have a bunch of capabilities in their system,
Starting point is 00:09:58 which are kind of there as optional, to fill in for gaps that you might have in your data stack. Because if you were missing, say, a data observability capability that's really important for your machine learning model but doesn't matter that much for the rest of the company, it's probably realistic that the MLOps vendor is going to add that in one way or another. And they'll say, hey, you don't have to use this, but it's here in case you need it, which you might for your really important business use case. But anyway, sorry, I interrupted you.
Starting point is 00:10:30 Oh, that was the question. Like, what's the boundary? Yeah, what's the boundary. And also you said something very interesting. You talked about observability, for example. And you might say, okay, let's say in a BI environment with the traditional data stack, maybe people don't care that much about using an observability platform, right?
Starting point is 00:10:51 But if you do ML, probably you need it more, right? Tell us a little bit more about that. I think that for many people it's hard to understand what the differences are. For sure. And I think the most important thing at the end is what are, let's say, the synergies between the two platforms. Well, okay. So I think there are two dimensions of difference. One dimension is, is this a normal data BI type of thing that I'm trying to do, or is this a machine learning kind of thing? Machine learning has some special requirements. But a second
Starting point is 00:11:30 dimension, which is often correlated with that first dimension, is: is this an analytical thing I'm doing, where it's kind of offline, it's an internal use case? So let's imagine it's an analytical machine learning thing. This is a big distinction we make. It could be, hey, my finance team has to forecast how many sales are going to happen next quarter. You run a job, maybe it's doing some machine learning stuff, but if it fails, it's not a big deal. Just press retry and you're good, right? Whereas it could also be an operational thing, a thing that powers your product, your end user experience. And that's a pretty different set
Starting point is 00:12:12 of engineering requirements, right? You might have a lot of users, and so you have to be ready to serve at a crazy level of scale. Or say it's a fraud detection situation where you have to make a decision really fast. Someone swipes a credit card and you have to say, in like 100 milliseconds, or 30 milliseconds, is this acceptable or not acceptable? Or you have some uptime availability requirements, where something really bad is going to happen for some downstream consumer if you're not available at this kind of availability. So that production versus not-production differentiation, I think, is actually a bigger
Starting point is 00:12:58 driver of some of the differences you see between an ML stack and a kind of standard data stack. It's correlated with doing machine learning versus doing BI kind of stuff. And of course there are examples that are contradictory to this. You can have an internal-only ML application, or you can have a production,
Starting point is 00:13:22 embedded analytics use case where you have a dashboard that's updated in real time for your customers, which has nothing to do with machine learning. But in general, that tends to be a pretty correlated distinction. So anyway, the whole point here is that machine learning often comes with these production requirements, the things I listed, and you can probably list a bunch more. But why would you go through all that trouble? It's because these use cases often tend to be pretty valuable to the business.
Starting point is 00:13:54 So the business is like, hey, for me to prevent fraud, that's worth so much money, so I'm going to really invest in it. Whereas the 101st dashboard in the company, incrementally, maybe that's not going to merit a 50-person team or something like that. And so that's why we see different levels of investment, different levels of willingness to do something custom, and often different stacks for those things as well. And so then, coming back to what the boundary is between machine learning or MLOps tools and the data tools, it often becomes a boundary between production data tools and
Starting point is 00:14:36 non-production data tools. But machine learning definitely has its own needs, and we can go through a bunch of these things. If you want to look at an ML platform, what are the things in an ML platform? You have got to train a model, right? So I break it into model stuff and data stuff. On the model side, you've got to train a model, you've got to evaluate the model, you've got to serve the model. And there are pretty good systems for that stuff today; you can go and find really nice open source tooling for that, or you can find a vendor solution that'll do it all in one. And then on the data side, well, first, what is the data? What is the data side in a machine learning use case?
Starting point is 00:15:13 Well, your model takes in data, called features, to make predictions, right? You've taken some data about your users, about your product, whatever. And hopefully those features are up to date, they're fresh, and they're expressive, so you can make high quality predictions. So there's a lot of good information going into your model, so the model can make a good prediction. But that's a hard problem, a hard data engineering problem in and of itself.
Starting point is 00:15:39 So we find that to get a machine learning application into production, it's not just a, let me deploy a model. It's, let me deploy the model and a whole bunch of supporting data pipelines that are often more complicated than your BI pipelines powering a dashboard. And that's a really big, hard part. The data pipeline thing for machine learning is always on the boundary of, is this a data thing or is this a machine learning thing? But it tends to be the hardest part. I'm sure everybody on this podcast has said that the hard part in machine learning is the data and all of that kind of stuff. It's because it is. And that's the layer that we focus on at Tecton, but it's a lesson we first learned when we were building Michelangelo.
Starting point is 00:16:27 First, we started with this model thing, and we were going to all of the different data science teams internally and saying, hey, let's help you out. Let's help you get surge pricing, or whatever, into production. And we would find that,
Starting point is 00:16:40 cool, the model side of things works. And then we would do a bunch of custom data engineering. And then we would go to the next team, and there are, you know, 200 different data science teams internally, and we were doing the same data engineering things again and again. So we centralized that, we automated that in the ML platform, and specifically only ML teams have these needs. We can get into what these specific needs are, but that became the feature store, the feature platform. And so that has become a separate thing from what you have in a traditional data platform. You don't need a lot of the real-time serving and streaming stuff in the exact same way. Yeah, that's super interesting. So in terms of the data and, let's say, the pipelines,
Starting point is 00:17:22 actually the basic principle of a pipeline remains the same, right? Yes, absolutely. You have some data and you go through stages where you transform the data, right? But how is this different in the case of ML? Machine learning versus not ML. Yeah, good question. So I'll call out two big things, and then we can talk about the implications of these differences. Just thinking about this use case, one thing is that I have two consumption use cases for
Starting point is 00:17:53 my data in machine learning. So the data, again, is features. Let's think about what some features are. As an example, say I'm predicting fraud, right? Let's say I have one feature, which is, how large is this transaction that someone's making right now, and how does it compare to an average transaction? And say I've got a bunch of different types of features like that, right? So I need to use that data to build my model, to train my machine learning model. That's consumption scenario one. And I'm doing that in Jupyter, and I'm plugging that into scikit-learn
Starting point is 00:18:31 or PyTorch or whatever. And that's offline. Yep. And then I get a model from that, and then I deploy that model. And now that model needs, and this is consumption case two, it's in production, that same data, that same, how big is this transaction compared to the average transaction?
Starting point is 00:18:53 I need that calculated the exact same way and delivered to that model in production in real time. Right. So that's the inference step. Consumption case one is training, consumption case two is inference. The data needs to be consistent across those, more so than in any other data scenario. If you have a dashboard where you're off by a decimal place, or the format of the number is kind of different or whatever, it's not a big deal if in your prototype it was one way and in production
Starting point is 00:19:25 it's another way. But in machine learning, if there's any difference in that data, then you basically have undefined behavior for the model. Yeah. And then you get this problem, this is the drift problem that people talk about, where you don't know what your model is going to do, and it can affect the behavior in a really bad way. It's also a very hard problem to detect and debug. So anyway, consistency between online and offline is a big problem.
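[Editor's note: a minimal sketch of the pattern Mike is describing, with made-up feature names and values. Writing the feature transformation once and calling that same function from both the offline training path and the online serving path is the simplest way to keep the two consistent.]

```python
import pandas as pd

def txn_to_avg_ratio(amount: float, avg_amount: float) -> float:
    """How large is this transaction compared to the user's average transaction?
    Defined once so training and serving share the exact same logic."""
    return amount / avg_amount if avg_amount else 0.0

# Offline: build the training feature from historical transactions.
history = pd.DataFrame({
    "amount": [12.0, 950.0, 40.0],
    "user_avg_amount": [35.0, 38.0, 36.0],
})
history["txn_to_avg_ratio"] = [
    txn_to_avg_ratio(amt, avg)
    for amt, avg in zip(history["amount"], history["user_avg_amount"])
]

# Online: score a single incoming transaction with the same function.
live_features = {"txn_to_avg_ratio": txn_to_avg_ratio(amount=1200.0, avg_amount=38.0)}
print(history)
print(live_features)
```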
Starting point is 00:19:50 The second problem that is pretty unique to machine learning goes back to that training side of things. So say I have 40 features, right? And we have customers that have 4,000 features for a model. I'm trying to give my model examples of what I knew about a customer or a product or whatever at the time I had to make that prediction in the past. So I don't really care about what that feature's value is today, right now, what I know about the customer today. I care about, when this purchase was made at like 12:31 on Thursday, what was this feature's value at that time? So now imagine I have to do that for every single feature, and then I have to do
Starting point is 00:20:36 that for every single purchase that happened, right? So that's a complicated thing. We can imagine a bunch of different ways to do it, and it's not impossible to figure out, but if you're a data scientist, it's like, okay, that's a whole other data engineering thing I have to do. And you should just have a really clean, nice workflow to make that really easy, because you're doing millions of rows with potentially thousands of columns. And then what's even more tricky here, and I'll say this is challenge number three for these use cases, is that where you're sourcing data from is not a simple story. Typically it's not,
Starting point is 00:21:13 let me just plug into Snowflake and run a query. For example, for these production fraud models, it's often, okay, I'm going to run this query against Snowflake, and that will be, say, what is the zip code?
Starting point is 00:21:23 That's some profile data, some slow-moving data. Then there might be some data that's based on streaming values, like how many times has this user logged in in the past five minutes? And the model can learn that if it's 1,000 times, there's probably something weird, and maybe this is high risk. And then there's another type of feature, which is real time.
Starting point is 00:21:48 It's super real time. It's not streaming, where it's asynchronously calculated but pretty fresh; it's very real time. It's based on the data of the transaction. This transaction is coming in, and I need to do some operation based on the size of the transaction, for example, or the IP of the transaction issuer, let's say. So now you have three different kinds of compute you have to manage. And you have to backfill all of those values through all these points in time in history.
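[Editor's note: a toy illustration of the three flavors of features just described; the names, window sizes, and data are invented. A slow-moving batch feature is expressed as a warehouse query, a streaming-style feature is a count over a recent time window, and a request-time feature can only be computed from the incoming transaction itself.]

```python
from datetime import datetime, timedelta, timezone

# 1. Batch feature: slow-moving profile data, typically a scheduled warehouse query.
ZIP_CODE_SQL = "SELECT user_id, zip_code FROM user_profiles"

# 2. Streaming-style feature: how many times has this user logged in in the past 5 minutes?
def login_count_last_5m(login_events, now):
    window_start = now - timedelta(minutes=5)
    return sum(1 for ts in login_events if window_start <= ts <= now)

# 3. Request-time feature: only known when the transaction arrives.
def ip_is_new(request, known_ips):
    return request["ip"] not in known_ips

now = datetime.now(timezone.utc)
events = [now - timedelta(seconds=s) for s in (5, 40, 70, 400)]
print(login_count_last_5m(events, now))                    # 3 logins inside the window
print(ip_is_new({"ip": "203.0.113.7"}, {"198.51.100.2"}))  # True
```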
Starting point is 00:22:17 So you can see this whole problem just explodes, right? There are all these different dimensions to it. And the point is not to say, hey, you can't figure out how to do one of these things; it's that it's just a really terrible workflow for a data scientist who's just trying to build a fraud model and do their job. And so that's kind of the set of big problems there. Yeah, that's very thought-provoking, actually. There are two things. One, the technical question that I have.
Starting point is 00:22:49 And the other one, which is probably the most important, I'll ask next: it's about the experience that the user has in this case, the ML engineer or the data engineer or whatever. Let's start with the technology. For someone who comes, let's say, from the database systems world, you always know that you cannot have a database system that can do everything. Systems tend to get optimized for specific workloads,
Starting point is 00:23:19 and I can't stop thinking, as you talk about all these things, of the different workloads coming to mind one after the other, right? So my first question is, what kind of data infrastructure, what kind of data system, do you need in order to work with all these different types of data? From time series data, to streaming data, to slow-moving batch data, to graph data. I know that, especially from what I know from banks, when it comes to fraud detection, graph databases are used a lot to find relationships. So taking all these things together...
Starting point is 00:24:11 clarify that tecton's not doing all of that stuff right so we're not saying like hey we are the one system that can be better at each of these things than everybody else so we take an approach of plugging into the best in-class solution. So what we provide for a data scientist, and you're talking about the experience of using it, but we let them write their feature code, their feature engineering code in one place. We provide a really nice workflow for them to register, author, register, share, and manage these features.
Starting point is 00:24:53 But then we plug into, we send that code to the appropriate underlying infrastructure to run that. So this could be a stream processing pipeline. This could be in the real-time case. We actually run like the Python code and whatever in real time to run that efficiently. Or we often just push down SQL queries to Snowflake or we'll kick off a Spark job or something like that. So it's not intended to be like one master data engine that does everything kind of thing, but more like a common hub, a common control center for the data scientists so they can get a control of all of the
Starting point is 00:25:26 different data flows that power their ML application. Does that answer your question? Yeah, 100%. Let's talk more about the experience. How this experience looks like and how we can make it easy for an ML engineer to interact with all these different systems, right? Because each one of them I can, like, just thinking of, like, writing, like, a job for Spark and executing a SQL query on
Starting point is 00:25:54 Snowflake, they're, like, pretty different things, right? And I'm pretty sure, like, ML engineers or, like, data scientists, they prefer, like, to focus on other things, right? So how is, like like the authoring? How do you author a feature? And how we can help them
Starting point is 00:26:09 have a good experience with that? Good question. So when we see what our customers are spending their time on, especially imagine a new use case, like, hey, we're spinning up fraud model number two or something.
Starting point is 00:26:23 A lot of the time, if you just look at the timeline of the project, is spent in figuring out how to connect to something in the first place, and getting that original integration going. And so one of the first parts of the experience is getting that integration out of the way ahead of time. This is where the ML platform team comes in. We work with the ML platform team before the data scientist or the feature engineer, whoever's building the model, even knows anything about the platform. We get all the integrations with the right data sources and stuff like that registered on Tecton. We connect to your warehouse and your streams and your production system and stuff like that. So then that lets us provide an SDK to the data scientist who's now in the mode of, hey, I want to develop a machine learning application.
Starting point is 00:27:14 I need a training data set, right? Yeah. Okay, I want to write a feature that operates on a stream. I want to write a feature that runs in real time that's based on the data my application sends me in real time. I want to write a feature that's whatever, some SQL that runs on Snowflake, for example. Well, now there's one SDK where they can write that code snippet in the exact same way for each of those different types of compute and register it into the centralized feature repository. And so all within your Jupyter notebook,
Starting point is 00:27:46 it's literally just writing a Python function that emits either a SQL query or does an operation on a Pandas data frame or something like that. And you put a little decorator on it, and that tells us, hey, this is a feature view. Pretty straightforward experience. And then you can say featureview.run, and then we'll execute,
Starting point is 00:28:02 and we'll give you the feature values. There's not like a crazy amount of magic there. But then you can take all of these feature views, either refer to them by name, the ones that have already been generated in the feature store that are already there that someone else in your company has made, or the ones you just defined like live in your notebook,
Starting point is 00:28:21 and you can bring them all together in a list and say, hey, give me the historical training data set for this every login attempt that any user did in the past six months i want to backfill what the feature value was for all 400 features yeah so that's where like a lot of complexity comes in how do you what how for each of these feature types how do you figure out what the historical values of the feature was? How do you do it efficiently? How do you join it all together efficiently? And then how do you make it really easy to iterate on that whole thing?
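[Editor's note: a bare-bones version of the point-in-time backfill being described, using a pandas as-of join; table and column names are made up. For each historical event, you take the most recent feature value at or before the event time, never a later one, which would leak the future into training. This is roughly the bookkeeping you would otherwise hand-roll per feature and per training example.]

```python
import pandas as pd

# Label events: every login attempt we want a training row for.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2023-03-01 12:31", "2023-03-05 09:00", "2023-03-02 18:45"]),
    "label_fraud": [0, 1, 0],
})

# Feature values as they changed over time (e.g. a rolling login count).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2023-03-01 00:00", "2023-03-04 00:00", "2023-03-02 00:00"]),
    "logins_7d": [3, 40, 5],
})

# As-of join: for each event, the latest feature value at or before the event time.
training = pd.merge_asof(
    events.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training[["user_id", "event_time", "logins_7d", "label_fraud"]])
```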
Starting point is 00:28:56 That workflow, it would be such an ugly workflow normally. And we're all about making that as smooth as possible for the person prototyping their machine learning application. And then the second part, so that's just all the prototyping stage. And so you just train a model. We give you back a data frame. You just train your model on it. And then once you're happy with your model, you deploy your model.
Starting point is 00:29:18 And normally this is kind of like the main thing that people get stuck on historically. They would say, hey, okay, let's go rebuild all these pipelines in production now. This is the classic throw it over the wall to the engineers in production who rewrite everything. But in the Tekton world, you've already registered your pipelines. You've already registered your features. So they're already productionized. And so there's nothing else to do.
Starting point is 00:29:40 Your model in production just makes a call to tekton and it says hey i need these features in real time and that's already productionized and those values get served in real time it saves the it speeds up this like prototyping stage and the natural in your jupiter notebook but then it also brings the productionization stage the time for that to like zero it's instant because yeah that's not a step in the Tekton workflows. Yeah, it makes a lot of sense. You get what I mean by that? It's just like you don't have to rewrite it, basically.
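[Editor's note: to ground the notebook workflow described above, here is a deliberately simplified, hypothetical sketch of the register-once pattern; it is not the actual Tecton SDK. The idea is that a decorator captures the feature definition in a shared registry, so the code you prototyped is the same code that later gets looked up and run in production instead of being rewritten.]

```python
import pandas as pd

FEATURE_REGISTRY = {}  # stands in for a shared feature repository

def feature_view(fn):
    """Toy decorator: registering the function is what makes it reusable later."""
    FEATURE_REGISTRY[fn.__name__] = fn
    return fn

@feature_view
def txn_amount_vs_avg(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["amount_vs_avg"] = out["amount"] / out["user_avg_amount"]
    return out[["user_id", "amount_vs_avg"]]

# Prototype in a notebook: run the feature view on a sample frame.
sample = pd.DataFrame({"user_id": [1, 2], "amount": [120.0, 8.0], "user_avg_amount": [40.0, 10.0]})
print(FEATURE_REGISTRY["txn_amount_vs_avg"](sample))

# Later, production code can look up the same definition by name instead of rewriting it.
```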
Starting point is 00:30:12 It's just all... Yeah, yeah, 100%. When is the product engineering getting involved in that? Because you have a model, you expose it through a gRPC endpoint or a REST endpoint, whatever. But at some point, this thing needs to be integrated with the product experience, right? So how does this part work? Because we tend to focus a lot on the data side of things, more esoteric stuff with data engineers, ML engineers, and all these things.
Starting point is 00:30:40 But at some point, we also have to integrate with the product itself, right? So how does this work? 100%. And so the same problem that we talked about around like, hey, like, you know, in reality, you spend a lot of your time integrating with different sources. That applies just as much to like figuring out how to connect this thing to some database originally as a data source to figuring out how to connect my ML systems to production, the production, you know, the end application.
Starting point is 00:31:06 And so in the Tekton way of doing things, that's something that's handled by the ML platform team. So in setting up Tekton, your ML platform team is connecting Tekton to your production systems. So then what that does is it makes it easy for the data scientists. Now let's just think about just the flow of building an ML model independent of the platform team. Hopefully your ML platform group is not involved in building an ML model the same way you wouldn't want your data platform team involved in, you know, like every single iteration on a dashboard
Starting point is 00:31:39 or some like, you know, some analytical work. And so in that, every machine learning engineer or data scientist who's iterating they when they productionize you know tecton's already connected to their application it's already it already exists in their production environment so it's just a matter of opening up a new api a new endpoint on the Tekton side that can serve that data. And so we just expose that API and then their application just has to query from Tekton a different set of features or a different alias for a group of features. And so that's why I like that integration step. You know, you still have to do the integration up front, but you don't have to do it in every
Starting point is 00:32:21 single iteration. And that's where the real speed up happens and then the whole point from like a data science manager's perspective is great like my team can iterate so much faster because there's not all this data engineering stuff that has to happen in every single iteration. My data scientists
Starting point is 00:32:38 can affect what's happening in production without going through all of these different steps. Nice and let's talk a little bit about inference now. You said at some point that, okay, we train the model, and now we need to, in a more online nature, start creating features that we're going to fit to make the predictions, right?
Starting point is 00:33:01 And I guess, the latency and throughput requirements are a completely different workload, right? And I guess, not like, I mean, the latency and throughput requirements around that stuff, it's like, again, a completely different workload, right? So how does this work? Let's say I want to build, like, fraud detection, right? It's pretty, I mean, as you said, like, in 30 milliseconds or something like that, you need to make a decision. How does this work? And what's, like, let's say, unique challenges it has, like compared to like more traditional, more esoteric kind of uses of like ML? Yeah, so maybe it's good to start from like a,
Starting point is 00:33:35 the most basic form, more like the analytical ML use case. So let's go back to that example where I'm the data scientist on the finance team and I want to predict sales next quarter. Okay, what's the input data, the input features, I need to make that prediction? Well, they come from my Snowflake, let's say. So for this pipeline, I can issue a query to Snowflake.
Starting point is 00:34:05 Maybe it takes a couple of seconds. I can wait for that data to come back and then run a prediction job or pass it through scikit-learn's inference pipeline. So that's kind of the base case. It's the most simple thing you would do. Now, when you want to go into production, when you want to power your user experience with this kind of thing, typically you've got to go faster than that. You don't want your user waiting around for the page to load while you're figuring something out. So it's common
Starting point is 00:34:37 to have, let's say, a time budget of 100 milliseconds, or something like 50 milliseconds, where you would say, hey, the prediction needs to be 100% ready within 50 milliseconds, because we just have to show the page. We can't wait around for all the ML stuff to happen, right? And that tends to be a real limiter for what kinds of ML we can have in the product. If it's slow, we're just not going to have it.
Starting point is 00:35:03 We're not going to consider having it. Right. So the problem that ML teams often have is, how do we do this cool stuff, and how do we do it quickly? And when we come to the different types of information they want passed into their model, the different features, those can depend on systems that are not that fast. For example, I want to send a query to my data warehouse and I have to wait around for it. So there are different ways to approach that, but the ways to approach it are different depending on the underlying data infrastructure you have to interact with. But a super obvious example is, okay, well, let's run the query ahead of time
Starting point is 00:35:45 and then just like cache the value. Right. And so maybe we run it every day or maybe we run it every 30 minutes or something like that. And so, so just like a very common thing to do is let's pre-compute these values and get them all loaded up, ready to serve really fast. And when you do that, then you have this problem of like, okay, well how fresh is this Yeah. Right. Well, if it happens once a day, then maybe it's, you know, like 18 hours old when I'm serving the value. And so this is a kind of like question that ML production, ML teams, they think about this all the time. Okay.
Starting point is 00:36:15 Well, how do we do this trade off? How do we make it go faster, but not cost too much money? So how do I keep things fresh? But also I don't want to like be constantly just like querying my warehouse and break things. Right. And then you have that type of feature. You have one. So maybe I'm using my streaming data.
Starting point is 00:36:30 And so in there, maybe I'm pre-calculating values as well and caching them. And then there's like features that depend on actual real-time data that's only available when you're making a prediction. Like the example of what is the user's IP address, right? You can't know that ahead of time, so you can't predict that ahead of time. So in that case, you have to compute that feature at prediction time. And so you need that to go really fast.
Starting point is 00:36:55 And so this is another domain where like, each of these things we can talk about and be like, yeah, we can do that. It's not impossible to run a query on a schedule and load it up. But if you're the data scientist, you just really want one thing that will handle all of this stuff for you. And so that's what we do.
Starting point is 00:37:11 We just automate the best practices. We have all the best practices built in. And then the kind of like knobs that, you know, you would really want to tune this stuff to trade off between performance costs, stuff like that. It's kind of all built in there to make it really easy for someone who's building and going to production without having to worry about a lot of the unnecessary data engineering details behind the scenes.
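[Editor's note: a bare-bones sketch of the precompute-and-cache pattern described above; the refresh interval, query, and in-memory store are all illustrative stand-ins. The expensive query runs on a schedule (here, lazily whenever the cache goes stale) and serving reads only the cached values, so the freshness budget becomes an explicit knob you trade off against warehouse load and cost.]

```python
import time

MAX_STALENESS_SECONDS = 30 * 60  # freshness budget: lower = fresher features, more warehouse load

_cache = {"values": None, "loaded_at": 0.0}  # stand-in for an online store such as Redis or DynamoDB

def run_expensive_warehouse_query() -> dict:
    """Placeholder for the slow query you would never run inside a 50 ms request."""
    return {"user-42": {"avg_txn_amount_30d": 38.0}}

def get_precomputed_features(user_id: str) -> dict:
    """Serve from the cache; refresh only when the values are older than the freshness budget."""
    if _cache["values"] is None or time.time() - _cache["loaded_at"] > MAX_STALENESS_SECONDS:
        _cache["values"] = run_expensive_warehouse_query()
        _cache["loaded_at"] = time.time()
    return _cache["values"].get(user_id, {})

print(get_precomputed_features("user-42"))
```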
Starting point is 00:37:35 Yeah, that makes a lot of sense. And in this whole, I would say, lifecycle of prediction, like from getting the data, creating the feature, doing the inference, and serving, like, the user at the end. Which part is usually, like, the most time-consuming? Is, like, the feature creation part? Is the inference itself, like, how long the model takes, like, to do what it has to do?
Starting point is 00:38:01 That usually, like, takes a lot of time? Or depends? You mean the inference pipe? Like like when you're making a prediction, there's a data retrieval. Yes, data retrieval. It can depend. It can really depend. So you could have a piece of feature engineering code that could be quite complicated that has to run in real time.
Starting point is 00:38:23 That's one of these ones where, you know, it's just the reality that you can't have an arbitrarily complex thing run arbitrarily fast in real time at a cost, at a level of cost that is acceptable to you. The challenge tends not to be, once you adopt an architecture like this, speed of serving doesn't tend to be a problem. Actually, it's like, you know we this is what the online feature store is we as long as we can like manage getting fresh values into the online
Starting point is 00:38:51 feature store and we automate all of that and everything the online feature store is really fast and we'll you know we can use different underlying technologies to power that depending on the performance care the characteristics of how often the feature store is updated and what your kind of scale of serving is and your latency needs such that we can optimize cost for the customer. But those are kind of solved problems for the data retrieval. That tends not to be the hard part in the bottleneck for the user experience or getting the whole ML application up and running.
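[Editor's note: a minimal sketch of what the serving path can look like once features are precomputed into a fast online store; the service name, feature names, and scores are hypothetical, and the lookup here is a canned stand-in rather than a real network call. The application asks for a named group of features by join key and passes the resulting vector to the model, keeping the request itself well inside the latency budget.]

```python
def get_online_features(feature_service: str, join_keys: dict) -> dict:
    """Stand-in for a call to an online feature store / feature-serving endpoint.
    In a real deployment this would be a low-latency HTTP or gRPC request."""
    assert feature_service == "fraud_detection_v2"
    return {"logins_5m": 2, "amount_vs_avg": 3.4}

def score_transaction(user_id: str, predict) -> float:
    features = get_online_features("fraud_detection_v2", {"user_id": user_id})
    return predict([features["logins_5m"], features["amount_vs_avg"]])

# Toy "model": any callable mapping a feature vector to a fraud score.
print(score_transaction("user-42", predict=lambda row: 0.1 * row[0] + 0.2 * row[1]))
```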
Starting point is 00:39:22 Does that make sense? It does, it does. Absolutely. Yep. So Brooks, all yours. I can't keep up. I know you got to get to the next thing. It's a conference here and he could keep talking all day.
Starting point is 00:39:40 Yeah. But it's been so fascinating. One last thing I want to ask before we sign off here: I know y'all just launched some new things at Tecton. Can you give us the quick overview of the launch?
Starting point is 00:39:56 Awesome, yeah. We just launched what we call Tecton 0.6, maybe a week or so ago. The big thing there is we have an almost completely redesigned development workflow, so that things are way faster for a data scientist to do their feature engineering. Basically, we aspire to provide our customers the best feature engineering experience in the world, and we have a totally different level of ease of use in the core workflow, the core loop of write a feature, test it, and it's productionized. That's all done in your notebook now. It's a super beautiful, elegant experience. And so I think
Starting point is 00:40:40 people should check that out. And then I think the second thing I'll call out from this launch is that streaming features are pretty important for a lot of types of production ML use cases. This is, let me aggregate over a bunch of events, basically. You might say, hey, I want to count how many times someone tried to log in over the past five minutes, 15 minutes, 15 days, whatever. And we have huge upgrades in what kind of freshness you can get from those types of features in Tecton and the speed that they run at. The benefit of one platform to manage the features is that when there are particular common use cases or types of features that are quite powerful and pretty complicated for people to implement, like a lot of these streaming feature aggregations, we can just build special things to speed people up. And so we've got a little bit of magic in Tecton that makes all of these kinds of
Starting point is 00:41:43 streaming aggregations super easy for people. And we really upgraded that in this launch too. And so we're seeing a lot of our customers love that. So those are the two things I'd call out. Cool. Yeah. Thanks for asking. Yeah. So for all the data scientists that are listening thinking, man, I've got to check this out, where do they go? Go to tecton.ai. Just sign up for a free trial or shoot me an email, mike at tecton.ai, and I'd love to chat with you. Cool. Cool. Well, Mike, thanks so much for your time today.
Starting point is 00:42:12 Listeners, thank you for listening. Check out tecton.ai and subscribe to the show if you haven't yet, and we'll catch you next time. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
Starting point is 00:42:35 That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
