The Data Stack Show - Data Council Week (Ep 5) - The Difference Between Data Platforms and ML Platforms with Michael Del Balso of Tecton
Episode Date: April 26, 2023

Highlights from this week's conversation include:
Michael's journey to co-founding Tecton (0:22)
The evolution of MLOps and platform teams (3:50)
Understanding the boundaries between the data platform and the ML platform (8:42)
Differences in machine learning vs. data pipelines (16:58)
The systems needed to handle all these types of data (22:22)
Developer experience in Tecton (25:15)
Automating challenges in ML development (32:30)
The most difficult part of the life cycle of a prediction (37:24)
Exciting new developments at Tecton (39:27)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
All right, we're back in person at Data Council Austin, and we have got Mike Del Balso. He's the co-founder and CEO of Tecton, and we're super excited to chat with him. I'm Brooks. Again, if you've been following along, you've heard that Eric did not make it to the conference, so I'm filling in. You're stuck with me, but we have Costas here as well. He's extremely excited to chat with Mike.
So, Mike, welcome to the show.
Thanks for having me.
Yeah.
First, could you just tell us about your background and kind of what led to you starting Tecton?
For sure.
So, yeah, I've been in machine learning for a while now.
I actually got into it kind of randomly.
I worked at Google, and I began working on the machine learning teams that power the ads auction.
So I was a product manager for what was probably, at that time, the best productionized ML system in the world, the one that drove all of the ads revenue there.
And that was really cool.
And I did that for a bit. Actually, that team is the one that published a pretty famous foundational MLOps paper called "Machine Learning: The High-Interest Credit Card of Technical Debt," or something like that. I always get the words in the wrong order, but it's a pretty often-cited MLOps paper.

But then I joined Uber, and I joined at a time when there wasn't really a lot of ML happening at Uber. I was kind of tasked with helping start the ML team and bringing things from zero to one. So we created the ML infrastructure platform at Uber, called Michelangelo. I joined in 2015, and over a couple of years, two and a half years let's say, we just had a really good run at it. We built some really good platforms. We went from a handful of models in production to literally tens of thousands of models in production making real-time predictions. It's a huge scale, right? Millions of predictions a second. That powers all of the ETA and fraud detection and pricing stuff that happens at Uber.
And in going through that and building that stack out, we came up with something called the feature store. I published a blog post about Michelangelo a while back, and a lot of people said, hey, we're trying to solve a lot of the same problems you guys are solving in Michelangelo, and one of the most interesting parts of it, and the hardest part for us, is all of the data side: the data pipelines that generate the features that power the models, all of the real-time serving, all of the monitoring for the data, the storage, the online and offline consistency, stuff like that. And so we recognized there was an industry need for this architecture, and that became the beginning of what we call a feature platform today.
So myself and the eng lead of Michelangelo left to start Tecton, and we've been doing that for the past couple of years. At Tecton, we think of ourselves as the company for feature platforms. We see ourselves as that enterprise feature platform company. We sell to Fortune 500s and top tech companies, folks that aren't necessarily a Google or a Facebook, who are for sure going to build it themselves, but everyone else who's trying to do real machine learning in production. We hope we can help them out and provide them the best feature engineering experience in the world.
Cool.
Before we hit record here, you were talking about platform teams
and kind of how companies handle build versus buy.
Do you want to just speak a little more of that?
Yeah, it's been interesting seeing the evolution over the past couple of years, because when we started Michelangelo in 2015, there just wasn't a lot of good ML infrastructure or ML tooling. MLOps, I don't even think that was a term at the time, really. And honestly, the concept of buying just never came up for us. We were never like, oh, maybe there's a product. We just assumed there was no product that did what we wanted to do. And so over time, the industry has grown up, effectively, and the offerings on the market have become more compelling. But at the same time, in parallel, as the vendor solutions have become somewhat compelling, ML platform teams have grown internally in companies. The need for machine learning has grown, and so the willingness of a company to invest in machine learning and a machine learning platform has grown.
And we often see it as a parallel thing: there's a data platform team or data team, and then an ML platform team.
You know, you've probably seen this in a bunch of companies where it's kind of like they're a consumer, they're a customer of the data platform team.
So the data platform will manage the data warehouse or the data lake, stuff like that.
And then the ML platform team will be a specialization of that, managing a lot of the ML infrastructure on top of the core data platform. And so, in a lot of companies, these ML platform teams have grown quite a bit. They've been building all kinds of cool stuff: managed training, managed model serving, drift detection, feature pipeline management,
stuff like that. And I think recently, especially with the economic situation, but even before that a little bit, these teams were ballooning. Sometimes you're getting to 10, 20, 30 people on an ML platform team, and that's expensive. So we've begun seeing a lot of, hey, now that there are all these solutions we can buy, is it really strategic for us to build our own training infrastructure? Should we just buy that? Well, it was strategic before, when it didn't exist and you needed to have machine learning in production. And it's the same kind of thing on a lot of the data side of things, the place where we play.

So we're seeing these ML platform teams have this interesting identity crisis today, where they have to think, okay, I thought I came here to invent ML infrastructure. And now their role is a lot more tied to the use cases. It's a lot more tied to, why am I actually here? My team is building a recommender, or someone in my company ultimately needs to make recommendations or needs to detect fraud. So now it's a lot less of a carte blanche, just build whatever cool tools you want, and a lot more being driven by: what is the actual need from this business use case, and how do I help that end use case team, that business team, map that back to the right tools, the right stack? They have to be the stewards of the right stack. We're seeing that be a real difference in identity and charter for ML platform teams over the past few years.
And obviously people are at different parts in that journey, but I think that's a general trend we're seeing as well.
Yeah, and it seems like it's kind of general, you know, not just ML but across the board with data teams: the business impact is now kind of number one, I would say.
You know, machine learning has felt like this particularly sexy thing, where someone could say, I can just go invent this completely greenfield new stack; there are no best practices or established right way to do things. I don't think this was always a conscious thought for people working on this stuff, but you could see it in people's attitudes: it's a cool place where I can just invent cool tech. And sometimes that was a little bit divorced from what the actual business need is, more so than I saw in, for example, the normal data stack.

So we've been talking about MLOps and platforms.
And I'd like to continue the conversation by asking you to help me understand the boundaries between the data platform and the MLOps or ML platform, right?
Yeah.
Where each one starts, where it stops.
And most importantly, where are the synergies?
Because there are synergies, right? Like, I can't imagine that you have an ML platform somewhere without also having at least some sort of pre-existing data platform, right?

Right. Well, those are the trickiest situations. We love it when people have great underlying data platforms, and unfortunately, that's not always the case.
And so I think this is a little bit of a strategy question for MLOps vendors. If you look at MLOps vendors, they'll typically have a bunch of capabilities in their system which are there as optional fill-ins for gaps you might have in your data stack. Because if you're missing, say, a data observability capability, and it's really important for your machine learning model but doesn't matter that much to the rest of the company, it's pretty realistic that the MLOps vendor is going to add that in one way or another. And they'll say, hey, you don't have to use this, but it's here in case you need it, which you might for your really important business use case.
But anyway, sorry, I interrupted you.
Oh, sorry, that was the question, right? Like, what's the boundary?

Yeah, what's the boundary.
And also you said something very interesting.
You talked about observability, for example. And you might say, okay, let's say in
a BI environment with, let's say, the traditional data stack,
maybe people don't care that much about, like, using an observability platform, right?
But if you do ML, like, probably you need it more, right?
Tell us a little bit more about that.
Like, I think that it's, for many people, it's hard, like, to understand what are the differences.
For sure.
And I think the most important thing at the end is what are, let's say, the synergies between the two platforms.
Well, okay.
So I think there are two dimensions of difference. One dimension is: is this a normal data or BI type of thing I'm trying to do, or is it a machine learning kind of thing? Machine learning has some special requirements. But a second dimension, which is often correlated with the first, is: is this an analytical thing I'm doing, where it's kind of offline and it's an internal use case? So imagine it's an analytical machine learning thing. This is a big distinction we make. It could be, hey, my finance team has to forecast how many sales are going to happen next quarter. You run a job, maybe it's doing some machine learning stuff, but if it fails, it's not a big deal. Just press retry and you're good.
Right. Whereas it could also be an operational thing, a thing that powers your product, your end-user experience, and that's a pretty different set of engineering requirements, right? You might have a lot of users, so you have to be ready to serve at a crazy level of scale. Or it may be, say, a fraud detection situation where you have to make a decision really fast. Someone swipes a credit card, and in like 100 milliseconds, or 30 milliseconds, you have to say: is this acceptable or not acceptable? Or you have uptime requirements, where something really bad is going to happen for some downstream consumer if you're not available at a certain level of availability. So that production versus non-production differentiation, I think, is actually a bigger driver of some of the differences you see between an ML stack and a kind of standard data stack than whether you're doing machine learning or BI kind of stuff.
And of course there are examples that contradict this. You can have an internal-only ML application, or you can have production embedded analytics, where you have a dashboard that's updated in real time for your customers and has nothing to do with machine learning. But in general, it tends to be a pretty correlated distinction. So anyway, the whole point here is that machine learning often comes with these production requirements, the things I just listed, and you could probably list a bunch more. And why would you go through all that trouble?
It's because often these use cases tend to be pretty valuable use cases to the business.
So the business is like, hey, for me to prevent fraud, that's worth so much money for me.
So I'm going to really invest in it.
Whereas the 101st dashboard in the company, incrementally, maybe that's not going to merit a 50-person team or something like that. And so that's why we see different levels of investment, different levels of willingness to do something custom, and often different stacks for those things as well.

So then, coming back to the boundary between machine learning or MLOps tools and the data tools: it often becomes a boundary between production data tools and non-production data tools. But if you want to look at an ML platform, what are the things in an ML platform? I break it into model stuff and data stuff. On the model side, you've got to train a model, you've got to evaluate the model, you've got to serve the model. And there are pretty good systems for that stuff today; you can go find really nice open source tooling for it, or you can find a vendor solution that'll do it all in one.
And then on the data side, well, first, like what is the data?
What is the data side in a machine learning use case?
Well, the data is, you know, what your model takes in. They're called features, and the model uses them to make predictions, right?
You've taken some data about your users, about your product, whatever.
And hopefully they're up to date and they're fresh and stuff like that.
So you can make high quality predictions and they're expressive, stuff like that.
So there's a lot of good information going into your model.
So the model can make a good prediction.
But that's a hard problem, a hard data engineering problem in and of itself.
So we find that to get a machine learning application into production, it's not just "let me deploy a model." It's "let me deploy the model and a whole bunch of supporting data pipelines" that are often more complicated than your BI pipelines powering a dashboard. And that's a really big, hard part. That data pipeline piece for machine learning always sits on the boundary of "is this a data thing or a machine learning thing," but it tends to be the hardest part. I'm sure everybody on this podcast has said the hard part in machine learning is the data, and all of that kind of stuff, and it's because it is. That's actually the layer we focus on at Tecton, but it's a lesson we first learned when we were building Michelangelo.
First, we started with this model thing
and we were going to all of the different
data science teams internally
and we were saying,
hey, let's help you out.
Let's help you get surge pricing into whatever,
into production.
And we would find that there's a bunch of,
cool, the model side of things work.
And then we would do a bunch of custom data engineering.
And then we would go to the next team there's you know 200 data team there are
different teams internally and we were doing the same data engineering things again and again so we
we centralized that we automated that in the ml platform and specifically only ml teams have these
needs we can get into what these specific needs are but that became the feature store the feature platform and so so that has become a separate thing than what you have in a traditional
data platform you don't need a lot of the yeah real-time serving streaming stuff in the exact
same way yeah that's super interesting so in terms of like the data and let's say the pipelines
actually the basic principle of a pipeline remains the same, right?
Yes, absolutely.
You have some data and you go through stages where you transform the data, right?
But how is this different in the case of ML versus non-ML?
Yeah.
Good question.
So I'll call out two big things, and then we can talk about the implications of these differences. One thing is that I have two consumption use cases for my data in machine learning. The data, again, is features. Let's think about what some features are. As an example, say I'm predicting fraud, right? Let's say I have one feature, which is: how large is this transaction that someone's making right now, and how does it compare to an average transaction? And say I've got a bunch of different features like that. So I need to use that data to build my model, to train my machine learning model. That's consumption scenario one. I'm doing that in Jupyter, and I'm plugging it into scikit-learn or PyTorch or whatever, and that's offline.
Yep.
And then I get a model from that.
And then I deploy that model.
And now there's consumption case two: that model is in production. I need that same data, that same "how big is this transaction compared to the average transaction," calculated the exact same way and delivered to that model in production, in real time, right? That's the inference step. So consumption case one is training, consumption case two is inference.
The data needs to be consistent across those.
And it's more so than in any other data scenario.
So if you have a dashboard where you're off by a decimal place or like a format of the number is kind of different or whatever,
it's not a big deal if in your prototype it was one way and in production
it's another way.
But in machine learning, if there's any difference in that data, then you basically have an undefined
behavior for the model.
Yeah.
And then you get this problem, this is the drift problem people talk about, where you don't know what your model is going to do, and it can affect the behavior in a really bad way. It's also a very hard problem to detect and debug. So anyway, that's the first thing: consistency between online and offline is a big problem.
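To make that consistency point concrete, here is a minimal, hypothetical sketch (not Tecton's SDK; the function and column names are made up) of the idea that the same feature logic should produce both the training data and the real-time inference value:

```python
# A minimal sketch of keeping online and offline feature logic consistent:
# define the transformation once and reuse it for training-set generation
# and for real-time inference. Names here are illustrative only.
import pandas as pd

def txn_size_vs_avg(txn_amount: float, user_avg_amount: float) -> float:
    """Feature: how large this transaction is relative to the user's average."""
    return txn_amount / user_avg_amount if user_avg_amount else 0.0

# Offline (training): apply the same function over historical rows.
history = pd.DataFrame({
    "txn_amount": [120.0, 15.0, 980.0],
    "user_avg_amount": [100.0, 20.0, 50.0],
})
history["txn_size_vs_avg"] = history.apply(
    lambda r: txn_size_vs_avg(r.txn_amount, r.user_avg_amount), axis=1
)

# Online (inference): call the exact same function on the live request,
# so the value the model sees matches what it was trained on.
live_feature = txn_size_vs_avg(txn_amount=430.0, user_avg_amount=55.0)
```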
The second problem that is pretty unique to machine learning is going back to that training
side of things.
So say I have, you know, 40 features, right?
And then we have customers that have 4,000 features for a model.
I'm trying to give my model examples of what I knew about a customer or a product or whatever at the time I had to make that prediction in the past. So I don't really care what that feature's value is today, right now, or what I know about the customer today. I care about: when this purchase was made at, like, 12:31 on Thursday, what was this feature's value at that time? Now imagine I have to do that for every single feature, and then I have to do that for every single purchase that happened, right? That's a complicated thing. We can imagine a bunch of different ways to do it, and it's not impossible to figure out, but if you're a data scientist, it's like, okay, that's a whole other data engineering thing I have to do. You should just have a really clean, nice workflow that makes that easy, because you're doing millions of rows with potentially thousands of columns.
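As a rough illustration of that point-in-time lookup, here is a hedged sketch using pandas' merge_asof; the column names are hypothetical, and a real feature platform would do this at far larger scale:

```python
# Sketch of a point-in-time ("as of") join: for each historical prediction
# event, pick the latest feature value known *before* that event, never a
# value from the future. Column names are hypothetical.
import pandas as pd

events = pd.DataFrame({                      # when predictions had to be made
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(
        ["2023-03-02 12:31", "2023-03-09 08:00", "2023-03-05 17:45"]),
}).sort_values("event_time")

feature_history = pd.DataFrame({             # how the feature evolved over time
    "user_id": [1, 1, 2, 2],
    "feature_time": pd.to_datetime(
        ["2023-03-01", "2023-03-08", "2023-03-01", "2023-03-06"]),
    "avg_txn_size": [45.0, 52.0, 110.0, 95.0],
}).sort_values("feature_time")

training_rows = pd.merge_asof(
    events, feature_history,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",      # only look backward in time
)
print(training_rows[["user_id", "event_time", "avg_txn_size"]])
```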
And then what's even more tricky here, I'll say this is challenge number three for these use cases, is that where you're sourcing data from is not a simple story. Typically it's not just "let me plug into Snowflake and run a query." For example, for these production fraud models, it's often: okay, I'm going to run this query against Snowflake, and that will give me, say, what is the zip code.
We expect some profile data, some slow-moving data.
Then there might be some data that's based on streaming values,
like how many times has this user logged in the past five minutes?
And the model can learn if it's 1,000 times,
it's probably something weird, and maybe this is high risk.
And then there's another type of feature, which is truly real time. It's not streaming, where it's asynchronously calculated but pretty fresh; it's based on the data of the transaction itself. This transaction is coming in, and I need to do some operation based on, for example, the size of the transaction or the IP of the transaction issuer, let's say.
So now you have three different kinds of compute you have to manage.
And you have to backfill all of those values through all these points in time and history.
So you can see this whole problem just like explodes, right?
It's like all these different dimensions of this problem.
And so the point is not to say, hey, you can't figure out how to do any one of these things. It's that it's just a really terrible workflow for a data scientist who's just trying to build a fraud model and do their job. And so that's kind of the set of big problems.

Yeah, that's very thought-provoking, actually.
There are two things.
One is the technical question that I have. And the other one, which is probably the most important and which I'll ask next, is about the experience that the user has in this case: the ML engineer or the data engineer or whoever.
Let's start with the technology. For someone who comes, let's say, from the database systems world, you always know that you cannot have a database system that can do everything; systems tend to get optimized for specific workloads. And as you talk about all these things, I can't stop thinking about the different workloads coming one after the other in my mind. So my first question is: what kind of data infrastructure, what kind of data system, do you need in order to work with all these different types of data, right? From time series data, to streaming data, to slow-moving batch data, to graph data.
I know that, especially from what I know from banks, when it comes to fraud detection, graph databases are used a lot to find relationships. So taking all these things together... that's a lot, right? It's crazy. How do you even keep this thing consistent?

Well, yeah. Maybe it would be good to clarify that Tecton's not doing all of that stuff itself, right? We're not saying, hey, we are the one system that can be better at each of these things than everybody else. We take the approach of plugging into the best-in-class solution.
So what we provide for a data scientist,
and you're talking about the experience of using it,
but we let them write their feature code,
their feature engineering code in one place.
We provide a really nice workflow for them to author, register, share, and manage these features.
But then we plug into, we send that code to the appropriate underlying infrastructure to run that.
So this could be a stream processing pipeline.
Or it could be the real-time case, where we actually run the Python code itself in real time, efficiently. Or we often just push down SQL queries to Snowflake,
or we'll kick off a Spark job or something like that.
So it's not intended to be like one master data engine
that does everything kind of thing,
but more like a common hub, a common control center
for the data scientists so they can get a control of all of the
different data flows that power their ML application. Does that answer your question?
Yeah, 100%. Let's talk more about the experience.
What does this experience look like, and how can we make it easy for an ML engineer to interact with all these different systems? Because each one of them is pretty different. Just thinking of writing a job for Spark versus executing a SQL query on Snowflake, those are pretty different things, right? And I'm pretty sure ML engineers or data scientists prefer to focus on other things, right?
So how does the authoring work? How do you author a feature, and how can we help them have a good experience with that?
Good question.
So I think, you know,
when we see what our customers
are spending their time on,
especially imagine like a new use case,
like, hey, we're spinning up
fraud model number two or something.
A lot of the time, if you just look at the timeline of the project, is spent figuring out how to connect to something in the first place.
Yeah.
And getting that original integration going.
And so one of the first parts of kind of the experience is getting that integration out of the way ahead of time.
So, you know, this is where the ML platform team comes in. We work with the ML platform team before the data scientist or feature engineer, whoever's building the model, even knows anything about the platform. We get all the integrations with the right data sources and so on registered in Tecton. We connect to your warehouse and your streams and your production system and stuff like that. So then that lets us provide an SDK to the data scientist who's now in the mode of, hey, I want to develop a machine learning application.
I need a training data set, right?
Yeah.
Okay, I want to write a feature that operates on a stream.
I want to write a feature that runs in real time that's based on the data my application
sends me in real time. I want to write a feature that's whatever, some SQL that runs on Snowflake,
for example. Well, now there's one SDK where they can write that code snippet in the exact same way
for each of those different types of compute and register it into the centralized feature
repository. And so all within your Jupyter notebook,
it's literally just writing a Python function
that emits either a SQL query or does an operation
on a Pandas data frame or something like that.
And you put a little decorator on it,
and that tells us, hey, this is a feature view.
Pretty straightforward experience.
And then you can say featureview.run,
and then we'll execute,
and we'll give you the feature values.
There's not like a crazy amount of magic there.
But then you can take all of these feature views,
either refer to them by name,
the ones that have already been generated
in the feature store that are already there
that someone else in your company has made,
or the ones you just defined like live in your notebook,
and you can bring them all together in a list and say,
hey, give me the historical training data set for this: for every login attempt that any user made in the past six months, I want to backfill what the feature value was for all 400 features. And that's where a lot of the complexity comes in: for each of these feature types, how do you figure out what the historical values of the feature were?
How do you do it efficiently?
How do you join it all together efficiently?
And then how do you make it really easy to iterate on that whole thing?
That workflow, it would be such an ugly workflow normally.
And we're all about making that as smooth as possible for the person prototyping their machine learning application.
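To give a feel for that prototyping flow, here is a hypothetical, Tecton-inspired sketch; the decorator, registry, and function names are invented for illustration and are not the exact Tecton API:

```python
# Hypothetical sketch of a decorator-based feature definition workflow
# (illustrative only; not the exact Tecton API).
import pandas as pd

REGISTRY = {}  # stands in for a centralized feature repository

def feature_view(fn):
    """Register a Python function as a named feature view."""
    REGISTRY[fn.__name__] = fn
    return fn

@feature_view
def txn_size_vs_avg(transactions: pd.DataFrame) -> pd.DataFrame:
    """Batch-style feature: transaction size relative to the user's average."""
    out = transactions.copy()
    out["txn_size_vs_avg"] = out["amount"] / out["user_avg_amount"]
    return out[["user_id", "event_time", "txn_size_vs_avg"]]

# "featureview.run"-style local execution while prototyping in a notebook:
sample = pd.DataFrame({
    "user_id": [1, 2],
    "event_time": pd.to_datetime(["2023-03-02", "2023-03-05"]),
    "amount": [120.0, 980.0],
    "user_avg_amount": [100.0, 50.0],
})
print(REGISTRY["txn_size_vs_avg"](sample))
```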
And then the second part, so that's just all the prototyping stage.
And so you just train a model.
We give you back a data frame.
You just train your model on it.
And then once you're happy with your model,
you deploy your model.
And normally this is kind of like the main thing that people get stuck on historically.
They would say, hey, okay,
let's go rebuild all these pipelines in production now.
This is the classic throw it over the wall to the engineers in production who rewrite everything.
But in the Tekton world, you've already registered your pipelines.
You've already registered your features.
So they're already productionized.
And so there's nothing else to do.
Your model in production just makes a call to Tecton and says, hey, I need these features in real time, and that's already productionized and those values get served in real time. So it speeds up the prototyping stage, which happens naturally in your Jupyter notebook, but it also brings the time for the productionization stage down to basically zero. It's instant, because that's not a step in the Tecton workflow.
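To illustrate what that inference-time call might look like, here is a hedged, hypothetical sketch; the online store, `get_online_features`, and `predict_fraud` are all invented for illustration, not a real client API:

```python
# Hypothetical inference-path sketch: the production service fetches
# precomputed feature values from an online store by entity key, then
# combines them with request-time signals before scoring.
ONLINE_STORE = {  # stands in for a low-latency key-value feature store
    ("user", 42): {"txn_size_vs_avg": 2.3, "logins_last_5m": 1},
}

def get_online_features(entity: tuple) -> dict:
    return ONLINE_STORE.get(entity, {})

def predict_fraud(score_fn, user_id: int, request_features: dict) -> float:
    # Combine precomputed features with truly real-time request features.
    features = {**get_online_features(("user", user_id)), **request_features}
    return score_fn(features)

# Example usage with a stand-in scoring function:
risk = predict_fraud(lambda f: min(1.0, f["txn_size_vs_avg"] / 10),
                     user_id=42, request_features={"txn_amount": 430.0})
print(risk)
```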
Yeah, it makes a lot of sense.
You get what I mean by that?
It's just like you don't have to rewrite it, basically.
It's just all...
Yeah, yeah, 100%.
When is the product engineering getting involved in that?
Because you have a model, you expose it through a gRPC endpoint or a REST endpoint, whatever.
But at some point, this thing needs to be integrated with the product experience, right?
So how does this part work?
Because we tend to focus a lot on the data side of things,
more esoteric stuff with data engineers, ML engineers, and all these things.
But at some point, we also have to integrate with the product itself, right?
So how does this work?
100%.
And so the same problem we talked about, where in reality you spend a lot of your time integrating with different sources, applies just as much to figuring out how to connect my ML systems to production, to the end application, as it does to figuring out how to connect to some database as a data source in the first place.
And so in the Tecton way of doing things, that's something that's handled by the ML platform team.
So in setting up Tecton, your ML platform team is connecting Tecton to your production systems.
So then what that does is it makes it easy for the data scientists.
Now let's just think about just the flow of building an ML model
independent of the platform team.
Hopefully your ML platform group is not involved in building an ML model
the same way you wouldn't want your data platform team involved in,
you know, like every single iteration on a dashboard
or some like, you know, some analytical work.
And so for every machine learning engineer or data scientist who's iterating, when they productionize, Tecton's already connected to their application; it already exists in their production environment. So it's just a matter of opening up a new API, a new endpoint, on the Tecton side that can serve that data.
And so we just expose that API and then their application just has to query from Tecton a different set of features or a different alias for a group of features.
And so that's why, with that integration step, you still have to do the integration up front, but you don't have to do it in every single iteration.
And that's where the real speedup happens. And then the whole point, from a data science manager's perspective, is: great, my team can iterate so much faster, because there's not all this data engineering stuff that has to happen in every single iteration. My data scientists can affect what's happening in production without going through all of these different steps.
Nice. And let's talk a little bit about inference now. You said at some point that, okay, we train the model, and now we need to, in a more online fashion, start creating the features that we're going to feed in to make the predictions, right? And I guess the latency and throughput requirements around that are, again, a completely different workload, right? So how does this work? Let's say I want to build fraud detection; as you said, in 30 milliseconds or something like that, you need to make a decision. How does this work, and what unique challenges does it have compared to more traditional, more esoteric kinds of ML?
Yeah, so maybe it's good to start from the most basic form, more like the analytical ML use case.
So, you know, let's go back to that example
where I'm the finance team, you know,
I'm the data scientist on the finance team and I want to, you know, predict sales next quarter.
Well, OK, what do I need to what's the input data, the input features I need to make that prediction?
Well, they come from my Snowflake, let's say, right? So for this pipeline, I can issue a query to Snowflake.
Maybe it takes a couple seconds.
I can wait for that data to come back and then run a prediction job
or pass it through Scikit-Learn's inference pipeline.
So that's kind of like the base case.
It's the most simple thing you would do.
Now, when you want to go into production and power your user experience with this kind of thing, typically you've got to go faster than that. You don't want your user waiting around for the page to load while you're figuring something out. So it's common to have, let's say, a time budget of 100 milliseconds, or 50 milliseconds, where you say, hey, the prediction needs to be 100% ready within 50 milliseconds because we've just got to show the page. We can't wait around for all the ML stuff to happen, right?
And that tends to be a real limiter for what kinds of ML can we do?
If, you know, like our product, what kind of ML can we have in the product?
Well, if it's slow, we're just not going to have it.
We're not going to consider having it. Right. So the problem that ML teams often have is how do we do this cool
stuff? How do we do it quickly? And when we come to, you know, the different types of information
that they want passed in their model, the different features, you know, they can depend on
systems that are not that fast. For example, I want to send a query to my data warehouse and I have to wait around for it, right? So there are different ways to approach that, but the ways to approach it differ depending on the underlying data infrastructure you have to interact with.
But a super obvious example is, okay, let's run the query ahead of time and just cache the value, right? Maybe we run it every day, or every 30 minutes, or something like that. A very common thing to do is: let's pre-compute these values and get them all loaded up, ready to serve really fast. And when you do that, you have this problem of, okay, well, how fresh is this data? Well, if it happens once a day, then maybe it's, you know, 18 hours old when I'm
serving the value.
And so this is the kind of question that production ML teams think about all the time.
Okay.
Well, how do we do this trade off?
How do we make it go faster, but not cost too much money?
So how do I keep things fresh?
But also I don't want to like be constantly just like querying my warehouse and break things.
Right.
So you have that type of feature. Then there's another one: maybe I'm using my streaming data.
And so in there, maybe I'm pre-calculating values as well and caching them.
And then there's like features that depend on actual real-time data that's only available when you're making a prediction.
Like the example of what is the user's IP address, right?
You can't know that ahead of time,
so you can't predict that ahead of time.
So in that case, you have to compute that feature
at prediction time.
And so you need that to go really fast.
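To make these three patterns concrete, here is a small, hypothetical sketch of scheduled pre-computation, a streaming sliding-window count, and a request-time feature; the structure and names are illustrative, not any particular product's API:

```python
# Illustrative sketch of the three feature-freshness patterns discussed above.
import time
from collections import deque

ONLINE_STORE = {}

# 1) Batch: precompute on a schedule (e.g. daily) and cache in the online store.
def materialize_batch_features(rows):
    for user_id, avg_txn in rows:
        ONLINE_STORE[("avg_txn_size", user_id)] = avg_txn

# 2) Streaming: keep a sliding-window count updated as login events arrive.
login_events = {}

def record_login(user_id, ts):
    window = login_events.setdefault(user_id, deque())
    window.append(ts)
    cutoff = ts - 5 * 60                      # 5-minute window
    while window and window[0] < cutoff:
        window.popleft()
    ONLINE_STORE[("logins_last_5m", user_id)] = len(window)

# 3) Request-time: computed only when the prediction request arrives.
def request_time_features(request):
    avg = ONLINE_STORE.get(("avg_txn_size", request["user_id"]), 1.0)
    return {"txn_size_vs_avg": request["amount"] / avg}

materialize_batch_features([(42, 55.0)])
record_login(42, time.time())
print(request_time_features({"user_id": 42, "amount": 430.0}))
```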
And so this is another domain where, for each of these things, we could talk about it and say, yeah, we can do that.
It's not impossible to run a query on a schedule
and load it up.
But if you're the data scientist, you just really want one thing that will handle all
of this stuff for you.
And so that's what we do.
We just automate the best practices.
We have all the best practices built in.
And then the kinds of knobs that you would really want to tune, to trade off between performance and cost, stuff like that, are all built in there to make it really easy
for someone who's building and going to production
without having to worry about a lot of the unnecessary
data engineering details behind the scenes.
Yeah, that makes a lot of sense.
And in this whole lifecycle of a prediction, from getting the data, to creating the features, to doing the inference and serving the user at the end, which part is usually the most time-consuming? Is it the feature creation part? Is it the inference itself, how long the model takes to do what it has to do? Or does it depend?
You mean the inference pipeline? Like, when you're making a prediction, there's a data retrieval step?
Yes, data retrieval.
It can depend. It can really depend.
So you could have a piece of feature engineering code
that could be quite complicated that has to run in real time.
That's one of these ones where, you know,
it's just the reality that you can't have an arbitrarily complex thing
run arbitrarily fast in real time at a cost,
at a level of cost that is acceptable to you.
Once you adopt an architecture like this, speed of serving doesn't tend to be the problem. This is what the online feature store is for: as long as we can manage getting fresh values into the online feature store, and we automate all of that, the online feature store is really fast. We can use different underlying technologies to power it depending on the performance characteristics, how often the feature store is updated, what your scale of serving is, and your latency needs, such that we can optimize cost for the customer. But those are kind of solved problems for the data retrieval. That tends not to be the hard part or the bottleneck for the user experience, or for getting the whole ML application up and running.
Does that make sense?
It does, it does. Absolutely.
Yep.
So Brooks,
all yours.
I can't keep up.
I know you got to get to the next thing.
It's a conference here and he could keep talking all day.
Yeah.
But it's been so fascinating.
One last thing I want to ask before we sign off here: I know y'all just launched some new things at Tecton. Can you give us a quick overview of the launch?

Awesome, yeah. We just launched what we call Tecton 0.6, maybe a week or so ago. The big thing there is we have an almost completely redesigned development workflow, so that things are way faster for a data scientist to do their feature engineering.
Basically, we aspire to provide our customers the best feature engineering experience in the world, and we now have a totally different level of ease of use in the core workflow, the core loop of writing a feature, testing it, and having it productionized. That's all done in your notebook now. It's a super beautiful, elegant experience, and I think people should check it out. The second thing I'll call out from this launch: one of the things we see quite a bit is that streaming features are pretty important for a lot of types of production ML use cases.
This is, let me aggregate over a bunch of events, basically. You might say, hey, I want to count how many times someone tried to log in over the past five minutes, 15 minutes, 15 days, whatever.
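As a toy illustration of that kind of multi-window aggregation (purely a sketch of the concept, not how Tecton implements it, and all names are hypothetical):

```python
# Toy sketch: count login attempts per user over several trailing windows.
import pandas as pd

logins = pd.DataFrame({
    "user_id": [7, 7, 7, 9],
    "ts": pd.to_datetime(["2023-04-01 10:00", "2023-04-01 10:03",
                          "2023-04-01 10:14", "2023-04-01 09:50"]),
})

def trailing_counts(df: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    out = {}
    for label, delta in {"5m": "5min", "15m": "15min", "15d": "15D"}.items():
        cutoff = now - pd.Timedelta(delta)
        # Count events per user inside each trailing window.
        out[f"logins_last_{label}"] = df[df["ts"] > cutoff].groupby("user_id").size()
    return pd.DataFrame(out).fillna(0).astype(int)

print(trailing_counts(logins, pd.Timestamp("2023-04-01 10:15")))
```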
And we have huge upgrades in what kind of freshness you can get from those types of features in Tecton, and the speed that they run at. One of the benefits of having one platform to manage the features is that when there are particular common use cases or types of features that are quite powerful and pretty complicated for people to implement, like a lot of these streaming feature aggregations, we can just build special things to speed people up. So we've got a little bit of magic in Tecton that makes all of these kinds of streaming aggregations super easy for people, and we really upgraded that in this launch too. We're seeing a lot of our customers love that. So those are the two things I'd call out.

Cool. Yeah, thanks for asking.
Yeah. So for all the data scientists listening who are thinking, man, I've got to check this out, where do they go?

Go to tecton.ai. Just sign up for a free trial, or shoot me an email, mike at tecton.ai, and I'd love to chat with you.
Cool.
Cool.
Well, Mike, thanks so much for your time today.
Listeners, thank you for listening.
Check out tecton.ai and subscribe to the show if you haven't yet, and we'll catch you next
time.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.