The Data Stack Show - Data Council Week (Ep 2) - The Convergence of MLops and DataOps With Team Featureform
Episode Date: April 24, 2023

Highlights from this week's conversation include:
- Introducing the team from Featureform (0:31)
- Doing the work vs. leading the work (3:01)
- The difference between MLOps and DataOps (7:06)
- The MLOps cycle (10:12)
- What is Featureform and what makes it different? (13:30)
- Is there another layer needed in feature stores? (18:46)
- Getting in touch with Featureform (23:55)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
All right, what's up everybody? This is Brooks.
I am usually behind the scenes on the show running things for Eric and Kostas, but we are live at
Data Council this week. And Eric, unfortunately, he's fine, but got in a biking accident and wasn't
able to make it. So I am filling in for Eric, and we have an awesome group of folks here.
We have Shabnam, Makiko, and Simba from Featureform.
And we are super excited to chat with you guys
here live in person at Data Council.
So would love just quickly to go around.
Shabnam, we can start with you.
But we'd love to just go around
and hear a little bit about your background.
Yeah, I guess I have an interesting background in tech. Well, I'm currently COO at Featureform,
but got my first job in tech working at Slack. I joined them in 2015 when they were still pretty
small, working in global biz ops. I then decided to blow up my career and pivot to software
engineering and did that for a couple
years before meeting the wonderful Simba, and here I am at Featureform.
Hey, yeah, so I joined Featureform last October as head of MLOps, so my focus is entirely on, number one, helping users, specifically data scientists and ML platform engineers, develop and deploy the platforms and systems to make sure that production ML is going well for them.
A lot of times that means working with everyone from early-stage startups to...
We have some pretty large international enterprise users at this point. But prior to joining Featureform, I was actually an ML platform engineer over at Mailchimp, which was acquired by Intuit.
And I've held various roles either as a data scientist or as a platform engineer, even
as an analytics engineer way before the term came into vogue.
So really working with folks up and down the data science machine learning value chain.
Cool.
So you're up.
Yeah.
And they both didn't mention it,
but they're both bootcampers.
We just realized right now,
a lot of our execs came out of bootcamps.
So yeah, myself, I am maybe a little more boring in that I went to UC Santa Cruz for CS. When I was at Google, for not very long, I immediately was like, is this what I'm supposed to do for 40 years? And I realized that's not the life I wanted for myself, and left to start my first company without any idea, like, no product. And that was my last company, so I ended up being modestly successful. We were handling over 100 million MAU at our peak, doing personalization, predicting subscription, predicting churn. And we built all this, what we'd now call MLOps, in-house. And one thing we built, which nowadays we call a feature store, was so pivotal to us, and I realized that there was such an opportunity there. So that kind of became the foundation of what is now Featureform.
Cool. What's it been like going from kind of doing the work to helping others do the work?
Has there been any like surprises there?
Yeah.
And the irony, of course, was that last year, especially with the acquisition of Mailchimp, that was like the third or fourth acquisition I had gone through. Not at that company, but just across my career.
So I had been kind of figuring out what my options were.
I knew I wanted to continue to build stuff.
I knew I wanted to have an impact.
Something that I was a little bit frustrated with, and that I sympathize with a lot of our enterprise users about, was the feeling of your work not mattering, of having no impact on either the overall architecture or the direction of your company. So I was thinking, wait, do I go join a startup? Do I go work as a consultant for Google Cloud or AWS? And I was famous for saying I am never, ever going to go work for an MLOps company. I'm certainly not going to become the feature store girl. And, you know, I had to eat my words later on. But it's been really
delightful. I think sometimes in like the MLOps or the DataOps space, there's a little bit of this love-hate relationship with vendors.
And I 100% understand that now, from both sides of the table.
But the reason why I joined Featureform was because I thought it was such a cool project.
We were trying to implement really similar stuff over at MailChimp and trying to figure out like, how do we actually make data
scientists more successful without like mandating this really constrictive, like single path to
production. And I felt like a lot of the solutions out there, you know, for better or worse, we're
not quite like meeting the gap that feature form sort of fills. And so for me, it was like, okay, I'm going to go,
I'm going to leave MailChimp to go join a project that I really want to like
help see and grow, not just on its own,
but also in creating with other like MLOps projects and like open source
communities and vendors.
Cause I feel like we need that unification, frankly.
Can you unpack the love-hate relationship, and kind of speak to it maybe from one side, like, why is there the hate? And then, maybe now that you're on the vendor side, how has your perspective changed?
Yeah, absolutely. So as a practitioner, I know, so I am a boot camp grad, right?
The thing that I think is really cool about Featureform that I only realized recently is that all of us are UC grads, University of California grads. So we all went to public university, which is fantastic. We also went to the surfer schools, oddly enough, like UC Santa Cruz, where a bunch of folks went. I went to UC San Diego. But we were all kind of bootstrappy, you know. And as a practitioner, I felt like a lot of times I had to kind of piece together my own stack. And frankly, a lot of vendors or projects or open source tools weren't really helping with that. They were really good at just this one thing, but then they weren't
thinking about the broader workflow that I was getting involved in as a data scientist.
And then when I became a platform engineer, I'm like, oh, actually, this is very hard and very, like, this can be very complicated.
But as a platform engineer, to a certain degree, you have to be informed and be opinionated.
And you really have to understand the data science user.
And hopefully that's kind of the perspective and empathy I can bring to Featureform.
We already have a culture of empathy and user centricity.
I think that is very well supported by Shabnam and Simba, especially since Shabnam came from Slack, where they were all about that user-centric experience.
But I hope I just bring more of that data scientist flavor to it and the awareness of their pain points, for sure.
Yeah, that's really cool.
I guess so.
So, all right.
Let's start with, I have a question about MLOps.
Okay, and I want to ask you, what's the difference between MLOps and DataOps?
And I'm asking this because I think we've mentioned both terms so far. So what's the difference? Why do we need both of them?
Yeah, I like to frame the difference not between DataOps and MLOps, because that's kind of what the vendors have called themselves. I think that there is a difference between metrics and features. And a metric is something, you know it's a metric if you're using it in a spreadsheet, if you're using it in slides, if you're using it in a BI tool. Those are metrics.
And metrics have the characteristics that they're typically, there's a correct answer,
right?
Like there is your MRR last month.
That is like, you might not know what it is and you might have the wrong number, but there
is somewhere in the space of possible numbers, the right number.
They're typically relatively slow moving compared to features. You're not experimenting to try to find the right goal. You're not like, oh, I wonder if I could just frame MRR this way. There's kind of a right metric. So the problem space is very different. And the tooling, like if you think of a dbt or something, it's all about templating. It's purposely putting guardrails in place while also making it really easy to make forward momentum. If you think about features, on the other hand, features have the characteristic that they're used by models, and so they're used in both training and perhaps in inference, depending on where the model is being used. When you're doing feature engineering, iteration is much more random, for lack of a better word.
Like you're trying all kinds of different things.
There's no like straight line.
You'll do weird transformations like, hey, what's my MRR, but cut all users that pay us less than $1,000 a month.
Why? Because it makes our model better.
Why does it make our model better? Who knows?
It just does. And so there's a lot more movement and iteration. The characteristics you'd like are different, and the use case of the person, like a data scientist doing ML, they inherently think differently, and their problem set is different. You want a whole different application layer than you would want for what we call DataOps tools now. I also think that, in general, if you look at, let's say, an orchestrator, and you call it a DataOps tool, I think in practice it's not necessarily cleanly a DataOps tool. I think there's almost this whole topology that we haven't figured out yet. Some tools will kind of cross the chasm of it, but they won't really be called DataOps tools. There's almost an analytics ops that's missing, and there's MLOps and feature ops, which we're kind of a part of, and there's almost this generic thing that lives underneath. So I just think we don't have the topology down.
I also think there's a bit of Conway's Law. Like, if we go forward five, six years from now, there will be some number of tools and startups that still exist. Those tools and startups will be the topology. Was it the best one? Doesn't matter. It's like, what's left, you know?
Yeah, yeah.
Makes a lot of sense.
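Simba's metric-versus-feature distinction can be made concrete with a small sketch. The subscription data and the $1,000 cutoff are hypothetical, echoing the MRR example above; this is an illustration, not anyone's production code.

```python
# Hypothetical subscription data: (user_id, monthly_payment_usd).
subscriptions = [("a", 50), ("b", 2000), ("c", 500), ("d", 1500)]

# A metric: MRR has one correct answer and lands in spreadsheets, slides, BI.
mrr = sum(amount for _, amount in subscriptions)

# A feature: an experimental variant of the same quantity, consumed by a model.
# "What's my MRR, but cut all users that pay us less than $1,000 a month?"
# Why this cutoff? Only because it makes the model better.
mrr_over_1k = sum(amount for _, amount in subscriptions if amount >= 1000)

print(mrr, mrr_over_1k)  # 4050 3500
```

The point is that the second number has no "correct" value; it is one of many arbitrary transformations a data scientist might try during feature engineering.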
And what's in the MLOps product space, right? Like, we have obviously feature stores here, which we are going to talk about, but what else is there that is unique to MLOps, right, and not a crossover between DataOps and MLOps?
So I'm going to badly paraphrase Makiko's great talk, which I'm sure we're going to post soon.
But there's four stages of the ML lifecycle. There's data, let's just call it data. There's experimentation, or sorry, data analysis. There's feature engineering. There's feature serving. That's all kind of one stage you'll be in, and that's the data stage.
The next stage we would call the training stage where you're very much training the models.
You are hyperparameter tuning.
You're really kind of experimenting and iterating on the model itself.
Once you have a trained model, you get to the deployment phase. And this is where, cool, I have a trained model I need to put in production. You
might do canary tests. You might do like kind of all these things we're used to doing for services.
Finally, it's in production, like the fourth stage and final stage. It's not final. I guess
you kind of go back and forth. But the fourth stage is evaluation. And in the evaluation stage, you're taking all the information on how the model is doing.
And you might have to go back into other parts of the process, make the model better.
You're constantly iterating.
It's a constant cycle.
The model is never perfect.
It might be good enough, but those things always exist.
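The four-stage cycle Simba describes (data, training, deployment, evaluation, and back again) can be sketched as a loop. This is a toy illustration with made-up stand-in functions, not Featureform code; the "model" is deliberately trivial so the cycle itself is visible.

```python
import random

random.seed(7)

# Toy data: one score per user; ground truth says users above 0.6 churn.
scores = [random.random() for _ in range(200)]
data = [(x, x > 0.6) for x in scores]

def engineer_features(rows, cutoff):
    # Data stage: feature engineering (here, a single threshold feature).
    return [(x >= cutoff, label) for x, label in rows]

def train(rows):
    # Training stage: this toy "model" just passes the feature through.
    return lambda feat: feat

def deploy(model):
    # Deployment stage: canary tests, rollout, etc. would happen here.
    return model

def evaluate(model, rows):
    # Evaluation stage: measure accuracy in "production".
    return sum(model(f) == label for f, label in rows) / len(rows)

# The cycle: go back and iterate on the feature until the model is good enough.
best_acc, best_cutoff = 0.0, None
for cutoff in (0.2, 0.4, 0.6, 0.8):
    rows = engineer_features(data, cutoff)
    model = deploy(train(rows))
    acc = evaluate(model, rows)
    if acc > best_acc:
        best_acc, best_cutoff = acc, cutoff

print(best_cutoff, best_acc)  # the 0.6 cutoff recovers the ground truth exactly
```

In real systems each stage is its own tooling (orchestrators, experiment trackers, CD, evaluation stores), but the never-finished loop is the same.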
On the MLOps side, we've kind of broken it down into, let's call it, abstraction layers. I'll go through all of them; there's actually five that we've named. At the very top, we call it the platform layer. The platform layer is what a data scientist interacts with; it kind of goes across everything. And you can think of it this way: DataOps has no idea what a model is. It has no idea what evaluating a model looks like. So on the MLOps side, you need a platform that really understands the ML lifecycle from end to end, and it's the one unified layer. Underneath it, there's the workflow tool for data, which would be what we call a virtual feature store, which is kind of orchestrating the infrastructure, almost the application layer that ML data scientists use to interact with their Spark, Redis, you name it. You need your experiment tracker, think MLflow, Weights & Biases, Comet. There is your deployment workflow, which, again, you can even think of like Spinnaker; it's kind of like the CD for models. And finally, there's the evaluation store, which is taking the metrics that you're storing and allowing you to evaluate; it's kind of the Kibana, in DevOps terms. So you almost have the Spinnaker, the Kibana, the data orchestrator. There isn't really a good analogy for training; I guess it would be like a build tool, maybe like the CI. And then beyond that, there's still training services. You need something that actually trains the model. You need something that actually serves the model. You need something that actually stores data and transforms data; that could be the Spark and Redis. You need something that actually collects the logs. So there's a lot. This is a very long answer to say that there's a very wide space and a lot of problems to be solved.
And that's just one view of it. There are probably a hundred vendors that don't fit into the framework I described that are still very valuable tools.
A hundred percent.
So what is Featureform?
Yeah.
Featureform, we call ourselves a virtual feature store.
Okay.
We're an open source product, so you can go check us out on GitHub. We are a place for data scientists to define, manage, and serve their features. And to imagine that, if you're a data scientist listening to this, I bet you right now there is a notebook that you have been using for work that's called Untitled_118.ipynb, that you've been copying and pasting from. You have some Google Docs full of useful SQL snippets. You know that there's one person in the company that knows how training set N was built, and you have to go Slack them tomorrow so that they can remind you where to find whatever data. There's so much ad hoc-ness that comes in the process of features. It's just completely made up.
What we've figured out, I think, and what a lot of people have focused on, is getting the right data infrastructure tools. Like, we know how to handle a lot of data. So what I would call the platform problem, scaling compute, having low latency serving, having high throughput, we've solved that problem, I think, pretty well. What we haven't solved is making that a product that is usable and valuable to the data scientists interacting with that layer. So the virtual feature store you can think of as an application layer over your existing data infrastructure, which provides that versioning, access control, governance, a nice API for serving, everything you would need around features. That whole workflow and orchestration, Featureform does using your existing infrastructure.
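As a sketch of the "application layer over existing infrastructure" idea Simba describes, here is a toy virtual feature store: definitions, versions, and ownership metadata live in a registry, while storage and serving are delegated to whatever provider is plugged in. Every name here is hypothetical; Featureform's real API differs.

```python
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    name: str
    variant: str          # versioning: "v1", "quantile_v2", ...
    transformation: str   # e.g. a SQL snippet that runs on your own warehouse
    owner: str            # lightweight governance metadata

class VirtualFeatureStore:
    """Holds metadata only; actual storage and serving are the provider's job."""
    def __init__(self, provider):
        self.provider = provider  # e.g. a wrapper around your existing Redis
        self.registry = {}

    def register(self, feature):
        self.registry[(feature.name, feature.variant)] = feature

    def serve(self, name, variant, entity_id):
        if (name, variant) not in self.registry:
            raise KeyError(f"unknown feature {name}:{variant}")
        return self.provider.get(name, variant, entity_id)

class DictProvider:
    """Stand-in provider; in practice this wraps Redis, Postgres, Cassandra..."""
    def __init__(self, values):
        self.values = values
    def get(self, name, variant, entity_id):
        return self.values[(name, variant, entity_id)]

provider = DictProvider({("avg_order_value", "v1", "user_42"): 37.5})
store = VirtualFeatureStore(provider)
store.register(FeatureDefinition("avg_order_value", "v1",
                                 "SELECT AVG(total) ...", owner="ml-team"))
print(store.serve("avg_order_value", "v1", "user_42"))  # 37.5
```

Swapping `DictProvider` for a Redis- or Snowflake-backed one changes nothing above it, which is the "nice clean puzzle pieces" point made later in the conversation.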
Okay.
And, okay, feature stores is a term that has been around for a while now in the tech industry, at least. So what is different between Featureform and what came before, right? Like the other feature stores out there. Okay, I got a hint from you, it's like, virtual, but I think it would be great to understand what it means, like, virtual, compared to the existing solutions out there, right?
Yeah, so we've broken down feature stores into three categories. We call them the literal, the physical, and the virtual. I've talked about the virtual. Let me talk about the other two.
The literal feature store literally stores features.
So if you think, there's a handful, I mean, even SageMaker's feature store and Vertex's feature store kind of follow this architecture, where you create your features elsewhere and then you finally store them in the literal feature store. The value prop is all your features are in one place, and that place can serve for both training and inference. Now, the thing that we believe is missing in that is: your untitled notebook is still there. You're still interacting with Spark. All that's changed is rather than writing to Redis and S3, you're just writing to one place. There's definitely some value there, but we think it misses the main pain point that data scientists have.
Versus the physical, which kind of does it all. It does what we're describing, the transformations, the storage, all of it, but it does it on its own infrastructure. So it's really kind of this heavy tool; they sometimes call themselves a feature engine. And this is kind of what we built at our last company. You can look at some of the internal ones, like Michelangelo and Zipline; they kind of follow this. It's really heavy to implement, it kind of replaces existing infrastructure, and it's really expensive. You have to write all your features in the new DSL that they've created. It's doing everything, at the cost of super high adoption costs, sometimes impossibly high adoption costs. Like, it's impossible if you're a large bank to get all of your data through one place and process it all in one place that some startup is running. So that's one thing. And the other thing, beyond adoption costs, is there's kind of this lock-in. Like, for example, if that physical feature store doesn't do streaming in the way you'd like, or can't handle some transformation you want to do, you're out of luck. The virtual, just to complete it, a little bit of a repeat from before, is what would happen if you took that physical feature store, chopped out all the actual processing and storage, and provided nice, clean puzzle pieces for you to plug in your existing infrastructure. And what happens in reality is it's actually like the physical feature store and much more.
Because we have customers who are like, yeah, we have multiple Spark clusters. We have Redis for this team, and we have Cassandra for this team. And you kind of get this data meshy, heterogeneous infrastructure thing that happens, which is more true to form for enterprise, while having one unified application layer.
And I actually think that's the future for a lot of this stuff.
Choose the right data providers to get the characteristics you need from your platform,
and you need the best-in-class API.
Featureform's name is actually a nod to Terraform. And we kind of have this idea where it's like, yeah, Terraform isn't the cloud provider, right? What Terraform, or HashiCorp rather, does is it has the best API. Would you ever build Terraform yourself? No. Why would you? It's come to this perfect API that fits well for the use case. So our goal at Featureform is to build the best possible APIs that your data scientists love to use, while giving you the flexibility to get all the other characteristics you need.
Yeah, that makes total sense. And one last question for me. You mentioned features, and that features are different from metrics. Do we also need a different storage layer for features, like database systems different from the ones we are using today? Or is it the same data lakes, the same OLTP databases, or things like Redis caching layers, that we use to also store the features?
So underneath, typically you can plug in whatever you want. We have people use Postgres, we have people use Oracle, which is a very common one, probably one of the most common for inference. For the offline store, it could be S3, it could be HDFS, it could be Snowflake. We think that those companies are amazing at solving those problems, like being able to have really low latency reads, or whatever latency you need, and balancing costs. That's what they do all day long, that's what they think about. So why beat them? If we try to play that game, we're going to lose. We're going to build a worse Redis. Why build a worse Redis? Let's just let you plug in Redis. And even managing it, we're not going to manage Redis as well as Redis Labs, or as well as the company itself doing it. So, you know, that's yours. We'll just do the application layer. One thing I'm going to add, it's not what you asked, but I think it's interesting: one other type of database that we think a lot about is the vector database.
That would be my next question, actually. What is a vector database, and how does it differ from a feature store?
Yeah, so to
understand the vector database, you sort of have to understand this concept of an embedding. And you can think of an embedding as, well, it's literally a vector. Literally, it's n floats that we treat as a vector, and you can actually think of it as plotted in space. So if you have a 3D embedding, it's literally a point in 3D space. How we create that embedding, I'll spare you, but what you can think of is, let's say I'm building a recommender system, and this is literally what we did. I could go and look at Kostas's buying history and my buying history, and then all of our buying histories, and I can take it, and magically, I have this thing that's called a transformer, that magically takes that and turns it into a vector. That vector is a holistic representation of our buying tastes.
I can actually plug it into other models and use it as, rather than putting your name down, your ID down, I can put a vector, and it's going to be a more dense representation of you. There's also another interesting characteristic that happens, where users of similar buying behaviors are near each other in latent space. Which, again, if it's a 3D embedding, they would literally be close to each other in space. So there's that unique property: things that are similar, in behavior or whatever the latent space captures, if it's users, it's behavior; if it's text, it's text viewed as similar according to this transformer. And you need something that can quickly get nearest neighbors. It's like, hey, I want to make a recommendation, I want to go and pull the ten closest items to this item so I can show similar items here. So it's an index, and these indexes have existed for a while, because the exact nearest-neighbor problem is relatively slow; you kind of have to brute force it. You can get approximate nearest neighbors relatively quickly, faster for sure. And those indices have existed forever, not forever, but for a long time. Like, I remember building, I built a vector database myself probably four times in my career now. And what was missing was everything else that makes a database a database, right? Like, a database isn't just an index, or an in-memory index. So the vector database takes this index and builds everything else to fit it into being a database.
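The brute-force nearest-neighbor search Simba mentions, the thing a vector index speeds up, can be sketched in a few lines. The item embeddings here are made up for illustration.

```python
import math

def euclidean(a, b):
    # Distance between two embedding vectors of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, embeddings, k):
    """Exact (brute force) k-NN: compares the query against every vector,
    which is why vector databases use approximate indexes (HNSW, IVF, ...)
    once the catalog gets large."""
    ranked = sorted(embeddings.items(), key=lambda kv: euclidean(query, kv[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical 3-D item embeddings; similar items sit near each other in space.
items = {
    "mountain_bike": (0.9, 0.1, 0.0),
    "road_bike":     (0.8, 0.2, 0.1),
    "helmet":        (0.7, 0.3, 0.2),
    "novel":         (0.0, 0.9, 0.8),
}

print(nearest_neighbors(items["mountain_bike"], items, k=2))
# -> ['mountain_bike', 'road_bike']: the item itself, then its closest match
```

A real vector database wraps this lookup with everything else that makes a database a database: persistence, updates, filtering, access control.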
And I'll give one more example where embeddings get really interesting. For YouTube, the way YouTube works is there's two models that do the recommendation. The first model is candidate generation: it will grab the thousand nearest-neighbor videos to your user embedding. And then they'll take those embeddings and feed them into a second model to do ranking, so they'll actually use the embedding. This multimodal kind of idea of chaining models together and using embeddings as this kind of intermediary language is the future of ML. It has been for many years, but now I think people are becoming more aware of it.
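The two-stage pattern Simba describes (cheap candidate generation over the whole catalog, then a heavier ranking model over only the candidates) can be sketched like this. The embeddings, the freshness signal, and the scoring weights are all hypothetical stand-ins.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D video embeddings and one user embedding.
videos = {
    "v1": (0.9, 0.1), "v2": (0.8, 0.2), "v3": (0.1, 0.9), "v4": (0.85, 0.15),
}
user = (0.9, 0.1)
freshness = {"v1": 0.1, "v2": 0.9, "v3": 0.5, "v4": 0.4}  # stand-in signal

# Stage 1: candidate generation, a cheap nearest-neighbor lookup in
# embedding space, trims the catalog down before the expensive model runs.
candidates = sorted(videos, key=lambda v: dist(user, videos[v]))[:3]

# Stage 2: ranking, a (here trivial) second model scores only the candidates,
# blending embedding similarity with other signals like freshness.
def rank_score(video_id):
    return 0.7 * (1 - dist(user, videos[video_id])) + 0.3 * freshness[video_id]

ranking = sorted(candidates, key=rank_score, reverse=True)
print(ranking)  # freshness lifts v2 above the strict nearest neighbors
```

The design point is that the ranking model never sees the full catalog, only what the embedding index surfaced.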
And I think that even prompts and LLMs and all this, eventually it's all going to break down to, hey, we have transformers that generate embeddings. Everything is kind of going to look like either embedding-to-whatever, like images, or image-to-embedding, but everything falls back into that space.
Yeah, okay. That was one of the best explanations of what all these technologies are. I think everyone's a little bit confused lately with all the noise around all these things, but that was amazing. Thank you so much.
Of course.
Brooks, microphone is back to you again. It's all yours.
Yeah, one thing I know I can easily do as I fill in for Eric is say we are at the buzzer, which he always says. But thank you all so much for coming and chatting with us.
I know it's a busy conference for you all.
So, yeah, thanks for coming and teaching us more about MLOps.
Last question before we kind of sign off here.
Folks want to find out more about Feature Form.
Where can they go?
You mentioned GitHub, but where else can they go?
Our community is a great place to ask any questions. We share updates there, both around the product and webinars, meetups, events, all the great content that the team produces. So yeah, join us.
Awesome. Yeah, Featureform. Check it out. Thanks for joining the Data Stack Show.
Please subscribe to the show and we'll catch you next time.
Thank you.
Thank you. Thank you.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com
Thank you.