The Data Stack Show - 144: Explaining Features, Embeddings, and the Difference Between ML and AI with Simba Khadder of Featureform
Episode Date: June 28, 2023

Highlights from this week’s conversation include:
- Simba’s background in the data space (3:05)
- Subscription intelligence (6:41)
- ML and Distributed Systems (9:09)
- The Brutal Subscription Industry (12:31)
- Serendipity in Recommender Systems (16:31)
- Subscription as a Strategy (20:47)
- Customizing Content for Subscribers (22:19)
- Creating User Embeddings (25:53)
- Building Featureform (28:01)
- Embedding Projections (32:47)
- Spaces and similarity (35:53)
- User embeddings and transformer models (38:22)
- Vector Databases for AI/ML (45:05)
- Orchestrating Transformations in Featureform (51:00)
- Impact of new technologies on feature stores (56:17)
- Embeddings and the future of ML (59:20)
- The gap between ML and business logic (1:02:26)
- Final thoughts and takeaways (1:06:37)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles.
It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake.
You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show. Costas,
we are going to talk with Simba from Featureform. And boy, do I have a lot of questions. We actually did a lot of data science stuff last summer.
We talked with people building feature store stuff.
We talked with people building ML op stuff.
But Simba actually has a really interesting perspective
on the entire spectrum of problems in the space.
So I'm going to leave you to talk to him about the technical details.
I'm going to ask about the moment of serendipity. So he did a ton of subscription work,
and he figured out why people would subscribe to publications like the New York Times, etc.
And so I'm going to ask him about that. It's super interesting to me
because I think machine learning
can help us understand a lot about that.
But, you know, of course,
being a consumer behavior guy,
it can't answer everything.
So I want to know what he knows about that
and then understand how features relate to it.
So what are you going to ask him?
Yeah, you mentioned that we did a few episodes in the summer, a couple of months ago, right? About MLOps and the tools and the technologies in this space. But I think we are living right now in a completely different world in terms of the technology landscape, especially because of all the LLMs and OpenAIs of the world, and all these new technologies where we are still trying to figure out how they are going to change things. So I think we have the right person to discuss that with. I'd love to talk with him about more foundational things: what are embeddings, what are features, what are feature stores.
Let's revisit all these terms that we have known for a while now, and see how they have changed because of everything that has happened in the past six months in the industry. So yeah, that's what I'm going to focus on, and I'm sure, because I know Simba, that it's going to be a very interesting and captivating discussion.
All right, let's dig in.
Let's do it.
Simba, welcome to the Data Stack Show. So great to have you.
Hey, thanks for having me.
All right, so give us your background.
How did you get into data,
and what was your path that led you to Featureform?
I started at Google, and I was there for a little while.
I can say I'm one of the few Googlers to have ever written both PHP and x86 at my time there.
Oh, wow.
I worked on a data store.
Yeah, it was kind of interesting. I truly went on both ends, and I've learned that it's kind of a horseshoe: it kind of sucks on both ends. You kind of want to be in the middle.
Yeah. On different teams? Can you tell us, what did you start with, PHP?
Yeah, I definitely earned my stripes before getting to x86. It just happened to be different projects. For the PHP, I worked on the data store. I was working more on the API side. I worked on a lot of different parts of it, but one thing I worked on was fine-tuning some of the lesser-used APIs, and one of them was PHP. Someone had to do it, and it happened to be me. And so I learned more about PHP than I ever wanted to, but I can officially say I've used it in prod at Google scale, which I don't think many people, I mean, maybe Facebook people, can say.
Okay, cool. Moving on.
And the x86, I worked on Google-wide profiling.
So I worked on making Google go faster.
I worked specifically on protos and a few other things.
But I mean, I was really focused on search.
So obviously a big piece of what Google does.
And so I worked on making it go faster.
And that was kind of the start of my career.
I always really liked hard technical problems. I mean, this was when I was pretty much right out of school. Most of what I worked on, most of my studies, and what made me happy up to that point was distributed systems. I had dabbled with ML. I'd done a bit. Some of the stuff I was touching was search.
I actually got to interact with some of the
ML team briefly at Google
and just got to learn a bit.
It happened to be that TensorFlow was coming out right when I was there. So I got to see some of the early iterations and just kind of got hooked. I think what drew me to distributed systems to begin with is how messy it is. I never liked computer vision, because I found the answer being typically binary so boring: it is or it's not this thing I'm trying to classify. But distributed systems aren't like that. I really liked that there was never really a right answer. Every solution is a give and take, and there's a little bit of an art to doing it well, I think.
And then after that, I left Google. I had a lot of product ideas at Google. This was also when Google Cloud was coming out. AWS was the behemoth, but, you know, maybe Google Cloud would eat their lunch. We were kind of at the tipping point where people were like, oh, maybe that's not going to happen. And I had all these ideas. They were probably bad ideas, I was like 20.
But I still wanted to go and learn and continue learning. So I left Google and started my first company, which was Triton. At Triton, we went through a lot of iterations. I learned a lot. I didn't start Triton with an idea. All I had was a logo and a name and a co-founder. Yeah, I had nothing. So whenever people are like, oh, I'll leave when I have a good idea, I'm like, it's not a prereq. I didn't have one. You can do it before that. Yeah, we just figured it out. And honestly, there are pros and cons, but it at least allowed us to build something real, because we weren't a solution in search of a problem.
We actually had to go find a problem and solve.
And we landed on, we called it subscription intelligence.
We did everything from personalization as a service.
We'd help people do recommendations.
But really the goal of it wasn't just recommendations for recommendation's sake.
We really were focused on driving subscriptions.
There was kind of this movement.
It's still kind of happening for B2C products and companies to move from ad-based models
to subscription.
Yeah.
I think it makes a lot of sense.
It's much more tied to value. With ads, it's almost like a bait and switch: I'm trying to get you to use this thing, but I'm not really giving you value, because you're the product. Subscription always seemed to make more sense to me.
It's also much less wasteful for certain categories.
And anyway,
so I was helping drive that shift by helping companies who were still treating things in the ad-based way. We worked with publishers and news companies, and while there were obviously teams that had switched, I think as a whole they were still taking the ad-based methodology that worked well for chasing eyeballs. I was trying to help them use data to figure out how to change their strategy to drive more subscriptions, decrease churn, and understand why users subscribe. That was really the whole tagline. My easy one-line sales pitch was: do you know why your users subscribe? And the answer was almost always, not really. And so that's what we'd sell them.
Yeah, I love it. Okay. So many questions about sort of that moment of subscribing and, you know,
a number of things on recommendations, but let's rewind just a little bit. So you got exposed to distributed systems at Google.
I would love to hear about the moment or the epiphany of saying,
this is a real thing that is going to affect my job and data and all that sort of stuff.
When did that happen for you where you sort of, you know,
were going from working on, you know,
PHP stuff to, you know, realizing,
okay, distributed systems are going to be like
a really big deal.
Yeah.
So, two points. Firstly, at that point it was already clear that distributed systems were a big deal. What I learned was going to become a big deal was ML, and just more data generally. I don't even remember if Spark had come out yet; it probably had, but it wasn't widely used. It was still kind of a new era.
And I think what I really saw,
I think distributed systems,
Google has always been this king of distributed systems.
They've always been ahead of the curve.
They released the MapReduce paper.
They released Borg, the internal predecessor of Kubernetes. It was a Google thing.
I think they've always done that very well.
I mean, even going back to like having commodity hardware,
like I think there's like early stories of Google where one of their
innovations was to duct tape on hard drives.
So if a hard drive failed,
they could literally just like rip it off and place another one on without
doing anything.
And that was an innovation that Google came up with. So I would say at that point that had kind of played out, but I was just interested in the problem space. And there was this adjacent problem space, which is ML. It's obviously very different, but there are a lot of the same patterns and characteristics that make the problem really hard, which drew me for the same reasons distributed systems did. But there was an extra kicker. Everyone talks about how ML is going to change the world, but especially seeing it at Google, I really began to understand how every interaction, at least every digital one, really everything, is going to have ML behind it. I feel like an interesting exercise would be to see how many models you interact with on a given day just doing your job. It's probably a lot, definitely over 100 for an average tech worker. Every time you buy something, there's fraud detection plus a handful of other models. Every email you get, there's a marketing model behind it. There's just so much. I mean, Google Search is a model.
So there are hundreds of models that we interact with just on a daily basis. Even when you go to the grocery store or buy something, everything in the supply chain is going to be models attached
to that. So I think once it clicked, I realized that, hey, this is not only a space I find interesting, but a space that's still ripe for impact. Distributed systems had played out so much that at that point you kind of had to be at a PhD level to really have an impact, and even then, we were in the optimization stage of that trend. Either we were going to have a whole paradigm shift, which didn't happen, and I can't even imagine what that would look like, though surely it will happen one day, or incremental gains. ML, meanwhile, was still kind of greenfield space. No one really knew how to do it; no one does it well. So I think that's what drove me there.
Okay, this is fascinating to me. So you went to subscriptions and you worked with,
you know, media companies, which, as someone with a marketing background, I see as sort of the most brutal, bleeding-edge space: man, how do you get a bunch of anonymous traffic to convert, on small margins? So you go from, okay, distributed systems, this is interesting, ML, this is going to have a big impact, and then you go right for the bleeding edge of the hardest problem. First question: why did you go for the hardest problem? I mean, subscription for media is brutal. Is it because it's hard? Like, hard problems?
That was definitely part of it. I think, again, we
were in search of a problem, and that's a huge one, and I think we kind of landed there. I actually kind of forget why we ended up in media. My old co-founder, who's moved on now, actually, has always had a big tie to media. He's always been attracted to it, so I think there was a bit of founder-market fit there, and I think we leaned in that direction.
And the problem space was super fascinating, and it was very untouched. They weren't really doing much ML there. And the reason for that, which is maybe why it was hard to have a huge business there, though we ended up being relatively successful, was that you don't really make a lot of money per user. It's a pretty brutal industry.
So, I mean, VCs weren't super stoked about us when we told them that we were selling to media.
Yeah.
A dollar per month, you know, a dollar a month to get access to this content.
Another interesting point: I don't know what the number is now, but we found that, I think it was The Guardian, still made something like 50% of their revenue off of print at that point. And it wasn't that long ago, I believe. There were all these really interesting things we learned about how the industry works and functions, which were really surprising to us. Like, the reason why, if you go into an airport, you see Hudson News or all the news names on those stores is that the news companies were these big conglomerates coming up with new business lines. For some weird reason, someone, I forget who it was, I think it was CBS or maybe CNN, came up with, we're going to do these kiosks in airports, and that's going to drive a ton of revenue. Anyway, there's just a lot of weird stuff I learned about news media and how it works. It's fun. I got to become an expert in something I never thought I'd be an expert in. Obviously it's societally very relevant to know a lot about how media works, but even technically, it was just a fascinating problem space.
Yeah, absolutely. And so you solved ML problems on the bleeding edge of, like,
the most difficult, you know, sort of low-margin problem space. And that's obviously relevant to all sorts of other spaces. When we were talking before the show, you mentioned something called the serendipity moment, and how two users who look very similar can follow the same path, and maybe one subscribes and one doesn't, and how that's a very challenging machine learning problem. Can you break that down for us, and start by describing: what is the serendipity moment?
Yeah, the serendipity moment. We've all felt it: that feeling when you find something you weren't really looking for, but it's kind of awesome, it's exactly what you wanted. That feeling, that little dopamine hit, is a serendipity moment. In recommender systems, let's say you go on, I don't know, Spotify, YouTube, you name it. Let's say Spotify, and Spotify recommends a song for you, and you're like, I've never heard of this song, I don't know who this artist is. And you click it, and at first you're kind of skeptical, because it seems off, but then on a second listen you're just like, this is my new favorite song. That moment is exactly it. That moment is magic. The problem is, and this is actually known even pre-digital, this is even the way a grocery store sets up its aisles, one thing they consider is that serendipity moment and what is most likely to trigger it. It's almost like you have to hit this gray area where it's not obvious. If it's obvious, it's not serendipitous.
If you love an artist, you love Red Hot Chili Peppers, and I recommend another Red Hot Chili Peppers song from an album you really like, you might like it, but chances are it's not serendipitous.
Yeah, I mean, that makes sense. You kind of have to be a little bit off target, but if you go too far off target it becomes random. It's a discovery exercise, right? It's exposing you to something that you're likely to like, but that is not 100%, we-know-you-already-like-this.
And the hard thing, and why I think I really got pulled into this space, is that it's kind of immeasurable. There are papers on ways you can try to measure it, but it's essentially immeasurable. We can't really measure the serendipity of a recommendation.
And it only really makes sense when you're doing things for humans, because with computers, most things have answers. Serendipity is really hard because it's a human effect. I mean, you could attach something to someone's head and look at every neuron and try to figure out the serendipity of it, but until we can do that, we're in this hard place where we have to somehow use behavior to see if we captured serendipity. And it's just really hard.
Like you mentioned, you might have two users who are exactly the same, and if you show one of them a song, they might be like, oh my god, this is my new favorite song, and it completely changes their course. And the other user will be like, this sucks, why the hell did you recommend this to me? That's just what happens. It's impossible to know until after you do it, because you have incomplete information. And with humans, you will always have incomplete information, because you can't plug into their brain wiring and make it deterministic.
Yeah, for sure.
I mean, it's like, okay, you recommend a song, and one person's just been through a breakup, but another just had a good date, right? They're the same until there's a divergence, and the song is going to mean different things.
Okay, I want to dig into feature form
and especially how features influence like this sort of stuff.
But one question before we dig into that,
how do you balance as someone who's building technology in this space,
the influence of like the commercial element, right? So you mentioned grocery stores. And so
it's like, okay, well, how do you create serendipity? But there are also people bidding
for the end cap, you know, to put their cereal on the aisle that's most prominent. And,
you know, marketplaces are like that. And of course, like that's a business model for marketplaces where you can bid for space.
Like, how do you consider that as part of a recommender model?
Great question.
It actually comes back to why I think subscription is a very good strategy for a lot of these companies: it's essentially, hey, I'm trading value, but I'm averaging it out over a month, let's say, probably over a little more time. So it's more like, you're buying in. Once you're bought in, I don't have to play games anymore. All I have to do is make sure you're getting as much value as I can give you to justify your $5 investment or whatever.
Sure.
And typically with subscriptions, I mean, you also have to balance the cost of goods; there's an equation there you could think through. But in general, we leaned towards: just assume the costs are costs. If you can get people to subscribe, it's all worthwhile. Just make sure they're getting value, because when people subscribe, they get an all-you-can-eat type experience, especially in media.
Oh, sure.
Yeah.
So it's more like...
And then you can customize that experience subsequently to make sure that they're getting value, yeah.
Well, part of what the product would do is also recommend types of content that do well, and it might change the kind of content that gets created. This is why The Information and The Athletic, who have become very successful, have very different types of content in their subscription tiers than other companies. They were built subscription-first, and they came to the conclusion that content you could only get there (no one else has it, typically a bit more opinionated, typically a bit longer, a bit more dense) is more likely to drive subscriptions than headline-style content, which is great for getting clicks and views but might not drive subscriptions. So anyway, there are a lot of things that come into play.
And I think the short answer is we're trying to measure value. But how do you measure the value of a recommendation, like a Spotify song recommendation? Well, you can't. You can only see what the user's behavior looks like after. And then you use that to say, well, that must have been successful: I recommended this song from an artist they'd never seen or heard of before, and now that's their top artist. It must have been a very successful recommendation. I provided value, so they're very likely to stay a subscriber, because they had that magic moment.
and so we tried to like make that happen and maybe going into like the, how we do it a bit and
diving in, just be in the shoes of a data scientist at a training, like you open up
this, let's just say it's a giant file.
Let's just make it simple.
It's a CSV and he's CSV.
It's like, Hey, this user may be anonymous, likely anonymous, actually
very less anonymous looked at or listened to this song at this time.
And maybe I'm some song metadata and some table somewhere.
Yeah.
And you just have like billions of those rows, like more, like you just
have this ridiculous number of these and you have you know like we're handling 100 million multi-active users
at rp you can't like go one by one in fact if you pick a thousand of them it's still like doesn't
matter like a thousand of those like most of them are just noise anyway so what became really
So what became really interesting, and the other thing that we did, which is unique to us: we had this recommender system, but it wasn't like we were just offering recommendations as a service. I was having trouble closing a few deals, and I went to this prospect, actually flew out to New York, and got a beer with them. I brought a contract I had printed out, and we just chatted, not about the deal, we just chatted. Then I brought it out and said, I brought a contract. I gave him a second to look at me funny, and then I said, I'm not expecting you to sign it. I just want to know why you won't. I just want to know what's going on in your head right now. And attribution was the big problem. So I asked, if I solved that, would you sign? And his answer was literally, forget the recommender system. If you solve that problem, you'll understand why my users subscribe. I'll pay you for that. Even without the recommender.
And what I realized we had to do: the recommender essentially understood why people subscribed. The way the model was designed was actually to drive more subscriptions. It wasn't just recommending things people click on; the loss function wasn't exactly subscription, but it was much more correlated in that direction. And so we actually had to build this almost-explainability layer on how the recommendations worked, but display it not to a data scientist who knew what the model was, but to someone who couldn't care less about what the model was, while still giving them some value.
And the way we did that, real quick, then I'll hand it back, is we'd create these user embeddings. We can dive into what that means, but generally what we would do is create these embeddings, which were like holistic views of the user. And we would cluster them together into what we called personas. So we'd have these n personas, and we would then provide more traditional BI, like here's how often they come along, here's what days they come, these very traditional metrics, but we would do it for these magical personas. And those magical personas were generated by clustering, essentially, the recommender system's holistic view of the user.
And I think that, yeah, that was the magic of being able to capture and fit all those things together. And yeah, so I'll pause there.
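A minimal sketch of that persona idea, assuming the user embeddings have already been computed; the dimensions, cluster count, and metrics are illustrative, not Triton's actual pipeline:

```python
# Cluster user embeddings into "personas", then report traditional
# BI-style metrics per persona rather than per individual user.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
user_embeddings = rng.normal(size=(10_000, 64))  # one 64-dim vector per user

kmeans = KMeans(n_clusters=8, n_init="auto", random_state=0)
persona = kmeans.fit_predict(user_embeddings)    # persona id for each user

# Join personas onto ordinary usage metrics and aggregate per persona.
metrics = pd.DataFrame({
    "persona": persona,
    "visits_per_week": rng.poisson(3, size=10_000),
    "is_subscriber": rng.random(10_000) < 0.05,
})
print(metrics.groupby("persona").agg(
    users=("persona", "size"),
    avg_visits=("visits_per_week", "mean"),
    subscribe_rate=("is_subscriber", "mean"),
))
```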
No,
I love that.
I love it.
So many questions, but I've been monopolizing.
So why don't we do this?
Can you give us, which I should have asked you to do at the beginning, the Featureform pitch?
And then I want to hand it off to Costas to dig into how you actually do that from a technical
standpoint.
Yeah. So again, picture me looking at this giant CSV of all these users. Most of what I'm doing is coming up with features. Those features would be: hey, how often did this user come to, let's say, Spotify? What's this user's favorite song in the last seven days? What's their favorite artist in the last 30 days? I would generate all of these features.
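As a rough illustration of that kind of feature engineering, a pandas sketch over a hypothetical play-events table; the column names and data are made up:

```python
import pandas as pd

# Hypothetical events table: one row per listen.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "artist":  ["A", "A", "B", "C", "C"],
    "ts": pd.to_datetime([
        "2023-06-01", "2023-06-20", "2023-06-25",
        "2023-05-01", "2023-06-27",
    ]),
})
now = pd.Timestamp("2023-06-28")

# Feature: how often did this user come in the last 7 days?
last_7d = events[events["ts"] >= now - pd.Timedelta(days=7)]
plays_7d = last_7d.groupby("user_id").size().rename("plays_last_7d")

# Feature: favorite (most-played) artist in the last 30 days.
last_30d = events[events["ts"] >= now - pd.Timedelta(days=30)]
fav_artist_30d = (
    last_30d.groupby(["user_id", "artist"]).size().rename("plays")
    .reset_index()
    .sort_values("plays", ascending=False)
    .drop_duplicates("user_id")
    .set_index("user_id")["artist"]
    .rename("favorite_artist_30d")
)

print(pd.concat([plays_7d, fav_artist_30d], axis=1))
```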
I'd also generate embeddings, which we'll get into, I'm sure.
And being able to do that alone was hard. It required wrangling Spark. I'd have to materialize things to Redis. There was a lot of work at a very low level. But the worst part, which I'm sure any data scientist listening to this can relate to, is we would have these Google Docs full of SQL snippets. We would have Untitled118, the IPython notebook that we'd be copying and pasting from. We had no source control, no versioning, nothing. It was all ad hoc. And we couldn't at any point in time look at a training set and be clear about, hey, which features did we use, and how were they created exactly? And could we do that again? It was just not on the table. It was all done in such an ad hoc fashion.
So we built Featureform to be this kind of framework that sat above the infrastructure, so we could still take full advantage of our infra, because, again, we were handling 100 million MAU. But it allowed the data scientists to define and manage and serve their features, their training sets, their labels, everything, in a framework that let them write SQL, write data frames, write what they're used to writing, but gave them the scaffolding to put all that together, so we could automate most of the low-level and mundane tasks that aren't just coming up with new features.
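To make that concrete, here is a hedged sketch of the scaffolding pattern, a hypothetical decorator-based registry in Python; it is illustrative of the idea, not Featureform's actual API:

```python
import pandas as pd

REGISTRY = {}  # (name, variant) -> metadata plus the transformation itself

def transformation(name, variant, owner, description=""):
    # Register a plain pandas function as a named, versioned, owned
    # feature transformation instead of an anonymous notebook cell.
    def wrap(fn):
        REGISTRY[(name, variant)] = {
            "owner": owner, "description": description, "fn": fn,
        }
        return fn
    return wrap

@transformation(
    name="plays_last_7d", variant="v1", owner="simba",
    description="Listens per user over a trailing 7-day window",
)
def plays_last_7d(events: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    recent = events[events["ts"] >= now - pd.Timedelta(days=7)]
    return recent.groupby("user_id").size()

# The registry now records how each feature is created, by whom, and in
# which version: exactly the lineage the ad hoc approach loses.
print(REGISTRY[("plays_last_7d", "v1")]["description"])
```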
All right.
So Simba, you mentioned two very interesting terms, and I would like to talk about both of them.
You mentioned the word feature
and the term embedding, right?
So why do we need two different terms, first of all?
What's the difference there, right?
And help us understand a little bit: what came first? What are the differences, or the similarities? And let's get a little bit deeper into both of them.
Because I'm pretty sure that there's a lot of confusion around these terms out there.
Yeah, and sorry, the first term was which? I heard the second term, embedding, but what was the first one?
Feature.
Feature, yeah.
And an embedding is a subtype; it's a type of feature. So let's first talk about what a feature is. Well, let's talk about how an ML model works. A model is this black box function. You might be able to understand it, but you can think of it in this case as a black box. It's a function: it takes signals, inputs, and generates an output, a prediction.
That's kind of how most models that we use work.
Now, those inputs are going to be things,
like in the Spotify example,
might be things like my favorite song in the last 30 days,
might be my favorite artist in the last seven days,
and might be a variety of different signals.
And I like using the word signals
because I think it's a better term.
It captures what a feature is better.
A feature is really just like signal from the raw data that you're providing into the
model.
Like in some computer vision cases, it might be just literally the raw image.
That's it.
Signal is just a raw image.
But in most situations, especially with NLP, especially with like any tabular data, which
is like fraud detection use cases, recommender systems, et cetera,
we take a lot of steps into taking that raw data,
taking our domain knowledge as a data scientist,
crossing that with some data transformations
to generate signal that we then feed into the model
to allow it to do the best job possible.
So that's what a feature is.
And let's call them traditional features.
You could imagine the feature pipelines that generate them being things like data frames or SQL. They're kind of well-understood concepts.
Now, an embedding is a very special type of feature.
So an embedding literally is a vector, a vector in the math sense: an n-dimensional point where each of the values is just a floating point number. Now, these embeddings have this interesting characteristic where you can embed a lot of different concepts. Let's say, again, I'm embedding users based on behavior. So I have a user embedding, which, if I'm Spotify, is maybe this holistic view of who you are as a user, what you like to listen to, trying to capture all the nuances that make you unique.
I somehow take all of that and I turn you into a point in space. Now, alone, that means nothing, right? Because it's like, cool, I have this random vector. That's great. You told me this is Costas; I'll trust you. But I don't know what to do with this. Where the magic comes in is when you have many points. When you're Spotify and you have many millions of users, you end up with millions of points. And it's almost like structure appears. If you Google embedding projections, you can see some of these structures.
They typically cluster, and there are a lot of really cool shapes. It used to be one of my favorite things to look at the shapes we could form based on different types of embeddings. There are all kinds of things that get injected into that space, that n-dimensional space. You typically visualize it as 3D space.
One thing, and the most obvious one that a lot of people are aware of
is users who are similar, like have similar music tastes in the Spotify sense
will be close together in space. So their vectors will have very similar values.
I want to dive into why that's hard for a second. Let's say I have a text embedding. So I have three pieces of text, and two of them are really close and one is far away. For the two that are close, the common way to vectorize, if you've ever taken an NLP class, the first thing you'll learn is a technique called TF-IDF: term frequency, inverse document frequency. That means you take the number of times a term, a word, shows up, multiplied by the inverse of the document frequency, which is how many documents actually contain that term. It ends up working out where common terms end up with a high document frequency, and therefore a low IDF, because they show up all the time, while rare words that a document uses often end up weighing really high.
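A minimal sketch of TF-IDF vectorization using scikit-learn; the documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the band played my favorite song",
    "the band released a new song",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Terms shared across documents ("the", "band", "song") get low IDF
# weight; terms concentrated in one document ("stock", "sharply") weigh
# high for that document.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```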
So this is a way to vectorize a piece of text, but it's kind of dumb, because it doesn't really understand the words, right? It just treats each word as an identifier. It works great, but it doesn't understand, say, sarcasm. You may have three documents that all have very similar words, but one is sarcastic, and that one's obviously very different. Now, an embedding, a good one from a good transformer, will actually capture all that nuance and put it into the embedding space. And the same with users and their listening behavior. It's not just, oh, they love Katy Perry and this user also loves Katy Perry, so we're near each other; it's a lot more nuanced than that. Now, the final thing, and I kind of went into this: when I build a traditional feature, I use a SQL query.
When I build an embedding, or something like an embedding, I typically use a transformer model. It's literally a machine learning model whose whole job is to take inputs and generate an embedding. You embed a concept, which is typically sparse data, into vector space. So anyway, there's a lot there, but that's a very long answer to a very short question.
No, I think you did an amazing job describing what an embedding is and how it differs from a feature. Hearing you describe an embedding, I kept two terms that you used. You mentioned a high-dimensional space, so there is the concept of a space there, and then we have points in this space, right? For people who have done basic algebra, it's not that different from what we were doing with vectors in algebra, right? We pretty much use the same algorithms at the end to calculate similarity. And I think that's also a big part of the beauty of this whole thing: we can take something so complicated and reduce it into a mathematical structure where basic tools from algebra can be used to answer questions about semantics, right?
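That similarity math really is just vector algebra; a small NumPy sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embeddings: near 1.0 means they
    # point the same way, near 0.0 means they are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
user = rng.normal(size=8)
similar_item = user + rng.normal(scale=0.1, size=8)  # a nearby point
random_item = rng.normal(size=8)                     # an unrelated point

print(cosine_similarity(user, similar_item))  # close to 1.0
print(cosine_similarity(user, random_item))   # much lower
```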
And that's what you were describing. You were saying, yeah, we can do things with frequency, but there's no semantics in there; we cannot really understand the difference between the meanings of the words. And that's what we do with embeddings today. But one of the things you mentioned: we have, say, the user embedding, the word embedding, whatever embedding, right? It seems like we need a different space for each one of these things we are trying to model. It's not like we will take a model that does word embeddings and use it to generate user embeddings, if I understand correctly, right?
You can, and we did it. We would put user embeddings and
item embeddings in the same space.
So we could actually use that to find: if I take a user, I can find the n items that are closest to it, and those would be the items with the highest affinity towards that user. So you can embed things into the same space.
The generic models typically don't do that; they have their own space that they're trained on. But if you own the transformer, it's very doable to build things in the same space. So I wouldn't say each thing has its own unique space that you can't cross between. If you wanted to, and I can't think of how to do this off the top of my head, but I'm sure it's possible, you could put images and users somehow in the same embedding space.
Another thing about the spaces: it's not like there's one user space that exists for all users.
The transformers themselves, the way transformers work, is that they are trained. I'll give you an example. I'll tell you how one part of our recommender system would create embeddings. We would have this model, and it would take all these, let's call them user attributes, just traditional features: what's their favorite thing in the last 30 days, what did they just listen to, their age, whatever other traditional features. We would feed those into this transformer model, and we would train the transformer to solve a surrogate problem. And the surrogate problem is really what defines the latent space. So the surrogate problem you train on is: hey, try to predict what the user is going to listen to next. Which is an impossible problem. There's just no way that, given those features, you'll have a model strong enough to guess with 99% accuracy what this user is going to listen to next.
Now, by doing that, you will there's a way literally how you do it is you
essentially take the last hidden layer
of a deep neural network and that's embedding that's literally how you create the embedding
it's actually those values is the embedding and there are many different tricks and techniques
like one thing you could do is rather than using the chances of clicking rather you predict how long will they watch the item so not only like, you predict how long will they watch the item.
So not only like which item, but how long will they watch it?
That will actually change the embedding space.
And funny enough, that change in embedding space will actually typically result in higher
quality recommendations than using click space.
So there's a whole science and art.
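A minimal sketch of that last-hidden-layer trick in PyTorch, under assumed shapes and a toy surrogate task; real systems are far larger:

```python
import torch
import torch.nn as nn

N_ITEMS, FEATURE_DIM, EMBED_DIM = 10_000, 32, 64

class UserTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
            nn.Linear(128, EMBED_DIM), nn.ReLU(),  # the last hidden layer
        )
        self.head = nn.Linear(EMBED_DIM, N_ITEMS)  # surrogate: next-item logits

    def forward(self, x):
        hidden = self.body(x)          # the values we keep as the embedding
        return self.head(hidden), hidden

model = UserTower()
features = torch.randn(4, FEATURE_DIM)      # a batch of user feature vectors
logits, _ = model(features)
target = torch.randint(0, N_ITEMS, (4,))    # "what they listened to next"
nn.functional.cross_entropy(logits, target).backward()  # train the surrogate

# At serving time, discard the head and keep only the hidden activations.
with torch.no_grad():
    _, user_embeddings = model(features)    # shape: (4, EMBED_DIM)
```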
And again, I love the art of machine learning. I love problems where it's creative and fun, and not just, hey, is this a hot dog or not a hot dog. No offense to computer vision people; I know I've been kind of talking it down. I love recommender systems, and embeddings in general, because of that art: I'm literally building a model, and it's a huge model typically, and it's really expensive to train, and I'm using it entirely for its last hidden layer, because that's the only thing that's useful to me. I actually don't care about the model itself beyond that.
So to build a model that generates embeddings, we start, again, from features?
Yes, yeah, it's all features. And actually, a funny thing: you can create embeddings on the fly as you're training your models, which is a whole other story, but we did that too. So anyway, yes, there are a lot of crazy things you can do, but I really want to highlight that embeddings are just a special type of feature. And even in that world, you're still using features; even to create the embeddings, you have to create features.
So it's a signal, right?
I mean, in general, as much as we like to imagine these types of models as magic wands that you sprinkle on text to magically make it do whatever, that doesn't really work. All these models are built and trained on very traditional features to create embeddings.
And typically, with the really cutting-edge models, when you think of a GPT, you actually start putting models on top of each other, so embeddings feed into other embeddings. If you're using images and text in GPT-4, which supports that now, I would be very surprised if those inputs aren't processed by different models to be embedded and then fed into another model to create a holistic embedding of the whole thing. Embeddings are just specialized features.
In fact, with Featureform,
a lot of the problems like this with traditional features,
like, hey, how did I create this feature?
How is it defined?
What version is this?
Where is it being used?
Who owns it?
Governance.
All the traditional problems, lineage,
that you would expect of, let's call them,
traditional features, totally happen with embeddings.
I can't tell you how many times I've had embeddings, and again, I've built my own vector database probably three or four times in my career, where I'd be like, man, I actually don't remember how I created that embedding. I don't remember which model I used, how I trained it, or where that model is, so I actually can't create new embeddings from it, because it's all just somewhere in the untitled notebooks on my laptop.
And so, yeah, I mean, that's why Featureform was built. Even now, people associate feature stores with traditional ML, but it was actually built with this kind of new-style ML originally. And we, let's say, cut that stuff out, because it wasn't cool when I was doing embeddings. It's cool now, but what I was doing wasn't cool then. So I was like, cool, let's cut that out, because no one knows what it is, and we'll focus on the traditional stuff. That's what we're using today. But we're actually about to release a lot of stuff in this space, stuff we'd actually built before. We turned it off, and now we're going to turn it back on, which is pretty exciting, actually.
I'm very excited, too, for this new stage.
Yeah, yeah.
And we will have the opportunity to talk about that. But before we get there, you mentioned vector databases. Naturally, they come up as the way to interact with and use embeddings; that's how most people hear about them today. So there is this concept of the vector database. As you said, at the end it's the embedding: from a representation standpoint, it's just a vector of float numbers. And we need to do operations on that stuff, right? Somehow we get them and we need to work with them, be able to search them, make comparisons, and so on. Pretty much the stuff that, let's say, we also do in a traditional database. So my question is, why do we need this new thing called a vector database? And how does it fit into the overall workflow when we are building data systems that also have some kind of AI or ML, whatever we want to call it, element in there?
Yeah. I'll get to how the LLM stuff adds a new flavor for vector databases. But as I mentioned before, I've built a vector database a few times in my career. We actually released one with Featureform, which is kind of deprecated now, just because there are plenty of other great options, which you should look at. But the reason I built it: okay, first, the problem to be solved. The problem to be solved, originally, before even the database part, was I would have these embeddings. A very common problem is doing a nearest neighbor lookup. Again, if I have a user embedding, I want to find the n items closest to it. Well, I just do a nearest neighbor lookup on this vector.
So that's how I would do it. Now, the problem is that doing a nearest neighbor lookup is a very expensive operation. Essentially, the only way you can do it correctly, 100% correctly, is to brute force. And so a variety of companies came up with approximate nearest neighbor algorithms. One of the most popular ones, which is funny because I think it's kind of lost in time now, but it was the most popular one, was one called Annoy, from Spotify. And get it: Annoy, A-N-N.
Oh, yeah.
So Annoy, an approximate nearest neighbor index, was in-memory. And the problem with that was, it's like if I give you a B-tree and say, here, this is your database. Well, that's great: you've solved the really hard algorithmic part, but there's all this stuff I have to build around it to actually deploy this thing. It was super common, I still see it, probably less now, but for a long time: to be honest, we would actually upload our embedding files into the Docker container with the model. And in that container, we'd read the file and create the Annoy index at startup time. That's actually how we would do it.
And then it made sense eventually to create a service that was almost like Annoy-as-a-service for us, which became, essentially, a vector database. I mean, there's more you have to do than that: you also have to persist to disk, and there's obviously more to it than just wrapping Annoy.
But that was one of the key problems to solve.
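A minimal sketch of that startup pattern with the real Annoy library; the dimensionality and vectors are made up:

```python
# pip install annoy
from annoy import AnnoyIndex
import random

DIM = 64                              # embedding dimensionality (assumed)
index = AnnoyIndex(DIM, "angular")    # angular distance ~ cosine similarity

# In the setup described above, these vectors would be read from a file
# shipped inside the container; random vectors stand in here.
for item_id in range(10_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(DIM)])

index.build(10)   # 10 trees: more trees give better recall, more memory

user_embedding = [random.gauss(0, 1) for _ in range(DIM)]
print(index.get_nns_by_vector(user_embedding, 10))  # approximate top 10 items
```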
There are a number of key problems which Annoy doesn't solve, and a lot of the others don't either, like being able to distribute the search. Being able to do filtering is a really hard problem, which none of the open source indices can do. The proprietary ones, I believe Pinecone, Weaviate, and Redis, can all do it. But there are a number of hard problems. These are index problems; these are database problems. It's a specialized index, and you either need to build a database around the specialized index, or you have to fit the specialized index into existing databases. The problem is that the existing algorithms, like the most common one I see now, HNSW, don't really play well with how databases are architected.
So the algorithms kind of have to be tweaked.
So we have to find an algorithm that has similar quality,
but also like kind of has similar characteristics
in how you scale it out as you would find in like a B-tree or whatever.
So yeah, anyway, that's the long answer for why vector DBs exist. They're definitely solving a real problem. Now, one thing that remains certain: people will be using embeddings. That's not a question, to me at least.
And the nearest neighbor lookup, the approximate nearest neighbor lookup, that's not going away. That's really common, and it needs a special index, no question. One misconception the market has, which I learned recently, is that people think vector databases are just a place to cache embeddings, which is not true. I mean, you could do it that way, but at that point it's just a list; you could put it in Redis. The thing that makes it special is that index, that nearest neighbor index.
So, yeah.
I'll pass it back on that note.
Yeah. No, makes total sense. And how does this fit into a system like Featureform, which is a feature store, right? So we have feature stores, and we also have vector databases. There has to be some kind of relationship. For someone who has no idea what we've talked about so far, I think the first signal that there is some relationship is that we have features everywhere; we need them to build the embeddings themselves. So how do feature stores work with vector databases? And let's stick with Featureform, the product that you built, right? How is it architected, and how does a vector database fit into that?
Yeah, one thing that makes Featureform unique, and this is not true of literally any other feature store, is how we work: we call ourselves a virtual feature store.
The virtual means that we sit on top
of your existing infrastructure.
We essentially turn what you have into a feature store.
And so we end up being this kind of framework layer
that data scientists love to use,
but also allow you to take full advantage
of all the infrastructure you have underneath.
Now, from our perspective, it's very common for some of our bigger clients to have some features in Redis, some features in Cassandra, some features in Mongo, whatever. They might have a variety of different places. They might have some things built in Spark, some built in Snowflake. And Featureform works really well in those situations because it sits on top of all of it.
And it provides one unified abstraction to define the features, to manage them, and to serve them.
Now, from our perspective, a vector database is just another kind of online store, what we would call an inference store. It's just another place to store features, and an embedding, again, happens to be a specific type of feature.
And it has this new operation for lookups,
which is nearest neighbor lookup.
So we just need to provide both of those operations.
The other thing that we do is we orchestrate transformations. As a data scientist, you define your transformations in our framework, which, again, will be 99% the same code you would write: the same SQL query, the same PySpark, Pandas, whatever. And then we have this kind of function that you put it in, and you might give us metadata, like a name, a version, a description, an owner, and a lot more if you want. You can set a schedule, and we will orchestrate those transformations for you.
Now, from our perspective, a transformer, especially a pre-trained transformer, even an LLM, is just a new type of transformation. It takes an input, which is text, and it outputs an embedding, which is just a feature, a special type of feature, like I keep saying. And that type of feature you typically want to store in a vector database if you want to do nearest neighbor lookups. You don't have to; if you're just going to do key-value lookups, you can put it wherever. But if you're doing specifically a nearest neighbor lookup, we put it in a vector database.
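A minimal sketch of that flow, treating a pre-trained transformer as the transformation whose output lands in a nearest-neighbor index; the model name and texts are assumptions, and this is not Featureform's API:

```python
# pip install sentence-transformers annoy
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings

docs = [
    "How do I cancel my subscription?",
    "What songs are trending this week?",
]
embeddings = model.encode(docs)  # the transformation: text in, embedding out

index = AnnoyIndex(embeddings.shape[1], "angular")
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)

query = model.encode(["unsubscribe from the service"])[0]
print(index.get_nns_by_vector(query, 1))  # nearest doc id, likely [0]
```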
So Featureform is this workflow tool. It's the tool that encodes the feature workflow on top of existing infrastructure. And the vector database is a tool that provides this new specialized index, which happens to be a nearest neighbor index for embeddings. Transformers are a special type of transformation, which happens to create embeddings and happens to be a model itself. And an embedding is a specialized type of feature that has all the characteristics I touched on. So that's how they all relate to each other. I should create an embedding graph of all those concepts, and we can look at it.
Absolutely, yeah.
I think we should do that.
All right, cool.
Whenever we talk about the infrastructure side of things, I hear you mentioning a lot of more traditional technologies that have been used: Cassandra, Spark, Redis, all these things. But today we also have all this craziness with these huge large language models, the OpenAIs and so on, that can do all these crazy things. Before we get to how that changes things, though: is there a distinction between ML and AI? I was wondering about that myself. I use these terms in a very mixed way many times. I have a distinction in my mind of what the difference is, but I don't think it's explicit, and I'd love to hear your opinion on that. Then we can continue on how things will change because of all this.
Yeah.
So there is a difference in what they mean, but like most terms, it's really about what they mean in practice. And historically, pre-LLM, honestly, my take was: if you said AI, you didn't know anything about ML; if you said ML, you knew about ML. AI was the hand-wavy way of saying it, and ML was much more concrete. Over time, I think people have associated AI with the foundational models, the LLMs, the GPTs, and ML with what I'm calling traditional ML. Whether it's really machine learning, and whether GPT is intelligent, is maybe another question for another time. But in practice today, I think it's an accepted thing that AI means that class of model and ML means everything else. And I think that's totally fine and fair, because they're different. The way you use them, the way you think about them, the way you interact with them is just so different. It's not just ML; it has to have its own term. I prefer foundational models, but AI is fine.
Okay. Yeah. Makes sense.
I love the answer. Very helpful for myself also, to be honest. All right, so how will things change in the future? First of all, do you see feature stores, and Featureform specifically, changing their roadmap because of these foundational models? If you reflect on how you were thinking a year ago about how Featureform was going to be successful, has this changed, and in what way? How do you see things changing?
Yeah, I think things are changing very rapidly. It's kind of insane how fast. How I would frame this new class of model: it's almost like the straw that broke the camel's back. You could argue that there's a model most data scientists are aware of called BERT, which also had a line of models that came before it, like ELMo, funny enough, that really brought this new type of transformer into the hands of most data scientists and gave anyone, quickly and easily, the ability to have state-of-the-art NLP with this kind of generic model. So for me, that was a very magic moment. I can't even remember when that came out. It must have been like five years ago now. And now, with this GPT-4
stuff, GPT-3, and ChatGPT, I think what's happened is, even though we're using a lot of the same techniques, obviously much more specialized, much more nuanced, better, and we continue to get better, it's the same, let's call it, category of solution as a BERT. But now the big difference is that as an end user, even my grandma looking at it, it feels like it's past that line. It feels real now. It doesn't feel like a research project that's interesting. It has finally crossed the line where it's good enough to use in way more situations than an older model like BERT could be used in. I don't know if it's past the Turing test or whatever, but it's getting there, in that if you interact with ChatGPT, even if you know you're talking to the AI, it feels good. It's not obvious, it's not like, wow, this thing's really bad; it feels pretty solid. And I think that's been the big change. We're finally at the point where a lot of use cases and problem statements that have been unreachable and unattainable suddenly opened up. Now, I'm sure as you've seen, a lot of the application-layer products built on top of GPT look very similar.
They're all solving very similar problems. That's because, in my opinion, prompts are kind of the wrong tool for the job. I made this joke earlier with someone: there's this thing in evolution and biology where the crab, the animal, has been created evolutionarily many different times from many different species. And the joke is, you know, the crab is the global minimum; the best thing we can come up with is the crab, it's perfect. So that was my joke, and I joked the same thing about SQL: prompts are wrong, and in the end we're going to have this SQL-like language to interact with these models. And I think that will happen.
I also think what's very likely to happen is embeddings, which we've talked about. I talked about how I used to perform all these operations on them that weren't just nearest-neighbor lookups. And I think that embeddings are a much more natural kind of intermediary to use and build upon to build much more complicated applications.
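As an aside for readers: here is a minimal sketch, using made-up NumPy data rather than anything from Featureform, of what it looks like to compute on embeddings as an intermediary instead of only doing nearest-neighbor lookups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 64-dimensional item embeddings, one row per article.
item_embeddings = rng.normal(size=(1000, 64))
# Normalize rows so dot products act like cosine similarity.
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

# Operation 1: build a user embedding by averaging the embeddings
# of the items that user actually read.
read_item_ids = [12, 87, 431]
user_embedding = item_embeddings[read_item_ids].mean(axis=0)

# Operation 2: nudge the user vector toward a topic vector before scoring,
# something a plain nearest-neighbor lookup can't express.
topic_vector = item_embeddings[600]
query = 0.8 * user_embedding + 0.2 * topic_vector

scores = item_embeddings @ query        # similarity score for every item
top_5 = np.argsort(-scores)[:5]         # indices of the 5 best matches
print(top_5)
```

Both operations work because the embedding space itself carries meaning; the lookup is just the last step.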
I think the reason why lots of the AI applications people are seeing all look the same is because the API that they're all based on is so simple. It's just text in and an output. And all they're doing is coming up with interesting prompts, essentially a template that they put together to try to make it do what they want. And I think that won't exist soon. So I think what will happen is that they will take GPT, and it will still have that interface, but they will likely expose a lot of the transformers that they have underneath the hood and allow you to use them for a cost.
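For readers, a toy illustration of the "interesting prompt" pattern Simba describes above; the prompt wording is hypothetical, and the actual call to a hosted model is omitted since it depends on the provider.

```python
# The application layer is often just string interpolation around a
# text-in/text-out model.
PROMPT_TEMPLATE = (
    "You are a support assistant for {product}.\n"
    "Answer the customer's question in under 100 words.\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(product: str, question: str) -> str:
    # Most of the "application logic" lives in how this template is filled in.
    return PROMPT_TEMPLATE.format(product=product, question=question)

prompt = build_prompt("Featureform", "How do I register a new feature?")
print(prompt)  # this string would then be sent to the model endpoint
```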
And embeddings, again, become this core piece of ML. In NLP, this has been true for years, but I think this will make it true for a lot of different parts of ML, like recommender systems and other places where we've been using them, but this, I think, will become much more powerful.
And the vector database will be a core piece of that.
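To make the vector database's role concrete: a bare-bones sketch of the lookup such a system serves, on random data. Real systems (FAISS, Milvus, and the like) add approximate indexes, filtering, and persistence on top of this idea.

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 10) -> np.ndarray:
    # Assumes the rows of `index` and the `query` are L2-normalized,
    # so the dot product is cosine similarity.
    return np.argsort(-(index @ query))[:k]

rng = np.random.default_rng(1)
index = rng.normal(size=(10_000, 128))
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = index[42]                  # pretend this is a freshly computed embedding
print(top_k(query, index, k=5))    # item 42 itself should rank first
```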
The feature store on top of that, it's a different workflow, but it's not that different.
Again, embeddings are just a specialized type of feature. A transformer is a specialized type of transformation. You can fit them together. It's very common for us to take embeddings and use them as inputs for other models. Rather than using Costas as an identifier, like a user ID, I would just take his embedding and put it into the model to do a ranking step. So now I have these really generic embeddings that work super well, and I can just feed those in as well. But traditional ML won't go away, especially if you start thinking about all the specialized use cases people have, fraud detection, etc. You can't just run that through an LLM. It just doesn't really fit that way. You can use the embedding intermediaries that the LLM has and sprinkle that on your fraud detection to make it better. And that's what I think we'll start to see happen.
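A hedged sketch of "embeddings are just a specialized type of feature": a precomputed user embedding is concatenated with an ordinary tabular feature and fed into a downstream classifier, a stand-in for the ranking or fraud model. The data, labels, and model choice here are illustrative, not Featureform's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

n_users, dim = 5_000, 32
user_embeddings = rng.normal(size=(n_users, dim))           # from an upstream model
tx_amount = rng.exponential(scale=50.0, size=(n_users, 1))  # a classic tabular feature

# The embedding is just more columns in the feature matrix.
X = np.hstack([user_embeddings, tx_amount])
y = rng.integers(0, 2, size=n_users)                        # fake fraud labels

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:3])[:, 1])                     # fraud scores
```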
So it'll kind of be this joining of the two. And Featureform remains; there's still a data science workflow. It's not like data scientists are going away. So we'd remain that workflow layer that data scientists interact with, covering all of our features, both embeddings and non-embeddings.
Makes a lot of sense.
All right.
We could continue this conversation for many hours, which I think means that you have to come back.
Of course. I'm happy to.
And we will do that.
But now I'll pass the microphone back to Eric,
because we are closer to the buzzer, as you say. I did that again, Eric, I stole your... Wow, Costas! Yeah, wow. Yeah, you stole my line. That's okay, though. Okay, Simba, this will probably turn into two questions, but I believe what you're saying about the way that things will play out. But in most businesses, it's the business logic that is non-predictive, right? So we're talking about sort of basic business logic, say for a key KPI or something; it doesn't rely on ML at all,
that leads the decision making. And when you were talking about the bleeding edge of subscription machine learning models, it really seemed like you were knocking on the door of machine learning helping drive business logic.
But I really still see a gigantic gap there in that core KPIs are going to drive the business.
And machine learning is still really early.
And ChatGPT is exciting, but there are all these other components, right?
So help me understand, from your perspective: I believe that there are recommendation models and features that can really help lead the business logic, but it seems like most businesses are going to lag behind. So how do we cover that gap?
I think, I don't know if I can answer how to cover the gap, just because it's almost like asking how old enterprises modernize. And I think that's kind of been an age-old question that I'm not going to have an answer to.
I want to maybe dive into part of the question, which is something I just want to highlight: lots of my engineers at Featureform use ChatGPT a lot. And it's become really useful, and I use it too. I'm writing a blog post and I'm trying to think of something. I'm kind of stuck on this paragraph.
It's taking too long.
I will essentially ask ChatGPT, hey, describe this for me. And I'm like, no, that's not right. But it's easier for me to take something and see why it's wrong than it is for me to come up with it out of thin air. So I think what it does is it enables people, especially people who are making decisions as their job, to do it better, because they have this kind of machine to get drafts from, which isn't always right, but it always has an answer. And the answer is never extremely stupid; most of the time it's directionally right, and you can use that and feed it in, like a feature for your own brain, so that you can make the best possible decision. I think that's what we'll see happen. I think we're already kind of
seeing it happen. And on the metrics piece: people are still making these decisions. We're just not at the point where we look at what the model is saying and go, oh yeah, we're all out of a job, everyone. I had a podcast where someone joked that next time maybe we'll just have our LLMs talk to each other and we'll just sit back and have a beer. So we're not there yet, but I do think
that we are definitely at a point where it is a multiplier effect. And data, and ML, has always been a multiplier effect. That's the whole pitch: software was this multiplier effect of productivity per person.
I could write one line of code and it automatically scaled across a hundred million users.
A hundred percent.
Yeah.
Now it's that, but it compounds; that's what data does and ML does. And then LLMs are just this newest lever that takes full advantage of that and maybe creates some other new kind of third-order growth activity. Yep. I love it. Well, Simba, this has been
a wonderful episode; as Costas said, there's so much more to talk about. So we'd love to have you on again, but thanks for giving us some of your time. Of course. Thank you. This was a lot of fun. Wow, Costas, what a fascinating conversation with Simba from Featureform.
I feel like the conversation
spanned a much larger
footprint than just
features and even
MLOps. I mean, we talked about
so many different things.
But I think what I'm going to
take away is that
his background in trying to understand
how to create a great moment for a user,
very clearly influences the way that he thinks
about building technology that ultimately materializes
into data points.
Of course, we can call those features.
There's embeddings and there's all sorts of technical stuff.
But it's very clear that Simba is building a technology that will enable teams to use
data points that create really great experiences.
And I think that comes from him facing the difficulty of trying to understand why, of the millions of visitors, only a handful of people will subscribe.
And that to me was really refreshing because MLOps is a very difficult space.
Feature stores and all of the surrounding technology can be very complicated.
There are a lot of players.
But it's clear that Simba just wants to help people understand how to drive a great experience
using a data point that happens to be derived, that happens to rely on a lot of data sources, and that happens to need to be served in a very real-time way. But to him, those are consequences.
Yeah, a hundred percent. I mean, okay, first of all, Simba is a person with a lot of experience, right? He has experienced many different phases of what we call ML or AI.
And he has done that in a very production environment, right?
So he has seen how we can build actual systems and products and deliver value with all these technologies, which is obviously something very important
for him today as he's
building his own company. And I think it's
like an incredible
advantage that he has.
We didn't talk that much about the developer experience, and maybe this is something that we should have as a topic for another conversation with him: how all this complicated infrastructure, with all these different technologies and all the stuff that we discussed, can deliver an experience to the developer who works with all of it and make them more productive. But
what I'll keep from the conversation that we had with him is that he gave an amazing description of what features are, what embeddings are, how they relate to each other, how we go from one to the other, and how we use them together.
And, most importantly, how all of these will become some kind of universal API for ML- and AI-driven applications in the near future. I'm not going to say more about that, because I want everyone to listen to Simba.
He's much, much better than me
talking about that stuff.
But there's a wealth of very interesting information
around all the things that are happening in the industry today and will happen in the next couple of months.
So, yep.
I agree.
I think if you want to learn about features,
there's actually way more in here.
And I think you'll learn about the future
of what it looks like for MLOps
and actually operationalizing a lot of this stuff.
So definitely take a listen.
If you haven't subscribed, definitely subscribe,
tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.