The Data Stack Show - 144: Explaining Features, Embeddings, and the Difference Between ML and AI with Simba Khadder of Featureform
Episode Date: June 28, 2023

Highlights from this week’s conversation include:
- Simba’s background in the data space (3:05)
- Subscription intelligence (6:41)
- ML and Distributed Systems (9:09)
- The Brutal Subscription Industry (12:31)
- Serendipity in Recommender Systems (16:31)
- Subscription as a Strategy (20:47)
- Customizing Content for Subscribers (22:19)
- Creating User Embeddings (25:53)
- Building Featureform (28:01)
- Embedding Projections (32:47)
- Spaces and similarity (35:53)
- User embeddings and transformer models (38:22)
- Vector Databases for AI/ML (45:05)
- Orchestrating Transformations in Featureform (51:00)
- Impact of new technologies on feature stores (56:17)
- Embeddings and the future of ML (59:20)
- The gap between ML and business logic (1:02:26)
- Final thoughts and takeaways (1:06:37)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles.
It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake.
You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show. Costas,
we are going to talk with Simba from Featureform. And boy, do I have a lot of questions. We actually did a lot of data science stuff last summer.
We talked with people building feature store stuff.
We talked with people building ML op stuff.
But Simba actually has a really interesting perspective
on the entire spectrum of problems in the space.
So I'm going to leave you to talk to him about the technical details.
I'm going to ask about the moment of serendipity. So he did a ton of subscription work,
and he figured out why people would subscribe to publications like the New York Times, etc.
And so I'm going to ask him about that. It's super interesting to me
because I think machine learning
can help us understand a lot about that.
But, you know, of course,
being a consumer behavior guy,
it can't answer everything.
So I want to know what he knows about that
and then understand how features relate to it.
So what are you going to ask him?
Yeah, you mentioned that we did a few episodes in the summer, a couple of months ago, right? About MLOps and the tools and the technologies in this space. But I think we are living right now in a completely different world in terms of the technology landscape, especially because of all the LLMs and OpenAIs of the world, and all these new technologies where we are still trying to figure out how they are going to change things. So I think we have the right person to discuss that with. I'd love to talk with him about more foundational things: what are embeddings, what are features, what are feature stores.
Let's revisit all these terms that we have known for a while now, and see how they have changed because of everything that has happened in the past six months in the industry. So yeah, that's what I'm going to focus on, and I'm sure, because I know Simba, that it's going to be a very interesting and captivating discussion.
All right, let's dig in.
Let's do it.
Simba, welcome to the Data Stack Show. So great to have you.
Hey, thanks for having me.
All right, so give us your background.
How did you get into data,
and what was your path that led you to Featureform?
I started at Google, and I was there for a little while.
I can say I'm one of the few Googlers to have ever written both PHP and x86 at my time there.
Oh, wow.
I worked on a data store.
Yeah, it was kind of interesting. I truly went on both ends, and I've learned that it's kind of a horseshoe: it kind of sucks on both ends. You kind of want to be in the middle.
Yeah. On different teams? Can you tell us, what did you start with, PHP?
Yeah, I definitely earned my stripes before getting to x86. It just happened to be different projects. For the PHP, I worked on the data store. I was working more on the API side. I worked on a lot of different parts of it, but one thing I worked on was fine-tuning some of the lesser-used APIs, and one of them was PHP. Someone had to do it, and it happened to be me. And so I learned more about PHP than I ever wanted to, but I can officially say I've used it in prod at Google scale, which I don't think many people, I mean, maybe Facebook people, can say.
Okay, cool. Moving on.
And the x86, I worked on Google-wide profiling.
So I worked on making Google go faster.
I worked specifically on protos and a few other things.
But I mean, I was really focused on search.
So obviously a big piece of what Google does.
And so I worked on making it go faster.
And that was kind of the start of my career.
I always really liked hard technical problems. I mean, this was when I was pretty much right out of school. Most of what I worked on, most of my studies, and what made me happy up to that point was distributed systems. I had dabbled with ML. I'd done a bit. Some of the stuff I was touching was search.
I actually got to interact with some of the
ML team briefly at Google
and just got to learn a bit.
It happened to be that TensorFlow was coming out right when I was there. So I got to see some of the early iterations and just kind of got hooked. I think what drew me to distributed systems to begin with is how messy it is. I never liked computer vision, because I found the answer being typically binary so boring: it is or it's not this thing I'm trying to classify. But distributed systems aren't like that. I really liked that there was never really a right answer. Every solution is a give and take, and there's a little bit of an art to doing it well, I think.
And then after that, I left Google. I had a lot of product ideas at Google. This was also when Google Cloud was coming out. AWS was the behemoth, but, you know, maybe Google Cloud would eat their lunch. We were kind of at the tipping point where people were like, oh, maybe that's not going to happen. And I had all these ideas. They were probably bad ideas, I was like 20.
But I still wanted to go and learn and continue learning. So I left Google and started my first company, which was Triton. At Triton, we went through a lot of iterations. I learned a lot. I didn't start Triton with an idea. All I had was a logo and a name and a co-founder. Yeah, I had nothing. So whenever people are like, oh, I'll leave when I have a good idea, I'm like, it's not a prereq. I didn't have one. You can do it before that. Yeah, we just figured it out. And honestly, there are pros and cons, but it at least allowed us to build something real, because we weren't a solution in search of a problem.
We actually had to go find a problem and solve.
And we landed on, we called it subscription intelligence.
We did everything from personalization as a service.
We'd help people do recommendations.
But really the goal of it wasn't just recommendations for recommendation's sake.
We really were focused on driving subscriptions.
There was kind of this movement.
It's still kind of happening for B2C products and companies to move from ad-based models
to subscription.
Yeah.
I think it makes a lot of sense.
It's much more tied to value. With ads, it's almost like a bait and switch: I'm trying to get you to use this thing, but I'm not really giving you value, because you're the product. Subscription always seemed to make more sense to me.
It's also much less wasteful for certain categories.
And anyway,
so I was helping drive that shift by helping companies who were still treating things in the ad-based way. We worked with publishers and news companies, and while there were obviously teams that had switched, I think as a whole they were still taking the ad-based methodology that worked well for chasing eyeballs. I was trying to help them use data to figure out how to change their strategy to drive more subscriptions, decrease churn, and understand why users subscribe. That was really the whole tagline. My easy one-line sales pitch was: do you know why your users subscribe? And the answer was almost always, not really. And so that's what we'd sell them.
Yeah, I love it. Okay. So many questions about sort of that moment of subscribing and, you know,
a number of things on recommendations, but let's rewind just a little bit. So you got exposed to distributed systems at Google.
I would love to hear about the moment or the epiphany of saying,
this is a real thing that is going to affect my job and data and all that sort of stuff.
When did that happen for you where you sort of, you know,
were going from working on, you know,
PHP stuff to, you know, realizing,
okay, distributed systems are going to be like
a really big deal.
Yeah.
So, two points. Firstly, at that point it was already clear that distributed systems were a big deal. What I learned was going to become a big deal was ML, and just more data generally. I don't even remember if Spark had come out yet; it probably had, but it wasn't widely used. It was still kind of a new era.
And I think what I really saw,
I think distributed systems,
Google has always been this king of distributed systems.
They've always been ahead of the curve.
They released the MapReduce paper.
They released Borg, the internal predecessor of Kubernetes. It was a Google thing.
I think they've always done that very well.
I mean, even going back to like having commodity hardware,
like I think there's like early stories of Google where one of their
innovations was to duct tape on hard drives.
So if a hard drive failed,
they could literally just like rip it off and place another one on without
doing anything.
And that was an innovation that Google came up with. So I would say at that point that had kind of played out, but I was just interested in the problem space. And there was this adjacent problem space, which is ML. It's obviously very different, but there are a lot of the same patterns and characteristics that make the problem really hard, which drew me for the same reasons distributed systems did. But there was an extra kicker. Everyone talks about how ML is going to change the world, but especially seeing it at Google, I really began to understand how every interaction, at least every digital one, really everything, is going to have ML behind it. I feel like an interesting exercise would be to see how many models you interact with on a given day just doing your job. It's probably a lot, definitely over 100 for an average tech worker. Every time you buy something, there's fraud detection plus a handful of other models. Every email you get, there's a marketing model behind it. There's just so much. I mean, Google Search is a model.
So there are hundreds of models that we interact with just on a daily basis. Even when you go to the grocery store or buy something, everything in the supply chain is going to be models attached
to that. So I think once it clicked, I realized that, hey, this is not only a space I find interesting, but a space that's still ripe for impact. Distributed systems had played out so much that at that point you kind of had to be at a PhD level to really have an impact, and even then, we were in the optimization stage of that trend. Either we were going to have a whole paradigm shift, which didn't happen, and I can't even imagine what that would look like, though surely it will happen one day, or incremental gains. ML, meanwhile, was still kind of greenfield space. No one really knew how to do it; no one does it well. So I think that's what drove me there.
Okay, this is fascinating to me. So you went to subscriptions and you worked with,
you know, media companies, which, as someone with a marketing background, I see as sort of the most brutal, bleeding-edge space: man, how do you get a bunch of anonymous traffic to convert, on small margins? So you go from, okay, distributed systems, this is interesting, ML, this is going to have a big impact, and then you go right for the bleeding edge of the hardest problem. First question: why did you go for the hardest problem? I mean, subscription for media is brutal. Is it because it's hard? Like, hard problems?
That was definitely part of it. I think, again, we
were in search of a problem, and that's a huge one, and I think we kind of landed there. I actually kind of forget why we ended up in media. My old co-founder, who's moved on now, actually, has always had a big tie to media. He's always been attracted to it, so I think there was a bit of founder-market fit there, and I think we leaned in that direction.
And the problem space was super fascinating, and it was very untouched. They weren't really doing much ML there. And the reason for that, which is maybe why it was hard to have a huge business there, though we ended up being relatively successful, was that you don't really make a lot of money per user. It's a pretty brutal industry.
So, I mean, VCs weren't super stoked about us when we told them that we were selling to media.
Yeah.
A dollar per month, you know, a dollar a month to get access to this content.
Another interesting point: I don't know what the number is now, but we found that, I think it was The Guardian, still made something like 50% of their revenue off of print at that point. And it wasn't that long ago, I believe. There were all these really interesting things we learned about how the industry works and functions, which were really surprising to us. Like, the reason why, if you go into an airport, you see Hudson News or all the news names on those stores is that the news companies were these big conglomerates coming up with new business lines. For some weird reason, someone, I forget who it was, I think it was CBS or maybe CNN, came up with, we're going to do these kiosks in airports, and that's going to drive a ton of revenue. Anyway, there's just a lot of weird stuff I learned about news media and how it works. It's fun. I got to become an expert in something I never thought I'd be an expert in. Obviously it's societally very relevant to know a lot about how media works, but even technically, it was just a fascinating problem space.
Yeah, absolutely. And so you solved ML problems on the bleeding edge of, like,
the most difficult, you know, sort of low-margin problem space. And that's obviously relevant to all sorts of other spaces. When we were talking before the show, you mentioned something called the serendipity moment, and how two users who look very similar can follow the same path, and maybe one subscribes and one doesn't, and how that's a very challenging machine learning problem. Can you break that down for us, and start by describing: what is the serendipity moment?
Yeah, the serendipity moment. We've all felt it: that feeling when you find something you weren't really looking for, but it's kind of awesome, it's exactly what you wanted. That feeling, that little dopamine hit, is a serendipity moment. In recommender systems, let's say you go on, I don't know, Spotify, YouTube, you name it. Let's say Spotify, and Spotify recommends a song for you, and you're like, I've never heard of this song, I don't know who this artist is. And you click it, and at first you're kind of skeptical, because it seems off, but then on a second listen you're just like, this is my new favorite song. That moment is exactly it. That moment is magic. The problem is, and this is actually known even pre-digital, this is even the way a grocery store sets up its aisles, one thing they consider is that serendipity moment and what is most likely to trigger it. It's almost like you have to hit this gray area where it's not obvious. If it's obvious, it's not serendipitous.
If you love an artist, you love Red Hot Chili Peppers, and I recommend another Red Hot Chili Peppers song from an album you really like, you might like it, but chances are it's not serendipitous.
Yeah, I mean, that makes sense. You kind of have to be a little bit off target, but if you go too far off target it becomes random. It's a discovery exercise, right? It's exposing you to something that you're likely to like, but that is not 100%, we-know-you-already-like-this.
And the hard thing, and why I think I really got pulled into this space, is that it's kind of immeasurable. There are papers on ways you can try to measure it, but it's essentially immeasurable. We can't really measure the serendipity of a recommendation.
And it only really makes sense when you're doing things for humans, because with computers, most things have answers. Serendipity is really hard because it's a human effect. I mean, you could attach something to someone's head and look at every neuron and try to figure out the serendipity of it, but until we can do that, we're in this hard place where we have to somehow use behavior to see if we captured serendipity. And it's just really hard.
Like you mentioned, you might have two users who are exactly the same, and if you show one of them a song, they might be like, oh my god, this is my new favorite song, and it completely changes their course. And the other user will be like, this sucks, why the hell did you recommend this to me? That's just what happens. It's impossible to know until after you do it, because you have incomplete information. And with humans, you will always have incomplete information, because you can't plug into their brain wiring and make it deterministic.
Yeah, for sure.
I mean, it's like, okay, you recommend a song, and one person's just been through a breakup, but another just had a good date, right? They're the same until there's a divergence, and the song is going to mean different things.
Okay, I want to dig into feature form
and especially how features influence like this sort of stuff.
But one question before we dig into that,
how do you balance as someone who's building technology in this space,
the influence of like the commercial element, right? So you mentioned grocery stores. And so
it's like, okay, well, how do you create serendipity? But there are also people bidding
for the end cap, you know, to put their cereal on the aisle that's most prominent. And,
you know, marketplaces are like that. And of course, like that's a business model for marketplaces where you can bid for space.
Like, how do you consider that as part of a recommender model?
Great question.
It actually comes back to why I think subscription is a very good strategy for a lot of these companies: it's essentially, hey, I'm trading value, but I'm averaging it out over a month, let's say, probably over a little more time. So it's more like, you're buying in. Once you're bought in, I don't have to play games anymore. All I have to do is make sure you're getting as much value as I can give you to justify your $5 investment or whatever.
Sure.
And typically with subscriptions, I mean, you also have to balance the cost of goods; there's an equation there you could think through. But in general, we leaned towards: just assume the costs are costs. If you can get people to subscribe, it's all worthwhile. Just make sure they're getting value, because when people subscribe, they get an all-you-can-eat type experience, especially in media.
Oh, sure.
Yeah.
So it's more like...
And then you can customize that experience subsequently to make sure that they're getting value, yeah.
Well, part of what the product would do is also recommend types of content that do well, and it might change the kind of content that gets created. This is why The Information and The Athletic, who have become very successful, have very different types of content in their subscription tiers than other companies. They were built subscription-first, and they came to the conclusion that content you could only get there (no one else has it, typically a bit more opinionated, typically a bit longer, a bit more dense) is more likely to drive subscriptions than headline-style content, which is great for getting clicks and views but might not drive subscriptions. So anyway, there are a lot of things that come into play.
And I think the short answer is we're trying to measure value. But how do you measure the value of a recommendation, like a Spotify song recommendation? Well, you can't. You can only see what the user's behavior looks like after. And then you use that to say, well, that must have been successful: I recommended this song from an artist they'd never seen or heard of before, and now that's their top artist. It must have been a very successful recommendation. I provided value, so they're very likely to stay a subscriber, because they had that magic moment.
and so we tried to like make that happen and maybe going into like the, how we do it a bit and
diving in, just be in the shoes of a data scientist at a training, like you open up
this, let's just say it's a giant file.
Let's just make it simple.
It's a CSV and he's CSV.
It's like, Hey, this user may be anonymous, likely anonymous, actually
very less anonymous looked at or listened to this song at this time.
And maybe I'm some song metadata and some table somewhere.
Yeah.
And you just have like billions of those rows, like more, like you just
have this ridiculous number of these and you have you know like we're handling 100 million multi-active users
at rp you can't like go one by one in fact if you pick a thousand of them it's still like doesn't
matter like a thousand of those like most of them are just noise anyway so what became really
So what became really interesting, and the other thing that we did, which is unique to us: we had this recommender system, but it wasn't like we were just offering recommendations as a service. I was having trouble closing a few deals, and I went to this prospect, actually flew out to New York, and got a beer with them. I brought a contract I had printed out, and we just chatted, not about the deal, we just chatted. Then I brought it out and said, I brought a contract. I gave him a second to look at me funny, and then I said, I'm not expecting you to sign it. I just want to know why you won't. I just want to know what's going on in your head right now. And attribution was the big problem. So I asked, if I solved that, would you sign? And his answer was literally, forget the recommender system. If you solve that problem, you'll understand why my users subscribe. I'll pay you for that. Even without the recommender.
And what I realized we had to do: the recommender essentially understood why people subscribed. The way the model was designed was actually to drive more subscriptions. It wasn't just recommending things people click on; the loss function wasn't exactly subscription, but it was much more correlated in that direction. And so we actually had to build this almost-explainability layer on how the recommendations worked, but display it not to a data scientist who knew what the model was, but to someone who couldn't care less about what the model was, while still giving them some value.
And the way we did that, real quick, then I'll hand it back, is we'd create these user embeddings. We can dive into what that means, but generally what we would do is create these embeddings, which were like holistic views of the user. And we would cluster them together into what we called personas. So we'd have these n personas, and we would then provide more traditional BI, like here's how often they come along, here's what days they come, these very traditional metrics, but we would do it for these magical personas. And those magical personas were generated by clustering, essentially, the recommender system's holistic view of the user.
And I think that, yeah, that was the magic of being able to capture and fit all those things together. And yeah, so I'll pause there.
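A minimal sketch of that persona idea, assuming the user embeddings have already been computed; the dimensions, cluster count, and metrics are illustrative, not Triton's actual pipeline:

```python
# Cluster user embeddings into "personas", then report traditional
# BI-style metrics per persona rather than per individual user.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
user_embeddings = rng.normal(size=(10_000, 64))  # one 64-dim vector per user

kmeans = KMeans(n_clusters=8, n_init="auto", random_state=0)
persona = kmeans.fit_predict(user_embeddings)    # persona id for each user

# Join personas onto ordinary usage metrics and aggregate per persona.
metrics = pd.DataFrame({
    "persona": persona,
    "visits_per_week": rng.poisson(3, size=10_000),
    "is_subscriber": rng.random(10_000) < 0.05,
})
print(metrics.groupby("persona").agg(
    users=("persona", "size"),
    avg_visits=("visits_per_week", "mean"),
    subscribe_rate=("is_subscriber", "mean"),
))
```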
No,
I love that.
I love it.
So many questions, but I've been monopolizing.
So why don't we do this?
Can you give us, which I should have asked you to do at the beginning, the Featureform pitch?
And then I want to hand it off to Costas to dig into how you actually do that from a technical
standpoint.
Yeah. So again, picture me looking at this giant CSV of all these users. Most of what I'm doing is coming up with features. Those features would be: hey, how often did this user come to, let's say, Spotify? What's this user's favorite song in the last seven days? What's their favorite artist in the last 30 days? I would generate all of these features.
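As a rough illustration of that kind of feature engineering, a pandas sketch over a hypothetical play-events table; the column names and data are made up:

```python
import pandas as pd

# Hypothetical events table: one row per listen.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "artist":  ["A", "A", "B", "C", "C"],
    "ts": pd.to_datetime([
        "2023-06-01", "2023-06-20", "2023-06-25",
        "2023-05-01", "2023-06-27",
    ]),
})
now = pd.Timestamp("2023-06-28")

# Feature: how often did this user come in the last 7 days?
last_7d = events[events["ts"] >= now - pd.Timedelta(days=7)]
plays_7d = last_7d.groupby("user_id").size().rename("plays_last_7d")

# Feature: favorite (most-played) artist in the last 30 days.
last_30d = events[events["ts"] >= now - pd.Timedelta(days=30)]
fav_artist_30d = (
    last_30d.groupby(["user_id", "artist"]).size().rename("plays")
    .reset_index()
    .sort_values("plays", ascending=False)
    .drop_duplicates("user_id")
    .set_index("user_id")["artist"]
    .rename("favorite_artist_30d")
)

print(pd.concat([plays_7d, fav_artist_30d], axis=1))
```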
I'd also generate embeddings, which we'll get into, I'm sure.
And being able to do that alone was hard. It required wrangling Spark. I'd have to materialize things to Redis. There was a lot of work at a very low level. But the worst part, which I'm sure any data scientist listening to this can relate to, is we would have these Google Docs full of SQL snippets. We would have Untitled118, the IPython notebook that we'd be copying and pasting from. We had no source control, no versioning, nothing. It was all ad hoc. And we couldn't at any point in time look at a training set and be clear about, hey, which features did we use, and how were they created exactly? And could we do that again? It was just not on the table. It was all done in such an ad hoc fashion.
So we built Featureform to be this kind of framework that sat above the infrastructure, so we could still take full advantage of our infra, because, again, we were handling 100 million MAU. But it allowed the data scientists to define and manage and serve their features, their training sets, their labels, everything, in a framework that let them write SQL, write data frames, write what they're used to writing, but gave them the scaffolding to put all that together, so we could automate most of the low-level and mundane tasks that aren't just coming up with new features.
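To make that concrete, here is a hedged sketch of the scaffolding pattern, a hypothetical decorator-based registry in Python; it is illustrative of the idea, not Featureform's actual API:

```python
import pandas as pd

REGISTRY = {}  # (name, variant) -> metadata plus the transformation itself

def transformation(name, variant, owner, description=""):
    # Register a plain pandas function as a named, versioned, owned
    # feature transformation instead of an anonymous notebook cell.
    def wrap(fn):
        REGISTRY[(name, variant)] = {
            "owner": owner, "description": description, "fn": fn,
        }
        return fn
    return wrap

@transformation(
    name="plays_last_7d", variant="v1", owner="simba",
    description="Listens per user over a trailing 7-day window",
)
def plays_last_7d(events: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    recent = events[events["ts"] >= now - pd.Timedelta(days=7)]
    return recent.groupby("user_id").size()

# The registry now records how each feature is created, by whom, and in
# which version: exactly the lineage the ad hoc approach loses.
print(REGISTRY[("plays_last_7d", "v1")]["description"])
```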
All right.
So Simba, you mentioned two very interesting terms, and I would like to talk about both of them.
You mentioned the word feature
and the term embedding, right?
So why do we need two different terms, first of all?
What's the difference there, right?
And help us understand a little bit: what came first? What are the differences, or the similarities? And let's get a little bit deeper into both of them.
Because I'm pretty sure that there's a lot of confusion around these terms out there.
Yeah, and sorry, the first term was which? I heard the second term, embedding, but what was the first one?
Feature.
Feature, yeah.
And an embedding is a subtype; it's a type of feature. So let's first talk about what a feature is. Well, let's talk about how an ML model works. A model is this black box function. You might be able to understand it, but you can think of it in this case as a black box. It's a function: it takes signals, inputs, and generates an output, a prediction.
That's kind of how most models that we use work.
Now, those inputs are going to be things,
like in the Spotify example,
might be things like my favorite song in the last 30 days,
might be my favorite artist in the last seven days,
and might be a variety of different signals.
And I like using the word signals
because I think it's a better term.
It captures what a feature is better.
A feature is really just like signal from the raw data that you're providing into the
model.
Like in some computer vision cases, it might be just literally the raw image.
That's it.
Signal is just a raw image.
But in most situations, especially with NLP, especially with like any tabular data, which
is like fraud detection use cases, recommender systems, et cetera,
we take a lot of steps into taking that raw data,
taking our domain knowledge as a data scientist,
crossing that with some data transformations
to generate signal that we then feed into the model
to allow it to do the best job possible.
So that's what a feature is.
And let's call them traditional features.
You could imagine the feature pipelines that generate them being things like data frames or SQL. They're kind of well-understood concepts.
Now, an embedding is a very special type of feature.
So an embedding literally is a vector, a vector in the math sense: an n-dimensional point where each of the values is just a floating point number. Now, these embeddings have this interesting characteristic where you can embed a lot of different concepts. Let's say, again, I'm embedding users based on behavior. So I have a user embedding, which, if I'm Spotify, is maybe this holistic view of who you are as a user, what you like to listen to, trying to capture all the nuances that make you unique.
I somehow take all of that and I turn you into a point in space. Now, alone, that means nothing, right? Because it's like, cool, I have this random vector. That's great. You told me this is Costas; I'll trust you. But I don't know what to do with this. Where the magic comes in is when you have many points. When you're Spotify and you have many millions of users, you end up with millions of points. And it's almost like structure appears. If you Google embedding projections, you can see some of these structures.
They typically cluster, and there are a lot of really cool shapes. It used to be one of my favorite things to look at the shapes we could form based on different types of embeddings. There are all kinds of things that get injected into that space, that n-dimensional space. You typically visualize it as 3D space.
One thing, and the most obvious one that a lot of people are aware of
is users who are similar, like have similar music tastes in the Spotify sense
will be close together in space. So their vectors will have very similar values.
I want to dive into why that's hard for a second. Let's say I have a text embedding. So I have three pieces of text, and two of them are really close and one is far away. For the two that are close, the common way to vectorize, if you've ever taken an NLP class, the first thing you'll learn is a technique called TF-IDF: term frequency, inverse document frequency. That means you take the number of times a term, a word, shows up, multiplied by the inverse of the document frequency, which is how many documents actually contain that term. It ends up working out where common terms end up with a high document frequency, and therefore a low IDF, because they show up all the time, while rare words that a document uses often end up weighing really high.
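A minimal sketch of TF-IDF vectorization using scikit-learn; the documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the band played my favorite song",
    "the band released a new song",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Terms shared across documents ("the", "band", "song") get low IDF
# weight; terms concentrated in one document ("stock", "sharply") weigh
# high for that document.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```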
So this is a way to vectorize a piece of text, but it's kind of dumb, because it doesn't really understand the words, right? It just treats each word as an identifier. It works great, but it doesn't understand, say, sarcasm. You may have three documents that all have very similar words, but one is sarcastic, and that one's obviously very different. Now, an embedding, a good one from a good transformer, will actually capture all that nuance and put it into the embedding space. And the same with users and their listening behavior. It's not just, oh, they love Katy Perry and this user also loves Katy Perry, so we're near each other; it's a lot more nuanced than that. Now, the final thing, and I kind of went into this: when I build a traditional feature, I use a SQL query.
When I build an embedding, or something like an embedding, I typically use a transformer model. It's literally a machine learning model whose whole job is to take inputs and generate an embedding. You embed a concept, which is typically sparse data, into vector space. So anyway, there's a lot there, but that's a very long answer to a very short question.
No, I think you did an amazing job describing what an embedding is and how it differs from a feature. Hearing you describe an embedding, I kept two terms that you used. You mentioned a high-dimensional space, so there is the concept of a space there, and then we have points in this space, right? For people who have done basic algebra, it's not that different from what we were doing with vectors in algebra, right? We pretty much use the same algorithms at the end to calculate similarity. And I think that's also a big part of the beauty of this whole thing: we can take something so complicated and reduce it into a mathematical structure where basic tools from algebra can be used to answer questions about semantics, right?
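That similarity math really is just vector algebra; a small NumPy sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embeddings: near 1.0 means they
    # point the same way, near 0.0 means they are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
user = rng.normal(size=8)
similar_item = user + rng.normal(scale=0.1, size=8)  # a nearby point
random_item = rng.normal(size=8)                     # an unrelated point

print(cosine_similarity(user, similar_item))  # close to 1.0
print(cosine_similarity(user, random_item))   # much lower
```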
And that's what you were describing. You were saying, yeah, we can do things with frequency, but there's no semantics in there; we cannot really understand the difference between the meanings of the words. And that's what we do with embeddings today. But one of the things you mentioned: we have, say, the user embedding, the word embedding, whatever embedding, right? It seems like we need a different space for each one of these things we are trying to model. It's not like we will take a model that does word embeddings and use it to generate user embeddings, if I understand correctly, right?
You can, and we did it. We would put user embeddings and
item embeddings in the same space.
So we could actually use that to find: if I take a user, I can find the n items that are closest to it, and those would be the items with the highest affinity towards that user. So you can embed things into the same space.
The generic models typically don't do that; they have their own space that they're trained on. But if you own the transformer, it's very doable to build things in the same space. So I wouldn't say each thing has its own unique space that you can't cross between. If you wanted to, and I can't think of how to do this off the top of my head, but I'm sure it's possible, you could put images and users somehow in the same embedding space.
Another thing about the spaces: it's not like there's one user space that exists for all users.
The transformers themselves, the way transformers work, is that they are trained. I'll give you an example. I'll tell you how one part of our recommender system would create embeddings. We would have this model, and it would take all these, let's call them user attributes, just traditional features: what's their favorite thing in the last 30 days, what did they just listen to, their age, whatever other traditional features. We would feed those into this transformer model, and we would train the transformer to solve a surrogate problem. And the surrogate problem is really what defines the latent space. So the surrogate problem you train on is: hey, try to predict what the user is going to listen to next. Which is an impossible problem. There's just no way that, given those features, you'll have a model strong enough to guess with 99% accuracy what this user is going to listen to next.
Now, by doing that, you will there's a way literally how you do it is you
essentially take the last hidden layer
of a deep neural network and that's embedding that's literally how you create the embedding
it's actually those values is the embedding and there are many different tricks and techniques
like one thing you could do is rather than using the chances of clicking rather you predict how long will they watch the item so not only like, you predict how long will they watch the item.
So not only like which item, but how long will they watch it?
That will actually change the embedding space.
And funny enough, that change in embedding space will actually typically result in higher
quality recommendations than using click space.
So there's a whole science and art.
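A minimal sketch of that last-hidden-layer trick in PyTorch, under assumed shapes and a toy surrogate task; real systems are far larger:

```python
import torch
import torch.nn as nn

N_ITEMS, FEATURE_DIM, EMBED_DIM = 10_000, 32, 64

class UserTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
            nn.Linear(128, EMBED_DIM), nn.ReLU(),  # the last hidden layer
        )
        self.head = nn.Linear(EMBED_DIM, N_ITEMS)  # surrogate: next-item logits

    def forward(self, x):
        hidden = self.body(x)          # the values we keep as the embedding
        return self.head(hidden), hidden

model = UserTower()
features = torch.randn(4, FEATURE_DIM)      # a batch of user feature vectors
logits, _ = model(features)
target = torch.randint(0, N_ITEMS, (4,))    # "what they listened to next"
nn.functional.cross_entropy(logits, target).backward()  # train the surrogate

# At serving time, discard the head and keep only the hidden activations.
with torch.no_grad():
    _, user_embeddings = model(features)    # shape: (4, EMBED_DIM)
```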
And again, I love the art of machine learning. I love problems where it's creative and fun, and not just, hey, is this a hot dog or not a hot dog. No offense to computer vision people; I know I've been kind of talking it down. I love recommender systems, and embeddings in general, because of that art: I'm literally building a model, and it's a huge model typically, and it's really expensive to train, and I'm using it entirely for its last hidden layer, because that's the only thing that's useful to me. I actually don't care about the model itself beyond that.
So to build a model that generates embeddings, we start, again, from features?
Yes, yeah, it's all features. And actually, a funny thing: you can create embeddings on the fly as you're training your models, which is a whole other story, but we did that too. So anyway, yes, there are a lot of crazy things you can do, but I really want to highlight that embeddings are just a special type of feature. And even in that world, you're still using features; even to create the embeddings, you have to create features.
So it's a signal, right?
I mean, in general, as much as we like to imagine these types of models as magic wands that you sprinkle on text to magically make it do whatever, that doesn't really work. All these models are built and trained on very traditional features to create embeddings.
And typically, with the really cutting-edge models, when you think of a GPT, you actually start putting models on top of each other, so embeddings feed into other embeddings. If you're using images and text in GPT-4, which supports that now, I would be very surprised if those inputs aren't processed by different models to be embedded and then fed into another model to create a holistic embedding of the whole thing. Embeddings are just specialized features.
In fact, with Featureform,
a lot of the problems like this with traditional features,
like, hey, how did I create this feature?
How is it defined?
What version is this?
Where is it being used?
Who owns it?
Governance.
All the traditional problems, lineage,
that you would expect of, let's call them,
traditional features, totally happen with embeddings.
I can't tell you how many times I've had embeddings, and again, I've built my own vector database probably three or four times in my career, where I'd be like, man, I actually don't remember how I created that embedding. I don't remember which model I used, how I trained it, or where that model is, so I actually can't create new embeddings from it, because it's all just somewhere in the untitled notebooks on my laptop.
And so, yeah, I mean, that's why Featureform was built. Even now, people associate feature stores with traditional ML, but it was actually built with this kind of new-style ML originally. And we, let's say, cut that stuff out, because it wasn't cool when I was doing embeddings. It's cool now, but what I was doing wasn't cool then. So I was like, cool, let's cut that out, because no one knows what it is, and we'll focus on the traditional stuff. That's what we're using today. But we're actually about to release a lot of stuff in this space, stuff we'd actually built before. We turned it off, and now we're going to turn it back on, which is pretty exciting, actually.
I'm very excited, too, for this new stage.
Yeah, yeah.
And we will have the opportunity to talk about that. But before we get there, you mentioned vector databases. Naturally, they come up as the way to interact with and use embeddings; that's how most people hear about them today. So there is this concept of the vector database. As you said, at the end it's the embedding: from a representation standpoint, it's just a vector of float numbers. And we need to do operations on that stuff, right? Somehow we get them and we need to work with them, be able to search them, make comparisons, and so on. Pretty much the stuff that, let's say, we also do in a traditional database. So my question is, why do we need this new thing called a vector database? And how does it fit into the overall workflow when we are building data systems that also have some kind of AI or ML, whatever we want to call it, element in there?
Yeah. I'll get to how the LLM stuff adds a new flavor for vector databases. But as I mentioned before, I've built a vector database a few times in my career. We actually released one with Featureform, which is kind of deprecated now, just because there are plenty of other great options, which you should look at. But the reason I built it: okay, first, the problem to be solved. The problem to be solved, originally, before even the database part, was I would have these embeddings. A very common problem is doing a nearest neighbor lookup. Again, if I have a user embedding, I want to find the n items closest to it. Well, I just do a nearest neighbor lookup on this vector.
So that's how I would do it. Now, the problem is that doing a nearest neighbor lookup is a very expensive operation. Essentially, the only way you can do it correctly, 100% correctly, is to brute force. And so a variety of companies came up with approximate nearest neighbor algorithms. One of the most popular ones, which is funny because I think it's kind of lost in time now, but it was the most popular one, was one called Annoy, from Spotify. And get it: Annoy, A-N-N.
Oh, yeah.
So Annoy, an approximate nearest neighbor index, was in-memory. And the problem with that was, it's like if I give you a B-tree and say, here, this is your database. Well, that's great: you've solved the really hard algorithmic part, but there's all this stuff I have to build around it to actually deploy this thing. It was super common, I still see it, probably less now, but for a long time: to be honest, we would actually upload our embedding files into the Docker container with the model. And in that container, we'd read the file and create the Annoy index at startup time. That's actually how we would do it.
And then it made sense eventually to create a service that was almost like Annoy-as-a-service for us, which became, essentially, a vector database. I mean, there's more you have to do than that: you also have to persist to disk, and there's obviously more to it than just wrapping Annoy.
But that was one of the key problems to solve.
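A minimal sketch of that startup pattern with the real Annoy library; the dimensionality and vectors are made up:

```python
# pip install annoy
from annoy import AnnoyIndex
import random

DIM = 64                              # embedding dimensionality (assumed)
index = AnnoyIndex(DIM, "angular")    # angular distance ~ cosine similarity

# In the setup described above, these vectors would be read from a file
# shipped inside the container; random vectors stand in here.
for item_id in range(10_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(DIM)])

index.build(10)   # 10 trees: more trees give better recall, more memory

user_embedding = [random.gauss(0, 1) for _ in range(DIM)]
print(index.get_nns_by_vector(user_embedding, 10))  # approximate top 10 items
```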
There are a number of key problems which Annoy doesn't solve, and a lot of the others don't either, like being able to distribute the search. Being able to do filtering is a really hard problem, which none of the open source indices can do. The proprietary ones, I believe Pinecone, Weaviate, and Redis, can all do it. But there are a number of hard problems. These are index problems; these are database problems. It's a specialized index, and you either need to build a database around the specialized index, or you have to fit the specialized index into existing databases. The problem is that the existing algorithms, like the most common one I see now, HNSW, don't really play well with how databases are architected.
So the algorithms kind of have to be tweaked.
So we have to find an algorithm that has similar quality,
but also like kind of has similar characteristics
in how you scale it out as you would find in like a B-tree or whatever.
So yeah, anyway, that's the long answer for why vector DBs exist. They're definitely solving a real problem. Now, one thing that remains certain: people will be using embeddings. That's not a question, to me at least.
And the nearest neighbor lookup, the approximate nearest neighbor lookup, that's not going away. That's really common, and it needs a special index, no question. One misconception the market has, which I learned recently, is that people think vector databases are just a place to cache embeddings, which is not true. I mean, you could do it that way, but at that point it's just a list; you could put it in Redis. The thing that makes it special is that index, that nearest neighbor index.
So, yeah.
I'll pass it back on that note.
Yeah. No, makes total sense. And how does this fit into a system like Featureform, which is a feature store, right? So we have feature stores, and we also have vector databases. There has to be some kind of relationship. For someone who has no idea what we've talked about so far, I think the first signal that there is some relationship is that we have features everywhere; we need them to build the embeddings themselves. So how do feature stores work with vector databases? And let's stick with Featureform, the product that you built, right? How is it architected, and how does a vector database fit into that?
Yeah, one thing that makes Featureform unique, and this is not true of literally any other feature store, is how we work: we call ourselves a virtual feature store.
The virtual means that we sit on top
of your existing infrastructure.
We essentially turn what you have into a feature store.
And so we end up being this kind of framework layer
that data scientists love to use,
but also allow you to take full advantage
of all the infrastructure you have underneath.
Now, from our perspective, it's very common for some of our bigger clients to have some features in Redis, some features in Cassandra, some features in Mongo, whatever. They might have a variety of different places. They might have some things built in Spark, some built in Snowflake. And Featureform works really well in those situations because it sits on top of all of it.
And it provides one unified abstraction to define the features, to manage them, and to serve them.
Now, from our perspective, a vector database is just another kind of online store, what we would call an inference store. It's just another place to store features, and an embedding, again, happens to be a specific type of feature.
And it has this new operation for lookups,
which is nearest neighbor lookup.
So we just need to provide both of those operations.
The other thing that we do is we orchestrate transformations. As a data scientist, you define your transformations in our framework, which, again, will be 99% the same code you would write: the same SQL query, the same PySpark, Pandas, whatever. And then we have this kind of function that you put it in, and you might give us metadata, like a name, a version, a description, an owner, and a lot more if you want. You can set a schedule, and we will orchestrate those transformations for you.
Now, from our perspective, a transformer, especially a pre-trained transformer, even an LLM, is just a new type of transformation. It takes an input, which is text, and it outputs an embedding, which is just a feature, a special type of feature, like I keep saying. And that type of feature you typically want to store in a vector database if you want to do nearest neighbor lookups. You don't have to; if you're just going to do key-value lookups, you can put it wherever. But if you're doing specifically a nearest neighbor lookup, we put it in a vector database.
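A minimal sketch of that flow, treating a pre-trained transformer as the transformation whose output lands in a nearest-neighbor index; the model name and texts are assumptions, and this is not Featureform's API:

```python
# pip install sentence-transformers annoy
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings

docs = [
    "How do I cancel my subscription?",
    "What songs are trending this week?",
]
embeddings = model.encode(docs)  # the transformation: text in, embedding out

index = AnnoyIndex(embeddings.shape[1], "angular")
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)

query = model.encode(["unsubscribe from the service"])[0]
print(index.get_nns_by_vector(query, 1))  # nearest doc id, likely [0]
```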
So Featureform is this workflow tool. It's the tool that encodes the feature workflow on top of existing infrastructure. And the vector database is a tool that provides this new specialized index, which happens to be a nearest neighbor index for embeddings. Transformers are a special type of transformation, which happens to create embeddings and happens to be a model itself. And an embedding is a specialized type of feature that has all the characteristics I touched on. So that's how they all relate to each other. I should create an embedding graph of all those concepts, and we can look at it.
Absolutely, yeah.
I think we should do that.
All right, cool.
Whenever we talk about the infrastructure side of things, I hear you mentioning a lot of more traditional technologies that have been used: Cassandra, Spark, Redis, all these things. But today we also have all this craziness with these huge large language models, the OpenAIs and so on, that can do all these crazy things. Before we get to how that changes things, though: is there a distinction between ML and AI? I was wondering about that myself. I use these terms in a very mixed way many times. I have a distinction in my mind of what the difference is, but I don't think it's explicit, and I'd love to hear your opinion on that. Then we can continue on how things will change because of all this.
Yeah.
So there is a difference in what they mean, but like most terms, it's really about what they mean in practice. And historically, pre-LLM, honestly, my take was: if you said AI, you didn't know anything about ML; if you said ML, you knew about ML. AI was the hand-wavy way of saying it, and ML was much more concrete. Over time, I think people have associated AI with the foundational models, the LLMs, the GPTs, and ML with what I'm calling traditional ML. Whether it's really machine learning, and whether GPT is intelligent, is maybe another question for another time. But in practice today, I think it's an accepted thing that AI means that class of model and ML means everything else. And I think that's totally fine and fair, because they're different. The way you use them, the way you think about them, the way you interact with them is just so different. It's not just ML; it has to have its own term. I prefer foundational models, but AI is fine.
Okay. Yeah. Makes sense.
I love the answer. Very helpful for myself also, to be honest. All right, so how will things change in the future? First of all, do you see feature stores, and Featureform specifically, changing their roadmap because of these foundational models? If you reflect on how you were thinking a year ago about how Featureform was going to be successful, has this changed, and in what way? How do you see things changing?
Yeah, I think things are changing very rapidly. It's kind of insane how fast. How I would frame this new class of model: it's almost like the straw that broke the camel's back. You could argue that there's a model most data scientists are aware of called BERT, which also had a line of models that came before it, like ELMo, funny enough, that really brought this new type of transformer into the hands of most data scientists and gave anyone, quickly and easily, the ability to have state-of-the-art NLP with this kind of generic model. So for me, that was a very magic moment. I can't even remember when that came out. It must have been like five years ago now. And now, with this GPT-4
stuff, GPT-3, and ChatGPT, I think what's happened is, even though we're using a lot of the same techniques, obviously much more specialized, much more nuanced, better, and we continue to get better, it's the same, let's call it, category of solution as a BERT. But now the big difference is that as an end user, even my grandma looking at it, it feels like it's past that line. It feels real now. It doesn't feel like a research project that's interesting. It has finally crossed the line where it's good enough to use in way more situations than an older model like BERT could be used in. I don't know if it's past the Turing test or whatever, but it's getting there, in that if you interact with ChatGPT, even if you know you're talking to the AI, it feels good. It's not obvious, it's not like, wow, this thing's really bad; it feels pretty solid. And I think that's been the big change. We're finally at the point where a lot of use cases and problem statements that have been unreachable and unattainable suddenly opened up. Now, I'm sure as you've seen, a lot of the application-layer products built on top of GPT look very similar.
They're all solving very similar problems. That's because, in my opinion, prompts are kind of the wrong tool for the job. I made this joke earlier with someone: there's this thing in evolution and biology where the crab, the animal, has been created evolutionarily many different times from many different species. And the joke is, you know, the crab is the global minimum; the best thing we can come up with is the crab, it's perfect. So that was my joke, and I joked the same thing about SQL: prompts are wrong, and in the end we're going to have this SQL-like language to interact with these models. And I think that will happen.
I also think what's very likely to happen is embeddings, which we've talked about. I talked about how I used to perform all these operations on them that weren't just nearest-neighbor lookups. And I think that embeddings are a much more natural kind of intermediary to use and build upon to build much more complicated applications.
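As an aside for readers: here is a minimal sketch, using made-up NumPy data rather than anything from Featureform, of what it looks like to compute on embeddings as an intermediary instead of only doing nearest-neighbor lookups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 64-dimensional item embeddings, one row per article.
item_embeddings = rng.normal(size=(1000, 64))
# Normalize rows so dot products act like cosine similarity.
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

# Operation 1: build a user embedding by averaging the embeddings
# of the items that user actually read.
read_item_ids = [12, 87, 431]
user_embedding = item_embeddings[read_item_ids].mean(axis=0)

# Operation 2: nudge the user vector toward a topic vector before scoring,
# something a plain nearest-neighbor lookup can't express.
topic_vector = item_embeddings[600]
query = 0.8 * user_embedding + 0.2 * topic_vector

scores = item_embeddings @ query        # similarity score for every item
top_5 = np.argsort(-scores)[:5]         # indices of the 5 best matches
print(top_5)
```

Both operations work because the embedding space itself carries meaning; the lookup is just the last step.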
I think the reason why lots of the AI applications people are seeing all look the same is because the API that they're all based on is so simple. It's just text in and an output. And all they're doing is coming up with interesting prompts, essentially a template that they put together to try to make it do what they want. And I think that won't exist soon. So I think what will happen is that they will take GPT, and it will still have that interface, but they will likely expose a lot of the transformers that they have underneath the hood and allow you to use them for a cost.
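For readers, a toy illustration of the "interesting prompt" pattern Simba describes above; the prompt wording is hypothetical, and the actual call to a hosted model is omitted since it depends on the provider.

```python
# The application layer is often just string interpolation around a
# text-in/text-out model.
PROMPT_TEMPLATE = (
    "You are a support assistant for {product}.\n"
    "Answer the customer's question in under 100 words.\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(product: str, question: str) -> str:
    # Most of the "application logic" lives in how this template is filled in.
    return PROMPT_TEMPLATE.format(product=product, question=question)

prompt = build_prompt("Featureform", "How do I register a new feature?")
print(prompt)  # this string would then be sent to the model endpoint
```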
And embeddings, again, become this core piece of ML. In NLP, this has been true for years, but I think this will make it true for a lot of different parts of ML, like recommender systems and other places where we've been using them, but this, I think, will become much more powerful.
And the vector database will be a core piece of that.
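To make the vector database's role concrete: a bare-bones sketch of the lookup such a system serves, on random data. Real systems (FAISS, Milvus, and the like) add approximate indexes, filtering, and persistence on top of this idea.

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 10) -> np.ndarray:
    # Assumes the rows of `index` and the `query` are L2-normalized,
    # so the dot product is cosine similarity.
    return np.argsort(-(index @ query))[:k]

rng = np.random.default_rng(1)
index = rng.normal(size=(10_000, 128))
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = index[42]                  # pretend this is a freshly computed embedding
print(top_k(query, index, k=5))    # item 42 itself should rank first
```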
The feature store on top of that, it's a different workflow, but it's not that different.
Again, embeddings are just a specialized type of feature. A transformer is a specialized type of transformation. You can fit them together. It's very common for us to take embeddings and use them as inputs for other models. Rather than using Costas as an identifier, like a user ID, I would just take his embedding and put it into the model to do a ranking step. So now I have these really generic embeddings that work super well, and I can just feed those in as well. But traditional ML won't go away, especially if you start thinking about all the specialized use cases people have, fraud detection, etc. You can't just run that through an LLM. It just doesn't really fit that way. You can use the embedding intermediaries that the LLM has and sprinkle that on your fraud detection to make it better. And that's what I think we'll start to see happen.
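A hedged sketch of "embeddings are just a specialized type of feature": a precomputed user embedding is concatenated with an ordinary tabular feature and fed into a downstream classifier, a stand-in for the ranking or fraud model. The data, labels, and model choice here are illustrative, not Featureform's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

n_users, dim = 5_000, 32
user_embeddings = rng.normal(size=(n_users, dim))           # from an upstream model
tx_amount = rng.exponential(scale=50.0, size=(n_users, 1))  # a classic tabular feature

# The embedding is just more columns in the feature matrix.
X = np.hstack([user_embeddings, tx_amount])
y = rng.integers(0, 2, size=n_users)                        # fake fraud labels

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:3])[:, 1])                     # fraud scores
```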
So it'll kind of be this joining of the two. And Featureform remains; there's still a data science workflow. It's not like data scientists are going away. So we'd remain that workflow layer that data scientists interact with, covering all of our features, both embeddings and non-embeddings.
Makes a lot of sense.
All right.
We could continue this conversation for many hours, which I think means that you have to come back.
Of course. I'm happy to.
And we will do that.
But now I'll pass the microphone back to Eric,
because we are closer to the buzzer, as you say. I did that again, Eric, I stole your... Wow, Costas! Yeah, wow. Yeah, you stole my line. That's okay, though. Okay, Simba, this will probably turn into two questions, but I believe what you're saying about the way that things will play out. But in most businesses, it's the business logic that is non-predictive, right? So we're talking about sort of basic business logic, say for a key KPI or something; it doesn't rely on ML at all,
that leads the decision making. And when you were talking about the bleeding edge of subscription machine learning models, it really seemed like you were knocking on the door of machine learning helping drive business logic.
But I really still see a gigantic gap there in that core KPIs are going to drive the business.
And machine learning is still really early.
And ChatGPT is exciting, but there are all these other components, right?
So help me understand, from your perspective: I believe that there are recommendation models and features that can really help lead the business logic, but it seems like most businesses are going to lag behind. So how do we cover that gap?
I think, I don't know if I can answer how to cover the gap, just because it's almost like asking how old enterprises modernize. And I think that's kind of been an age-old question that I'm not going to have an answer to.
I want to maybe dive into part of the question, which is something I just want to highlight: lots of my engineers at Featureform use ChatGPT a lot. And it's become really useful, and I use it too. I'm writing a blog post and I'm trying to think of something. I'm kind of stuck on this paragraph.
It's taking too long.
I will essentially ask ChatGPT, hey, describe this for me. And I'm like, no, that's not right. But it's easier for me to take something and see why it's wrong than it is for me to come up with it out of thin air. So I think what it does is it enables people, especially people who are making decisions as their job, to do it better, because they have this kind of machine to get drafts from, which isn't always right, but it always has an answer. And the answer is never extremely stupid; most of the time it's directionally right, and you can use that and feed it in, like a feature for your own brain, so that you can make the best possible decision. I think that's what we'll see happen. I think we're already kind of
seeing it happen. And on the metrics piece: people are still making these decisions. We're just not at the point where we look at what the model is saying and go, oh yeah, we're all out of a job, everyone. I had a podcast where someone joked that next time maybe we'll just have our LLMs talk to each other and we'll just sit back and have a beer. So we're not there yet, but I do think
that we are definitely at a point where it is a multiplier effect. And data, and ML, has always been a multiplier effect. That's the whole pitch: software was this multiplier effect of productivity per person.
I could write one line of code and it automatically scaled across a hundred million users.
A hundred percent.
Yeah.
Now it's that, but it compounds; that's what data does and ML does. And then LLMs are just this newest lever that takes full advantage of that and maybe creates some other new kind of third-order growth activity. Yep. I love it. Well, Simba, this has been
a wonderful episode; as Costas said, there's so much more to talk about. So we'd love to have you on again, but thanks for giving us some of your time. Of course. Thank you. This was a lot of fun. Wow, Costas, what a fascinating conversation with Simba from Featureform.
I feel like the conversation
spanned a much larger
footprint than just
features and even
MLOps. I mean, we talked about
so many different things.
But I think what I'm going to
take away is that
his background in trying to understand
how to create a great moment for a user,
very clearly influences the way that he thinks
about building technology that ultimately materializes
into data points.
Of course, we can call those features.
There's embeddings and there's all sorts of technical stuff.
But it's very clear that Simba is building a technology that will enable teams to use
data points that create really great experiences.
And I think that comes from him facing the difficulty of trying to understand why, of the millions of visitors, only a handful of people will subscribe.
And that to me was really refreshing because MLOps is a very difficult space.
Feature stores and all of the surrounding technology can be very complicated.
There are a lot of players.
But it's clear that Simba just wants to help people understand how to drive a great experience
using a data point that happens to be derived, that happens to rely on a lot of data sources, and that happens to need to be served in a very real-time way. But to him, those are consequences.
Yeah, a hundred percent. I mean, okay, first of all, Simba is a person with a lot of experience, right? He has experienced many different phases of what we call ML or AI.
And he has done that in a very production environment, right?
So he has seen how we can build actual systems and products and deliver value with all these technologies, which is obviously something very important
for him today as he's
building his own company. And I think it's
like an incredible
advantage that he has.
We didn't talk that much about the developer experience, and maybe this is something that we should have as a topic for another conversation with him: how all this complicated infrastructure, with all these different technologies and all the stuff that we discussed, can deliver an experience to the developer who works with all of it and make them more productive. But
what I'll keep from the conversation that we had with him is that he gave an amazing description of what features are, what embeddings are, how they relate to each other, how we go from one to the other, and how we use them together.
And, most importantly, how all of these will become some kind of universal API for ML- and AI-driven applications in the near future. I'm not going to say more about that, because I want everyone to listen to Simba.
He's much, much better than me
talking about that stuff.
But there's a wealth of very interesting information
around all the things that are happening in the industry today and will happen in the next couple of months.
So, yep.
I agree.
I think if you want to learn about features,
there's actually way more in here.
And I think you'll learn about the future
of what it looks like for MLOps
and actually operationalizing a lot of this stuff.
So definitely take a listen.
If you haven't subscribed, definitely subscribe,
tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.