The Data Stack Show - Data Council Week (Ep 2) - The Convergence of MLops and DataOps With Team Featureform
Episode Date: April 24, 2023

Highlights from this week's conversation include:
- Introducing the team from Featureform (0:31)
- Doing the work vs. leading the work (3:01)
- The difference between MLOps and DataOps (7:06)
- The MLOps cycle (10:12)
- What is Featureform and what makes it different? (13:30)
- Is there another layer needed in feature stores? (18:46)
- Getting in touch with Featureform (23:55)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
All right, what's up everybody? This is Brooks.
I am usually behind the scenes on the show running things for Eric and Kostas, but we are live at
Data Council this week. And Eric, unfortunately, he's fine, but got in a biking accident and wasn't
able to make it. So I am filling in for Eric, and we have an awesome group of folks here.
We have Shabnam, Makiko, and Simba from Featureform.
And we are super excited to chat with you guys
here live in person at Data Council.
So would love just quickly to go around.
Shabnam, we can start with you.
But we'd love to just go around
and hear a little bit about your background.
Yeah, I guess I have an interesting background in tech. Well, I'm currently COO at Featureform,
but got my first job in tech working at Slack. I joined them in 2015 when they were still pretty
small, working in global biz ops. I then decided to blow up my career and pivot to software
engineering and did that for a couple
years before meeting the wonderful Simba, and here I am at Featureform.
Hey, yeah, so I joined Featureform last October as head of MLOps, so my focus is entirely on, number one, helping users, specifically data scientists and ML platform engineers, develop and deploy the platforms and systems to make sure that production ML is going well for them.
A lot of times that means working with everyone from early-stage startups to...
We have some pretty large international enterprise users at this point. But prior to joining Featureform, I was actually an ML platform engineer over at Mailchimp, which was acquired by Intuit.
And I've held various roles either as a data scientist or as a platform engineer, even
as an analytics engineer way before the term came into vogue.
So really working with folks up and down the data science machine learning value chain.
Cool.
So you're up.
Yeah.
And they both didn't mention it,
but they're both bootcampers.
We just realized right now,
a lot of our execs came out of bootcamps.
So yeah, myself, I am maybe a little more boring in that I went to UC Santa Cruz for CS. When I was at Google, for not very long, I immediately was like, is this what I'm supposed to do for 40 years? And I realized that's not the life I wanted for myself, and left to start my first company without any idea, like, no product. And that was my last company, so I ended up being modestly successful. We were handling over 100 million MAU at our peak, doing personalization, predicting subscription, predicting churn. And we built all this, what we'd now call MLOps, in-house. And one thing we built, which nowadays we call a feature store, was so pivotal to us, and I realized that there was such an opportunity there. So that kind of became the foundation of what is now Featureform.
Cool. What's it been like going from kind of doing the work to helping others do the work?
Has there been any like surprises there?
Yeah.
And the irony, of course, was that last year, especially with the acquisition of Mailchimp, that was like the third or fourth acquisition I had gone through. Not at that company, but just across my career.
So I had been kind of figuring out what my options were.
I knew I wanted to continue to build stuff.
I knew I wanted to have an impact.
Something that I was a little bit frustrated with, and that I sympathize with a lot of our enterprise users about, was the feeling of your work not mattering, of having no impact on either the overall architecture or the direction of your company. So I was thinking, wait, do I go join a startup? Do I go work as a consultant for Google Cloud or AWS? And I was famous for saying I am never, ever going to go work for an MLOps company. I'm certainly not going to become the feature store girl. And, you know, I had to eat my words later on. But it's been really
delightful. I think sometimes in like the MLOps or the DataOps space, there's a little bit of this love-hate relationship with vendors.
And I 100% understand that now, from both sides of the table.
But the reason why I joined Featureform was because I thought it was such a cool project.
We were trying to implement really similar stuff over at MailChimp and trying to figure out like, how do we actually make data
scientists more successful without like mandating this really constrictive, like single path to
production. And I felt like a lot of the solutions out there, you know, for better or worse, we're
not quite like meeting the gap that feature form sort of fills. And so for me, it was like, okay, I'm going to go,
I'm going to leave MailChimp to go join a project that I really want to like
help see and grow, not just on its own,
but also in creating with other like MLOps projects and like open source
communities and vendors.
Cause I feel like we need that unification, frankly.
Can you unpack the love-hate relationship, and kind of speak to it maybe from one side, like, why is there the hate? And then, maybe now that you're on the vendor side, how has your perspective changed?
Yeah, absolutely. So as a practitioner, I know, so I am a boot camp grad, right?
The thing that I think is really cool about Featureform that I only realized recently is that all of us are UC grads, University of California grads. So we all went to public university, which is fantastic. We also went to the surfer schools, oddly enough, like UC Santa Cruz, where a bunch of folks went. I went to UC San Diego. But we were all kind of bootstrappy, you know. And as a practitioner, I felt like a lot of times I had to kind of piece together my own stack. And frankly, a lot of vendors or projects or open source tools weren't really helping with that. They were really good at just this one thing, but then they weren't
thinking about the broader workflow that I was getting involved in as a data scientist.
And then when I became a platform engineer, I'm like, oh, actually, this is very hard and very, like, this can be very complicated.
But as a platform engineer, to a certain degree, you have to be informed and be opinionated.
And you really have to understand the data science user.
And hopefully that's kind of the perspective and empathy I can bring to Featureform.
We already have a culture of empathy and user centricity.
I think that is very well supported by Shabnam and Simba, especially since Shabnam came from Slack, where they were all about that user-centric experience.
But I hope I just bring more of that data scientist flavor to it and the awareness of their pain points, for sure.
Yeah, that's really cool.
I guess so.
So, all right.
Let's start with, I have a question about MLOps.
Okay, and I want to ask you, what's the difference between MLOps and DataOps?
And I'm asking this because I think we've mentioned both terms so far. So what's the difference? Why do we need both of them?
Yeah, I like to frame the difference not between DataOps and MLOps, because that's kind of what the vendors have called themselves. I think that there is a difference between metrics and features. And a metric is something, you know it's a metric if you're using it in a spreadsheet, if you're using it in slides, if you're using it in a BI tool. Those are metrics.
And metrics have the characteristics that they're typically, there's a correct answer,
right?
Like there is your MRR last month.
That is like, you might not know what it is and you might have the wrong number, but there
is somewhere in the space of possible numbers, the right number.
They're typically relatively slow moving compared to features. You're not experimenting to try to find the right goal. You're not like, oh, I wonder if I could just frame MRR this way. There's kind of a right metric. So the problem space is very different. And the tooling, like if you think of a dbt or something, it's all about templating. It's purposely putting guardrails in place while also making it really easy to make forward momentum. If you think about features, on the other hand, features have the characteristic that they're used by models, and so they're used in both training and perhaps in inference, depending on where the model is being used. When you're doing feature engineering, iteration is much more random, for lack of a better word.
Like you're trying all kinds of different things.
There's no like straight line.
You'll do weird transformations like, hey, what's my MRR, but cut all users that pay us less than $1,000 a month.
Why? Because it makes our model better.
Why does it make our model better? Who knows?
It just does. And so there's a lot more movement and iteration. The characteristics you'd like are different, and the use case of the person, like a data scientist doing ML, they inherently think differently, and their problem set is different. You want a whole different application layer than you would want for what we call DataOps tools now. I also think that, in general, if you look at, let's say, an orchestrator, and you call it a DataOps tool, I think in practice it's not necessarily cleanly a DataOps tool. I think there's almost this whole topology that we haven't figured out yet. Some tools will kind of cross the chasm of it, but they won't really be called DataOps tools. There's almost an analytics ops that's missing, and there's MLOps and feature ops, which we're kind of a part of, and there's almost this generic thing that lives underneath. So I just think we don't have the topology down.
I also think there's a bit of Conway's Law. Like, if we go forward five, six years from now, there will be some number of tools and startups that still exist. Those tools and startups will be the topology. Was it the best one? Doesn't matter. It's like, what's left, you know?
Yeah, yeah.
Makes a lot of sense.
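Simba's metric-versus-feature distinction can be made concrete with a small sketch. The subscription data and the $1,000 cutoff are hypothetical, echoing the MRR example above; this is an illustration, not anyone's production code.

```python
# Hypothetical subscription data: (user_id, monthly_payment_usd).
subscriptions = [("a", 50), ("b", 2000), ("c", 500), ("d", 1500)]

# A metric: MRR has one correct answer and lands in spreadsheets, slides, BI.
mrr = sum(amount for _, amount in subscriptions)

# A feature: an experimental variant of the same quantity, consumed by a model.
# "What's my MRR, but cut all users that pay us less than $1,000 a month?"
# Why this cutoff? Only because it makes the model better.
mrr_over_1k = sum(amount for _, amount in subscriptions if amount >= 1000)

print(mrr, mrr_over_1k)  # 4050 3500
```

The point is that the second number has no "correct" value; it is one of many arbitrary transformations a data scientist might try during feature engineering.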
And what's in the MLOps product space, right? Like, we have obviously feature stores here, which we are going to talk about, but what else is there that is unique to MLOps, right, and not a crossover between DataOps and MLOps?
So I'm going to badly paraphrase Makiko's great talk, which I'm sure we're going to post soon.
But there's four stages of the ML lifecycle. There's data, let's just call it data. There's experimentation, or sorry, data analysis. There's feature engineering. There's feature serving. That's all kind of one stage you'll be in, and that's the data stage.
The next stage we would call the training stage where you're very much training the models.
You are hyperparameter tuning.
You're really kind of experimenting and iterating on the model itself.
Once you have a trained model, you get to the deployment phase. And this is where, cool, I have a trained model I need to put in production. You
might do canary tests. You might do like kind of all these things we're used to doing for services.
Finally, it's in production, like the fourth stage and final stage. It's not final. I guess
you kind of go back and forth. But the fourth stage is evaluation. And in the evaluation stage, you're taking all the information on how the model is doing.
And you might have to go back into other parts of the process, make the model better.
You're constantly iterating.
It's a constant cycle.
The model is never perfect.
It might be good enough, but those things always exist.
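The four-stage cycle Simba describes (data, training, deployment, evaluation, and back again) can be sketched as a loop. This is a toy illustration with made-up stand-in functions, not Featureform code; the "model" is deliberately trivial so the cycle itself is visible.

```python
import random

random.seed(7)

# Toy data: one score per user; ground truth says users above 0.6 churn.
scores = [random.random() for _ in range(200)]
data = [(x, x > 0.6) for x in scores]

def engineer_features(rows, cutoff):
    # Data stage: feature engineering (here, a single threshold feature).
    return [(x >= cutoff, label) for x, label in rows]

def train(rows):
    # Training stage: this toy "model" just passes the feature through.
    return lambda feat: feat

def deploy(model):
    # Deployment stage: canary tests, rollout, etc. would happen here.
    return model

def evaluate(model, rows):
    # Evaluation stage: measure accuracy in "production".
    return sum(model(f) == label for f, label in rows) / len(rows)

# The cycle: go back and iterate on the feature until the model is good enough.
best_acc, best_cutoff = 0.0, None
for cutoff in (0.2, 0.4, 0.6, 0.8):
    rows = engineer_features(data, cutoff)
    model = deploy(train(rows))
    acc = evaluate(model, rows)
    if acc > best_acc:
        best_acc, best_cutoff = acc, cutoff

print(best_cutoff, best_acc)  # the 0.6 cutoff recovers the ground truth exactly
```

In real systems each stage is its own tooling (orchestrators, experiment trackers, CD, evaluation stores), but the never-finished loop is the same.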
On the MLOps side, we've kind of broken it down into, let's call it, abstraction layers. I'll go through all of them; there's actually five that we've named. At the very top, we call it the platform layer. The platform layer is what a data scientist interacts with; it kind of goes across everything. And you can think of it this way: DataOps has no idea what a model is. It has no idea what evaluating a model looks like. So on the MLOps side, you need a platform that really understands the ML lifecycle from end to end, and it's the one unified layer. Underneath it, there's the workflow tool for data, which would be what we call a virtual feature store, which is kind of orchestrating the infrastructure, almost the application layer that ML data scientists use to interact with their Spark, Redis, you name it. You need your experiment tracker, think MLflow, Weights & Biases, Comet. There is your deployment workflow, which, again, you can even think of like Spinnaker; it's kind of like the CD for models. And finally, there's the evaluation store, which is taking the metrics that you're storing and allowing you to evaluate; it's kind of the Kibana, in DevOps terms. So you almost have the Spinnaker, the Kibana, the data orchestrator. There isn't really a good analogy for training; I guess it would be like a build tool, maybe like the CI. And then beyond that, there's still training services. You need something that actually trains the model. You need something that actually serves the model. You need something that actually stores data and transforms data; that could be the Spark and Redis. You need something that actually collects the logs. So there's a lot. This is a very long answer to say that there's a very wide space and a lot of problems to be solved.
And that's just one view of it. There are probably a hundred vendors that don't fit into the framework I described that are still very valuable tools.
A hundred percent.
So what is Featureform?
Yeah.
Featureform, we call ourselves a virtual feature store.
Okay.
We're an open source product, so you can go check us out on GitHub. We are a place for data scientists to define, manage, and serve their features. And to imagine that, if you're a data scientist listening to this, I bet you right now there is a notebook that you have been using for work that's called Untitled_118.ipynb, that you've been copying and pasting from. You have some Google Docs full of useful SQL snippets. You know that there's one person in the company that knows how training set N was built, and you have to go Slack them tomorrow so that they can remind you where to find whatever data. There's so much ad hoc-ness that comes in the process of features. It's just completely made up.
What we've figured out, I think, and what a lot of people have focused on, is getting the right data infrastructure tools. Like, we know how to handle a lot of data. So what I would call the platform problem, scaling compute, having low latency serving, having high throughput, we've solved that problem, I think, pretty well. What we haven't solved is making that a product that is usable and valuable to the data scientists interacting with that layer. So the virtual feature store you can think of as an application layer over your existing data infrastructure, which provides that versioning, access control, governance, a nice API for serving, everything you would need around features. That whole workflow and orchestration, Featureform does using your existing infrastructure.
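As a sketch of the "application layer over existing infrastructure" idea Simba describes, here is a toy virtual feature store: definitions, versions, and ownership metadata live in a registry, while storage and serving are delegated to whatever provider is plugged in. Every name here is hypothetical; Featureform's real API differs.

```python
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    name: str
    variant: str          # versioning: "v1", "quantile_v2", ...
    transformation: str   # e.g. a SQL snippet that runs on your own warehouse
    owner: str            # lightweight governance metadata

class VirtualFeatureStore:
    """Holds metadata only; actual storage and serving are the provider's job."""
    def __init__(self, provider):
        self.provider = provider  # e.g. a wrapper around your existing Redis
        self.registry = {}

    def register(self, feature):
        self.registry[(feature.name, feature.variant)] = feature

    def serve(self, name, variant, entity_id):
        if (name, variant) not in self.registry:
            raise KeyError(f"unknown feature {name}:{variant}")
        return self.provider.get(name, variant, entity_id)

class DictProvider:
    """Stand-in provider; in practice this wraps Redis, Postgres, Cassandra..."""
    def __init__(self, values):
        self.values = values
    def get(self, name, variant, entity_id):
        return self.values[(name, variant, entity_id)]

provider = DictProvider({("avg_order_value", "v1", "user_42"): 37.5})
store = VirtualFeatureStore(provider)
store.register(FeatureDefinition("avg_order_value", "v1",
                                 "SELECT AVG(total) ...", owner="ml-team"))
print(store.serve("avg_order_value", "v1", "user_42"))  # 37.5
```

Swapping `DictProvider` for a Redis- or Snowflake-backed one changes nothing above it, which is the "nice clean puzzle pieces" point made later in the conversation.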
Okay.
And, okay, feature stores is a term that has been around for a while now in the tech industry, at least. So what is different between Featureform and what came before, right? Like the other feature stores out there. Okay, I got a hint from you, it's like, virtual, but I think it would be great to understand what it means, like, virtual, compared to the existing solutions out there, right?
Yeah, so we've broken down feature stores into three categories. We call them the literal, the physical, and the virtual. I've talked about the virtual. Let me talk about the other two.
The literal feature store literally stores features.
So if you think, there's a handful, I mean, even SageMaker's feature store and Vertex's feature store kind of follow this architecture, where you create your features elsewhere and then you finally store them in the literal feature store. The value prop is all your features are in one place, and that place can serve for both training and inference. Now, the thing that we believe is missing in that is: your untitled notebook is still there. You're still interacting with Spark. All that's changed is rather than writing to Redis and S3, you're just writing to one place. There's definitely some value there, but we think it misses the main pain point that data scientists have.
Versus the physical, which kind of does it all. It does what we're describing, the transformations, the storage, all of it, but it does it on its own infrastructure. So it's really kind of this heavy tool; they sometimes call themselves a feature engine. And this is kind of what we built at our last company. You can look at some of the internal ones, like Michelangelo and Zipline; they kind of follow this. It's really heavy to implement, it kind of replaces existing infrastructure, and it's really expensive. You have to write all your features in the new DSL that they've created. It's doing everything, at the cost of super high adoption costs, sometimes impossibly high adoption costs. Like, it's impossible if you're a large bank to get all of your data through one place and process it all in one place that some startup is running. So that's one thing. And the other thing, beyond adoption costs, is there's kind of this lock-in. Like, for example, if that physical feature store doesn't do streaming in the way you'd like, or can't handle some transformation you want to do, you're out of luck. The virtual, just to complete it, a little bit of a repeat from before, is what would happen if you took that physical feature store, chopped out all the actual processing and storage, and provided nice, clean puzzle pieces for you to plug in your existing infrastructure. And what happens in reality is it's actually like the physical feature store and much more.
Because we have customers who are like, yeah, we have multiple Spark clusters. We have Redis for this team, and we have Cassandra for this team. And you kind of get this data meshy, heterogeneous infrastructure thing that happens, which is more true to form for enterprise, while having one unified application layer.
And I actually think that's the future for a lot of this stuff.
Choose the right data providers to get the characteristics you need from your platform,
and you need the best-in-class API.
Featureform's name is actually a nod to Terraform. And we kind of have this idea where it's like, yeah, Terraform isn't the cloud provider, right? What Terraform, or HashiCorp rather, does is it has the best API. Would you ever build Terraform yourself? No. Why would you? It's come to this perfect API that fits well for the use case. So our goal at Featureform is to build the best possible APIs that your data scientists love to use, while giving you the flexibility to get all the other characteristics you need.
Yeah, that makes total sense. And one last question for me. You mentioned features, and that features are different from metrics. Do we also need a different storage layer for features, like database systems different from the ones we are using today? Or is it the same data lakes, the same OLTP databases, or things like Redis caching layers, that we use to also store the features?
So underneath, typically you can plug in whatever you want. We have people use Postgres, we have people use Oracle, which is a very common one, probably one of the most common for inference. For the offline store, it could be S3, it could be HDFS, it could be Snowflake. We think that those companies are amazing at solving those problems, like being able to have really low latency reads, or whatever latency you need, and balancing costs. That's what they do all day long, that's what they think about. So why beat them? If we try to play that game, we're going to lose. We're going to build a worse Redis. Why build a worse Redis? Let's just let you plug in Redis. And even managing it, we're not going to manage Redis as well as Redis Labs, or as well as the company itself doing it. So, you know, that's yours. We'll just do the application layer. One thing I'm going to add, it's not what you asked, but I think it's interesting: one other type of database that we think a lot about is the vector database.
That would be my next question, actually. What is a vector database, and how does it differ from a feature store?
Yeah, so to
understand the vector database, you sort of have to understand this concept of an embedding. And you can think of an embedding as, well, it's literally a vector. Literally, it's n floats that we treat as a vector, and you can actually think of it as plotted in space. So if you have a 3D embedding, it's literally a point in 3D space. How we create that embedding, I'll spare you, but what you can think of is, let's say I'm building a recommender system, and this is literally what we did. I could go and look at Kostas's buying history and my buying history, and then all of our buying histories, and I can take it, and magically, I have this thing that's called a transformer, that magically takes that and turns it into a vector. That vector is a holistic representation of our buying tastes.
I can actually plug it into other models and use it as, rather than putting your name down, your ID down, I can put a vector, and it's going to be a more dense representation of you. There's also another interesting characteristic that happens, where users of similar buying behaviors are near each other in latent space. Which, again, if it's a 3D embedding, they would literally be close to each other in space. So there's that unique property: things that are similar, in behavior or whatever the latent space captures, if it's users, it's behavior; if it's text, it's text viewed as similar according to this transformer. And you need something that can quickly get nearest neighbors. It's like, hey, I want to make a recommendation, I want to go and pull the ten closest items to this item so I can show similar items here. So it's an index, and these indexes have existed for a while, because the exact nearest-neighbor problem is relatively slow; you kind of have to brute force it. You can get approximate nearest neighbors relatively quickly, faster for sure. And those indices have existed forever, not forever, but for a long time. Like, I remember building, I built a vector database myself probably four times in my career now. And what was missing was everything else that makes a database a database, right? Like, a database isn't just an index, or an in-memory index. So the vector database takes this index and builds everything else to fit it into being a database.
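The brute-force nearest-neighbor search Simba mentions, the thing a vector index speeds up, can be sketched in a few lines. The item embeddings here are made up for illustration.

```python
import math

def euclidean(a, b):
    # Distance between two embedding vectors of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, embeddings, k):
    """Exact (brute force) k-NN: compares the query against every vector,
    which is why vector databases use approximate indexes (HNSW, IVF, ...)
    once the catalog gets large."""
    ranked = sorted(embeddings.items(), key=lambda kv: euclidean(query, kv[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical 3-D item embeddings; similar items sit near each other in space.
items = {
    "mountain_bike": (0.9, 0.1, 0.0),
    "road_bike":     (0.8, 0.2, 0.1),
    "helmet":        (0.7, 0.3, 0.2),
    "novel":         (0.0, 0.9, 0.8),
}

print(nearest_neighbors(items["mountain_bike"], items, k=2))
# -> ['mountain_bike', 'road_bike']: the item itself, then its closest match
```

A real vector database wraps this lookup with everything else that makes a database a database: persistence, updates, filtering, access control.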
And I'll give one more example where embeddings get really interesting. For YouTube, the way YouTube works is there's two models that do the recommendation. The first model is candidate generation: it will grab the thousand nearest-neighbor videos to your user embedding. And then they'll take those embeddings and feed them into a second model to do ranking, so they'll actually use the embedding. This multimodal kind of idea of chaining models together and using embeddings as this kind of intermediary language is the future of ML. It has been for many years, but now I think people are becoming more aware of it.
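The two-stage pattern Simba describes (cheap candidate generation over the whole catalog, then a heavier ranking model over only the candidates) can be sketched like this. The embeddings, the freshness signal, and the scoring weights are all hypothetical stand-ins.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D video embeddings and one user embedding.
videos = {
    "v1": (0.9, 0.1), "v2": (0.8, 0.2), "v3": (0.1, 0.9), "v4": (0.85, 0.15),
}
user = (0.9, 0.1)
freshness = {"v1": 0.1, "v2": 0.9, "v3": 0.5, "v4": 0.4}  # stand-in signal

# Stage 1: candidate generation, a cheap nearest-neighbor lookup in
# embedding space, trims the catalog down before the expensive model runs.
candidates = sorted(videos, key=lambda v: dist(user, videos[v]))[:3]

# Stage 2: ranking, a (here trivial) second model scores only the candidates,
# blending embedding similarity with other signals like freshness.
def rank_score(video_id):
    return 0.7 * (1 - dist(user, videos[video_id])) + 0.3 * freshness[video_id]

ranking = sorted(candidates, key=rank_score, reverse=True)
print(ranking)  # freshness lifts v2 above the strict nearest neighbors
```

The design point is that the ranking model never sees the full catalog, only what the embedding index surfaced.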
And I think that even prompts and LLMs and all this, eventually it's all going to break down to, hey, we have transformers that generate embeddings. Everything is kind of going to look like either embedding-to-whatever, like images, or image-to-embedding, but everything falls back into that space.
Yeah, okay. That was one of the best explanations of what all these technologies are. I think everyone's a little bit confused lately with all the noise around all these things, but that was amazing. Thank you so much.
Of course.
Brooks, microphone is back to you again. It's all yours.
Yeah, one thing I know I can easily do as I fill in for Eric is say we are at the buzzer, which he always says. But thank you all so much for coming and chatting with us.
I know it's a busy conference for you all.
So, yeah, thanks for coming and teaching us more about MLOps.
Last question before we kind of sign off here.
Folks want to find out more about Feature Form.
Where can they go?
You mentioned GitHub, but where else can they go?
Our community is a great place to ask any questions. We share updates there, both around the product and webinars, meetups, events, all the great content that the team produces. So yeah, join us.
Awesome. Yeah, Featureform. Check it out. Thanks for joining the Data Stack Show.
Please subscribe to the show and we'll catch you next time.
Thank you.
Thank you. Thank you.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com
Thank you.