Software Misadventures - Emmanuel Ameisen - On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more - #11

Episode Date: June 11, 2021

Having led 100+ ML projects at Insight and built ML systems at Stripe scale, Emmanuel joins the show to chat about how to build useful ML products and what happens next when the model is in production.... Throughout the conversation, Manu shares stories and advice on topics like the common mistakes people make when starting a new ML project, what’s similar and different about the lifecycle of ML systems compared to traditional software, and writing a technical book.

Transcript
Starting point is 00:00:00 in the data is literally like what you're teaching your model. And so no matter how good your model is, you know, if you haven't looked at your data in a while, it's very likely that you have a bunch of garbage in there — like a bunch of, let's say, log events that are test logs, and nobody wanted to filter them out because whatever. But then you remove those and you see huge performance gains.
Starting point is 00:00:21 Or, you know, recently I had a project where literally just changing how we define what label we use. In many ways, if you enjoy working with data, I find that it can be a very powerful thing to do. Because it's usually more informative about the actual business application — you kind of get to see, oh, what is the outcome that we're trying to model? And you get a lot of performance gains. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps experts to hear their stories from the trenches
Starting point is 00:00:56 about how software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect
Starting point is 00:01:13 and wish there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders. Hello everyone, this is Guang.
Starting point is 00:01:34 Our guest for this episode is Emmanuel Ameisen. Emmanuel is a machine learning engineer at Stripe working on fraud prevention. And before Stripe, he led more than 100 ML projects at Insight, helping fellows from academia or engineering transition into ML, which is actually where I met Manu. It was super nice catching up with him and getting his stories and pro tips on things like the common mistakes people make when starting a new ML project, what's similar and different about the lifecycle of ML systems compared to traditional software,
Starting point is 00:02:05 and writing a book. Please enjoy this very educational and fun conversation with Emmanuel Ameisen. Hey, Emmanuel. It's great to have you with us today. Welcome to the show. Hello. Hello. How's it going?
Starting point is 00:02:21 Good, good, good. So, Emmanuel, we met at Insight Data Science a couple years back, which is also how I met my co-hosts, Austin and Ronak. How did you end up at Insight? I ended up at Insight initially as a fellow. I was a data scientist for a couple years, and I got really interested in deep learning and newer approaches to ML. And so I joined Insight as a fellow because I was like, it'd be a really nice way to change jobs.
Starting point is 00:02:52 And I ended up liking Insight so much that instead of going towards an ML role, I stayed there for a couple years. Nice, nice. So I guess for people that are not familiar with Insight, maybe a little bit about what does Insight do and what you do when you're there? Yeah, so Insight is a professional education company. So the idea is you'll have people that either have PhDs or postdocs or that are engineers or that were data scientists like me that want to transition to mainly roles in data, like data engineers and data scientists. And they'll come to Insight.
Starting point is 00:03:27 And then it's a project-based learning approach. And so you do a project for about a month. And then you use that project as a portfolio piece to sort of go to prospective employers and get interviews and then, you know, transition into your new career. And so what I did there is I led the artificial intelligence program, which initially was mostly about deep learning research — kind of trying to apply cutting-edge research — and then it pivoted to being some of that, plus a big focus on machine learning engineering, which was sort of what
Starting point is 00:03:59 many companies really needed at the time, and kind of there was no traditional path into it. It was a bit of a hybrid role in between data science and engineering. And so I spent a couple of years leading that program, which was super fun, and ended up through that seeing over 100 different ML projects. A lot of those projects that we did were in partnership with companies. So it was really fun to see just a vast array of companies doing ML and helping them out. That's an interesting transition, because I imagine NeurIPS was getting a lot of activity, and I think a lot of people in the field
Starting point is 00:04:39 talked about modeling, the more research aspect. Was it a difficult decision trying to switch more towards, you know, the ML engineering aspect of it, even during the program itself? That's super interesting. Yeah. Kind of, for two reasons. It's risky, right? Because it's sort of going against the trend.
Starting point is 00:05:15 So actually what was happening is that there were these — I remember very clearly this being sort of like two waves, where the companies we were talking to got really excited about the sort of research that you'd see at NeurIPS. And they'd be like, oh, we need to hire people that know how to do this. We're going to integrate this and, you know, every single part of our product is going to be great. And then, basically lagging by maybe a year or two, you had the rest of the internet — all of the Medium posts — and everybody got really excited as well. But by the time that was happening, the companies we were talking to were saying, well, you know, this is kind of great, but we actually have a bunch of researchers, and they're producing this research, and it's really good research, but it's incredibly hard
Starting point is 00:05:44 to do anything other than publish it. Like, we've tried to integrate it, but the researchers — we've partnered them with software engineers, but getting them to work together, to speak the same language, is actually really challenging
Starting point is 00:05:55 because they have different backgrounds, and making the product work is really challenging. So there was a huge need. I remember sitting down and saying, wow, we've had 20 conversations in the past month with different companies saying the exact same thing, which is: we're actually good on the ideas, we just need engineers. And so that side, in one way, became pretty clear. It was hard because it was a pivot, but the part that was the hardest, I think, was that
Starting point is 00:06:22 then that meant that we were misaligned with the hype cycle, where, as everybody was hearing, oh my god, AI, deep learning, all this has to be great, they would apply to Insight, and then they'd come to interview and say, well, what I really want to do is, you know, this very specific computer vision deep learning. And we would have just come off of, again, 20 calls with companies that were like, we will never hire someone like this, we have too many of them — or not never, of course, but that really wasn't their need. And so there was a bit of a mismatch.
Starting point is 00:06:49 We were kind of a two-sided marketplace, and there was a bit of a mismatch between what roles people thought were out there and what employers were actually looking for. And so that was maybe the trickiest part — kind of just getting the messaging out, to be honest: here's what people are actually looking for. And so we did a lot of work around blog posts and putting white papers out and trying to explain, like, actually, this is probably a promising career direction if you're interested in it.
Starting point is 00:07:16 Do you think that hype has settled now? Or do you think that is still the case where, like, a lot of people who aren't in the field or in the weeds think, oh, look at this amazing research on AI and deep learning models. But actually, when you talk to the companies, they're like, oh, we need someone who does more on the engineering side with machine learning, but not so much on the research side. Do you think that hype has settled or that it has balanced out over time?
Starting point is 00:07:40 So hard to tell. I feel like I'm not the right person to ask, in the sense that anybody that works at a Silicon Valley tech company has such a biased view of what actually is probably the real distribution of use cases. But I would say that even here, which is probably pretty advanced compared to the market, it feels like there's still, in my opinion, a bit too much hype. Which is challenging, because there's a lot of really cool applications of genuinely new technologies. It's not like it's all hype. There are a lot of incredible deployments around computer vision,
Starting point is 00:08:15 but also just really great advances around NLP and language understanding, and big companies have shipped really cool things. But it still feels like, if you asked the average graduate in that field to tell you which proportion of jobs are actually doing that sort of cutting-edge modeling, I think they would almost certainly still overestimate it — at least that's my bet. Yeah, I think I would agree with you there based on what I've seen, but again, I am also not the right person to ask. It's getting better, though. There was a time where it was genuinely like 100% of people wanted to train deep learning models, and that was, you know, 2% of jobs, and that was really hard. Yeah, yeah. So, Manu, you
Starting point is 00:09:02 actually led the session that I participated in back in 2018-2019. What impressed me was how diverse the backgrounds of the people in my cohort were. Some were more academic — they came from, you know, CS, or PhDs in physics — but then you also have people that are software engineers that want to get more into ML. So I'm going to lead with a very bad question, but I wanted to get your thoughts around this, which is: would it be easier for software engineers to pick up ML skills, or would it be easier for AI researchers or PhDs to pick up software skills? I mean, I feel like it's actually a really good question, because I feel like at least it's a question that comes up really often, and it was definitely something that
Starting point is 00:09:49 we asked ourselves, you know, at a program that tries to recruit people so that they can then be hired by companies: which kinds of backgrounds would be the most successful? Which is a natural question to ask. Initially, I had the impression that it was easier for people coming from an experimental background to learn the engineering skills. And that's sort of because, you know, for a lot of the ML work, the experimental side is more of a mindset. And it can sometimes be harder to teach a mindset than it is to teach tools. You know, where it's like the mindset is like you formulate a hypothesis.
Starting point is 00:10:31 And then you say, okay, my hypothesis is this. And then I'm going to test it out. And then you design carefully an experiment. And you run your experiment. And then you analyze the results very carefully. And you say, okay, based on the results, this is my next step. That's sort of like a thing that you can do sometimes in engineering, but engineering is much more deterministic,
Starting point is 00:10:48 where you make a concrete design and then you implement it. And you take a person like this and you teach them Python, and they can kind of do a good job. I think as the field has progressed, my opinion has changed quite a bit, actually, because as ML gets deployed in production, a lot more of the scope of what people end up doing is everything around ML. There was a paper by Google around ML systems, and in that paper they have a graph of an ML system,
Starting point is 00:11:21 and there are, I don't know, maybe 25 boxes that represent monitoring and training and whatever, and there's one very, very small box that's "training a good model." And, you know, I think if you've worked in that domain, you've seen that that's a small proportion of the task. And so even if you're a really good engineer, and maybe it'll take you longer to get the experimentation mindset, it's certainly possible to learn it. There's so much useful work,
Starting point is 00:11:50 right? For engineers that are interested in ML, strategically it might be an easier learning path, where you'll find a role, you'll get on a team, and then you'll just kind of learn by osmosis. And so I feel like nowadays it might have flipped, where it's easier for engineers — but of course,
Starting point is 00:12:07 you know, there are no hard and fast rules. It's kind of, anybody can do it if they're motivated enough, but it's changed, I think. Nice. And so I remember when I was trying to start my ML project, it's very much — I think it's almost kind of like a startup. You're trying to build a product. You're trying to show it to potential companies at the end to kind of pique their interest. And so maybe it's not exactly customers buying a product, but still you want to have something that's sort of cool, that's shiny, but also hopefully adds value, right? So that people see. But, you know, the show is about misadventures, and I'm kind of curious, what are some of the common mistakes that people make when they go about starting their ML project? Yeah, I was trying to think about this ahead of chatting
Starting point is 00:12:58 with you both. I think there are a few, and I've given up on trying to categorize them all. But I think a really common one that used to happen all the time at Insight, that kind of happens everywhere — and I've definitely done it as well — is that the fun part of machine learning, or one of the fun parts, can be: you have your data set,
Starting point is 00:13:18 you have your training data, you have your test data and then like you're just trying your models, you're building new features and you have like this number which is like maybe your accuracy or your loss or whatever. There's a number you care about. That's your performance metric.
Starting point is 00:13:30 And you're at the casino. You're playing slot machines. You're like, all right, I'm going to try this. Oh, 0.82. All right. Oh, let me try this. Oh, 0.85. OK, great.
Starting point is 00:13:38 And you just keep going and you do your thing. And truly sometimes, I feel like it was a gambling addiction, where you'd see fellows and they'd say, well, I haven't really figured out what my product is, or if it's useful, or if anybody cares about it, or really even any of the flow around what this would be — but I found a data set of a million text documents, and my accuracy is 0.99, and I spent the last seven days on it. And you'd be like, but why? What are you trying to do? I don't really know, but look at this score — it is all the way up, it is peaked, I've solved machine learning. I think that was
Starting point is 00:14:16 — and it's still one of the biggest challenges of ML — that when you start, oftentimes you have to have this kind of delicate dance where you go between: what are we trying to do? Like, if you're trying to, I don't know, predict something new that you're going to show to users, you probably want to say, okay, well, if we're going to predict this new category, how accurate do we need to be? Okay, we need to be this accurate. Let's try. And then you fail your first try. And then you maybe change the product a little bit and go back and forth
Starting point is 00:14:48 and have this little dance. But the biggest failure mode is you give a data scientist a data set, they're gone for a week, and they'll come back with a really high number that is usually absolutely useless. And I think that was one of the most common things — we had to bring people back from the edge. Like, hey, hey, get away from the laptop. You don't need one more training run. Let's take a step back and think about this for a second.
Starting point is 00:15:15 Going deeper on that, was it usually maybe, if you think about it as a product, you don't actually need that sort of level of performance? Or is it just plainly wrong — like, you know, the test set is polluted? How do you go about convincing the fellows, hey, even though you spent seven days on this, you should look at this other thing that would probably do more important things for your project? Yeah, I mean, it can come in a variety of flavors. I'll give you a kind of example that happened multiple times. So we would do consulting projects for companies, and the idea is they'd come with a problem, they'd present it to the fellow, and the fellow would help out. And a common shape of that would be a company
Starting point is 00:16:00 comes in and says, well, you know, we've done this model for this thing and we've gotten 80% precision or whatever — we'd love it if your fellow could get that number up. And multiple times, the fellow would do some work and get the number to, like, 99% almost immediately, in a couple of days, and they'd be like, oh, have I solved it? Am I done with this project? And multiple times, as you alluded to, we looked at the test set, and I remember one case where it was predicting some sort of outcome from patient data, and it's like the same patients
Starting point is 00:16:38 were in the training set and the test set. And the patient ID was a feature, and so if you just use the patient ID, you get 100%. And in fact, it was kind of surprising that the company didn't have 100% initially. I don't really know what they were doing. But you have more subtle versions of that. It doesn't have to be patient ID, but you can have data leakage.
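To make that concrete, here's a minimal sketch of the fix for this particular kind of leakage (hypothetical column names, not the actual project): split by patient rather than by row, and drop the identifier before training.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical patient-level dataset; "patient_id" is an identifier, not a feature.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "age":        [64, 64, 31, 31, 52, 52, 47, 47],
    "marker":     [0.9, 1.1, 0.2, 0.3, 0.7, 0.6, 0.1, 0.2],
    "outcome":    [1, 1, 0, 0, 1, 1, 0, 0],
})

# Split by patient, not by row, so the same patient never appears in both
# train and test -- the model can't score well just by memorizing patients.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Drop the identifier entirely before fitting anything.
X_train, y_train = train.drop(columns=["patient_id", "outcome"]), train["outcome"]
X_test, y_test = test.drop(columns=["patient_id", "outcome"]), test["outcome"]
```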
Starting point is 00:16:56 And so if you don't ask yourself, is this too good to be true? Why is this performing so well? — your model, even though it's got 100%, is essentially useless. Like, if you were to ship it in a medical device, it'd be a dangerous and bad thing to do. And so it's worth thinking about it. That's one. But to go back to your question,
Starting point is 00:17:13 like, another common failure mode, I would say, is, like, let's say you're trying to build for a search, like, you're trying to build some, like, search model where I type a query, and you help me, you know, with what my query is going to be and you do like Google autocomplete. One approach is like you literally try to guess the sequence of characters that I'm going to output. And you can do that. And in fact, I think like at some point it makes sense to do that. But it's pretty hard. There's like a wide space of what I could write. But, you know, you could spend weeks and
Starting point is 00:17:46 weeks like just iterating on the sequence to sequence models, like some complex models, like predict the sequence of characters. Whereas in fact, maybe you could take a much simpler approach where instead, like I start typing my characters, and you just suggest like one of five categories, like maybe you're searching about books, maybe it's about mattresses. And like, that is much easier. And that's something you could do in a couple of days. But it might be 90% as useful. And so sometimes it's just about thinking back and saying, my first results were bad. Could I change the product?
Starting point is 00:18:13 Could I change the modeling approach so that it's much easier? And that's definitely a very common failure mode, where you spend way too much time on the machine learning, whereas you could save yourself a lot of time by changing your product slightly. I think these learnings from leading these projects — I think it's really cool, because I feel they mirror a lot of situations where big companies want to leverage ML.
Starting point is 00:18:39 And it's usually someone maybe from the product side that are like, hey, is this something that we can do to make our product better? So then you get someone that's new either to the dataset or to the deployment environment and then trying to build this thing out end-to-end.
Starting point is 00:18:58 I'm kind of curious, do you have any tips or how you think about how to help non-ML people that want to use ML to be more effective, I think in terms of like what they should expect, right. Cause a lot of times maybe, you know, some of the, I think the sanity checks that you mentioned are very much like, Hey, let's, you know, make sure that this thing,
Starting point is 00:19:19 maybe we can solve it by rules first before we actually try ML. But yeah, I wanted to get your thoughts around that. Yeah. I mean, that was one of the patterns that we kept seeing — that companies would invest in a lot of ML engineers and then just struggle to reap the rewards. I think the rules that you talked about are a big one, where you kind of want to call out ideas that are — well, there's machine learning and there's magic.
Starting point is 00:19:53 And some projects don't need machine learning. They need magic. Because if one of you goes to my website, I just kind of automatically know what kind of devices you like and everything about you. And I automatically just show you a page. And then I automatically know how much money you have. And I can give you the exact price that will give me as much money as possible, but also you are 100% likely to buy it. And that's just not going to happen.
Starting point is 00:20:25 That's something where maybe, over years and years, you'll get an incremental system that does something like this, but it's just not possible. And so I think rules, or at least writing out a set of heuristics, is a very good first step for anyone that doesn't even know ML. For the vast majority of ML systems, you can at least get something off the ground with rules. This is not always true — there are some things, like kind of advanced computer vision, where, you know,
Starting point is 00:20:50 maybe that's, that's less true. But for many, like, at least, concrete problems, which are usually on tabular data, you know, certainly things like some of the things that we do, it's like fraud detection, or, you know, like predicting clicks, predicting, like recommending videos, you can get pretty far with like heuristics of, well, if you've liked 10 videos of this category, maybe we give you another one. And that doesn't mean that you have to necessarily build that system first, but it can be a good heuristic to know whether you're asking for too much of your ML system.
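As a made-up illustration of that kind of heuristic — writing the rule down before reaching for a model — a few lines of Python are often enough, and they double as the baseline an eventual model has to beat:

```python
from collections import Counter

def recommend_category(liked_categories, min_likes=10):
    """Rule-based recommender: if a user has liked at least `min_likes`
    videos in one category, keep recommending that category; otherwise
    fall back to a default. No ML involved."""
    counts = Counter(liked_categories)
    if not counts:
        return "popular"                      # cold start: show popular videos
    category, n_likes = counts.most_common(1)[0]
    return category if n_likes >= min_likes else "popular"

print(recommend_category(["cooking"] * 12 + ["news"] * 3))  # -> "cooking"
print(recommend_category(["news"] * 2))                     # -> "popular"
```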
Starting point is 00:21:18 If you can't express it that way, it might be that you just want something magic. I think that's the first thing is scoping well. I think the only other pattern I've seen is just that iteration speed is really the key. It's like the projects that are the riskiest are the ones where you want to build the world and you have this kind of holistic,
Starting point is 00:21:44 here's everything that we need in order to fully solve this problem. And that's usually challenging, because at the point where you're doing your first demo solution, you don't have enough information about your product, or about what you're trying to do, to make the correct design decisions. Usually what you want, again, is an experimental mindset of, oh, we think that people will like this new feature where we recommend videos on YouTube. We think this will be nice.
Starting point is 00:22:19 Let's just try to throw something there and see what the uptake of it is like. And then once you've thrown out your very simple model — it's taken you a couple of weeks — then you can iterate really fast. And that usually works well, because if it was a bad idea, you find out in a couple of weeks, as opposed to doing a nine-month project that then fails and then kind of scares people away from ML. I see. Speaking of iterating fast, the Insight project is very fast, right? Like four or five weeks, you
Starting point is 00:22:52 gotta deliver something. And I feel like I would be doing myself a disservice if I don't ask you: what was the most hilarious misadventure that you've seen when you were leading these projects? And you can't use mine, obviously — that would be, you know, unprofessional. All right, well, I only had one example. I think, listen, I don't want to pick on anyone in particular, but I do think that the funniest ones, to me at least, were where, you know, again, Insight's model was iterate fast, start with something simple.
Starting point is 00:23:36 And sometimes, you know, fellows decided to not do that. And they take, you know, let's say a classification problem, like where they would, like a common one was classifying support requests. Like somebody writes in as a ticket and they say they have a problem and you want to efficiently route it to say like, you know, is this about, I don't know if it was like an ISP like connection or is this about like billing? Is this about something else? And they would start and say, all right, well,
Starting point is 00:24:03 the first thing I'm going to do is I'm going to start with BERT. And they do something super complicated. But they get their pipeline and they're like, boom, BERT, 85% accuracy. And then usually the feedback we'd give them is like, it's great. We're happy you got it. But in order to contextualize this for someone, you probably want to just compare it to a simple baseline or some simple thing you do yourself. Quick thing for our listeners. Can you just explain quickly BERT? What is it?
Starting point is 00:24:31 Oh, BERT is an advanced natural language processing model where essentially it was pre-trained by Google on a very, very large corpus of data. And it performs really well when you take this very large model and kind of fine tune it on your data set, usually get some pretty good results pretty fast. But it is still pretty heavy and pretty unwieldy. And at least, you know, as of this podcast, usually not the fastest thing that you could do, but that might change. But so they'd start with this kind of like advanced
Starting point is 00:25:01 method, and also a pretty heavy-duty method. And then you'd say, all right, well, do you want to try a baseline? Maybe you could try some simple machine learning model, or maybe you could just write, you know, an if/else, a switch function — just, how would you do it? And then, you know, they would get 95% and it would just blow BERT away. And I mean, that's really a brutal situation to be in, because then you're like, well, I guess either you have to lie about how you got to this result and say, no, no, I started with a simple model and then I tried something complicated, but it didn't work out.
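For readers who want to see what such a baseline looks like in practice, here's a minimal sketch (toy tickets, hypothetical routing labels) of the two baselines being described — a keyword if/else and a simple TF-IDF plus logistic regression classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy support tickets with hypothetical routing labels.
tickets = [
    "my internet connection keeps dropping",
    "I was charged twice on my last invoice",
    "router won't connect to the network",
    "please refund the extra charge on my bill",
]
labels = ["connection", "billing", "connection", "billing"]

def rule_baseline(text):
    """Keyword if/else -- the 'fly swatter' baseline."""
    text = text.lower()
    if any(word in text for word in ("charge", "invoice", "bill", "refund")):
        return "billing"
    return "connection"

# Simple ML baseline: TF-IDF features + logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tickets, labels)

new_ticket = "I need a refund for a duplicate charge"
print(rule_baseline(new_ticket))       # -> "billing"
print(model.predict([new_ticket])[0])  # likely "billing" on this toy data
```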
Starting point is 00:25:34 But, like, it's really not a good look if you're this person that's like, well, you know, I used this massive weapon, and then I realized that a fly swatter would have been enough. Those kinds of mistakes were usually pretty brutal and, honestly, pretty disheartening to see. I think it's also demoralizing, because you're kind of like, oh, why did I do this? So that's why we insisted on that so much — we've seen that pattern a lot of times. It's interesting that, at Insight for instance, when fellows go through this process, it's unfortunate and, as you mentioned, disheartening. But again, the cost is not that high.
Starting point is 00:26:12 In many cases, they could change the project or pivot or do something else. Do you think it's also applicable to many teams and companies who are trying to do this? At least for someone like me who doesn't work on machine learning, I don't think about machine learning as the first solution to a problem that I want to solve — I'm like, how can I do it brute force, or through rules, like the switch cases. That's what I think about. But do you feel like sometimes teams fall into this trap as well, where they use machine learning to solve a problem where it's not the
Starting point is 00:26:48 best solution, or it's not the best tool for that problem? If that happens, how could one go about figuring out whether ML is the right approach for something, or whether you just need something much simpler that satisfies the spec? Yeah.
Starting point is 00:27:07 I think if you manage to write a flow chart that solves that, you'll make millions, because I think that's a really hard problem. There's a paper — again, I think it's by Google — that is titled, if I remember correctly, "Machine Learning: The High-Interest Credit Card of Technical Debt." Oh, I think I've seen that. Yeah, yeah, I've read that. It's a very catchy title. It's a catchy title, and it's an excellent paper, because, like, machine
Starting point is 00:27:34 learning systems, the issue with them is, like, they can be super valuable again. I don't think we need to make that argument here, but, like, you know, ML systems that are deployed sometimes can, like, make or break a company or certainly make millions and millions of dollars, and
Starting point is 00:27:48 usually a good ML system will outperform rules or switch cases or heuristics, almost always, but there's a huge upfront cost to getting it right — or oftentimes a large upfront cost — and a large cost of maintenance, which we can get into after,
Starting point is 00:28:05 but keeping your models running well and behaving well is really hard. I guess you just had a previous podcast episode about that, so you know all about it. But I think that the heuristic for a person to know whether machine learning is a good idea, or whether you should do rules — it depends a lot on your company environment initially. There are some companies where, you know, you have infrastructure teams
Starting point is 00:28:33 and machine learning platform teams that expose really nice primitives to you, where you can say, oh, we already have this feature store where we have a bunch of features, various attributes that I can use for a model, and then I can maybe even have a UI where I can click and train this model and I can see if it's good,
Starting point is 00:28:54 and it doesn't cost me anything to deploy it, and maybe I can have some data scientist look at it and tell me if I missed something, but it's pretty self-serve. I think that's the dream, and some companies are there. If you're there, then I actually would encourage, in those companies, I think the success stories we hear
Starting point is 00:29:10 is everybody can just experiment, and then most people have at least a few good ideas, and so it's great. That's what you want. If that's not the case, then I think you want to basically do an ROI calculation.
Starting point is 00:29:26 For teams that currently have models in production, how much time does it take them? On their roadmap, how much of it is just "keep our model alive" or "refresh our model"? And how much is that in terms of salary costs? And is your application worth at least that many dollars — I think that would be the number one thing. And if the answer is, there's no team at the company that has ML models in production, then you should almost certainly not do it if you're not an ML engineer or somebody that has experience with it. So I think it depends a lot on the company. No, that's a really good answer. I mean, that's really good advice for people thinking about machine learning in general and thinking of those tradeoffs and costs because, well, it's not free for sure.
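As a rough sketch of that back-of-the-envelope ROI check — all numbers made up, adjust for your own salaries and roadmap:

```python
def ml_project_roi(expected_annual_value,
                   build_months,
                   maintenance_fraction,
                   engineers,
                   fully_loaded_salary=250_000):
    """Back-of-the-envelope check: is the application worth at least what
    building and keeping the model alive will cost in the first year?"""
    build_cost = engineers * fully_loaded_salary * (build_months / 12)
    yearly_maintenance = engineers * fully_loaded_salary * maintenance_fraction
    return expected_annual_value - (build_cost + yearly_maintenance)

# e.g. 2 engineers, 4 months to build, ~30% of their time to keep it alive
print(ml_project_roi(expected_annual_value=400_000,
                     build_months=4,
                     maintenance_fraction=0.3,
                     engineers=2))  # positive -> maybe worth it
```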
Starting point is 00:30:10 And I definitely want to get into that lifecycle of machine learning models in general — like, once you deploy it, how do you keep it running? But before that: you saw a lot of machine learning teams structured in many different ways, who also worked in many different ways. And then you have been at Stripe for a little over a year now, I think. Almost two now. Oh, almost two years. So what made you choose Stripe? And can you talk a little bit about the team you're on and what you do? Yeah, for sure. So, I mean, I guess first of all, before I say why I chose Stripe — why I left
Starting point is 00:30:53 Insight is also because — so, at Insight I was leading these projects, and it was super fun, and I really got to see the world of ML, and I'm extremely thankful for it, because it's very rare that you can be in a role where you're at the same time working on helping an NLP project, and then trying to be on a cutting-edge reinforcement learning research thing, and then you're also doing production ML engineering for companies. It was super fun, but I missed being an IC myself. And so after Insight, I really wanted to get back to doing machine learning myself, because I really missed it. And so then I looked at various companies, and I think there's —
Starting point is 00:31:31 I mean, I don't know how much of that we should get into in this conversation, but there's a lot of different ways to structure ML companies and to have different employees on an ML team have different roles, and how you do that partnership, that's a whole realm in and of itself, and I think Stripe does that pretty well. But one of the main reasons that attracted me to Stripe is that I felt like one of the biggest challenges of ML is actually, I mean, the zero-to-one aspect we talked about, there's many failure modes, but it's actually the sort of like, oh, you've built a model and now it's serving, you
Starting point is 00:32:12 know, millions, billions of users, however many — now it's popular, now it's used — now what? Do you just keep that model there forever? How do you know if it starts doing something horribly wrong? There are so many examples of models, for example, exhibiting bias in search results — how do you even detect that ahead of time? How would you be aware of it? If you retrain your model and it looked better on your test set, your number went up — should you ship it? How do you know? That aspect of the engineering — people call it MLOps, but it's almost just
Starting point is 00:32:51 quality assurance for models, right? MLOps definitely sounds much better than model QA. I know, I know. But yeah, it's good branding, I guess. That's why I wasn't in charge of coming up with the name. But yeah, to me it actually sounded way more fascinating, and maybe because
Starting point is 00:33:10 I had been a lot in the zero-to-one world, but I really wanted to go to a company like Stripe, which has a really good engineering reputation and really strong engineers, and is also at a crucial
Starting point is 00:33:25 point in our customers', our merchants', workflows, right? It's not like we're this tool where, if we break, whatever, we'll see what happens — Stripe processes their payments, and so it's an extremely high bar. I work on the fraud team at Stripe, and so the models I ship are the ones that decide whether a payment is fraudulent or not, and block it or not. And so, you know, that's kind of a very compelling reason, let's say, to get that right, and to make that a really important part of your workflow. And so yeah, I was just fascinated with that, and wanted to both learn more about it
Starting point is 00:34:05 at a company that, you know, has seen like immense growth and has like a really good engineering reputation and try to contribute. Because like when I was looking around for resources, there wasn't much, it's not something that seems figured out. And so it was attractive for that reason.
Starting point is 00:34:18 I see, that makes sense. I mean, I think the fraud team that you mentioned, it's like the cost of getting a model or an incorrect model deployed in production is extremely high in that case. So the engineering rigor you need around the quality assurance is much needed. Talking about that, so you've touched a little on that exploration phase of going from zero to one. Can you touch on the life cycle of a machine learning model once it's deployed in production?
Starting point is 00:34:47 Like when I think about a software or when I think about a web application, for instance, that is doing, let's just say serving cat videos, in my head, I can think about, okay, this is how you would go about running it in production, keeping it alive. This is how the storage would look like and everything that goes into running it in production.
Starting point is 00:35:07 For some of the aspects that you mentioned, let's just figure out, once you train a model, how do you decide that this should be deployed in production? Let's just start there and then we can build on top of that. Yeah.
Starting point is 00:35:20 I mean, it should be easy, right? If it's better, you deploy it. Or rather, what I meant to ask is, how do you know if it's better? When you train a model and there is one running in production, how do you know the one you just trained is better than that one? Like, how do you validate that? Yeah, just to be clear, I said it should be easy.
Starting point is 00:35:42 It is definitely not. I got the sarcasm. Ron, I can take that invitation. No, for people not in machine learning. I'll play that hack. Yeah, there you go. No, it's a good question because it's hard. I think there's a few things.
Starting point is 00:35:59 So I guess the easy part is all the way at the start, where you have your model in production, and hopefully you can use it to score some set of transactions offline — let's say, for Stripe, because we're talking about fraud. You can score offline, and you score the same data set with your new model, and then you have some performance metrics. So for Stripe, it might be, how many more fraudulent transactions do we catch while, you know, maintaining the same level of false positives? And you say, oh, well, it's higher, so it's better. But then, like, what usually happens — and this kind of multiplies if your use case is more complicated, if your user base is more diverse — is that you have the top-line performance metric, and that's good, but that does not tell you whether a model is good enough to deploy
Starting point is 00:36:52 for a few reasons. One example is, let's say I train this new model, and it does catch more fraud than the old model, and it doesn't have many more false positives in aggregate — it's the same level. But it turns out that I've blocked — here, I'll use France as an example
Starting point is 00:37:10 because I'm French. I've blocked every person in France. You know, and somehow, like, I've lowered the block rate in every other country. So, like, in aggregate, you don't see it. But I've just blocked all of France. And, you know, and yet the model performs better in aggregate.
Starting point is 00:37:23 So, like, if you were just looking at the aggregate metric, you would ship it. And so I think the terminology here is, you'll have guardrail metrics, where there's a set of — you might also have sort of bias guardrails. So you're trying to make sure that, yeah, maybe your new model is better at improving the click-through rate — people click on the results more often — but also it doesn't start doing something it shouldn't, like, when you type something, it promotes suggestive content that it should not, or that sort of stuff. What's really hard about those is that the set of potential things that could go wrong is almost infinite. It's like writing out all of those, right? It's like, oh well, it doesn't block all of France; also, it doesn't happen to block everybody that's over 25, or whatever. There are basically infinite ways it could fail, and so you kind of just have to think of a representative sample of them
Starting point is 00:38:26 and then carry on. And so, you know — sorry, go ahead. And I feel like for that, a lot of times — and I feel like that's what I find interesting about ML — it does require a deep partnership with the product, because, right, how do you go about doing those slices? A lot of times it's very specific to the domain that you're in — right, like maybe age, you know, geographic information, that's
Starting point is 00:38:50 pretty general, but once you go deeper into what kind of transaction it is — has that been your experience? Actually, sorry, maybe go ahead with your original thought. No, no, 100%. Yeah, and I'll even add — there's even one more, which is that once you start adding enough metrics, just through the laws of statistics alone, every model that you want to release will be worse on at least a couple of those,
Starting point is 00:39:15 and then you have to have a product conversation, like, okay, well, if this model is better for everyone, but people that happen to live in Sunnyvale get results that are, like, 1% worse — are we okay with that? And usually you'll have these criteria that say, well, this is how much worse it can be for one of the slices we care about, because we trust that, over multiple model releases, we won't penalize the same group multiple times, and so on average,
Starting point is 00:39:42 it'll be better for everyone. But, yeah, you 100% need to have that product conversation and to define which slices you care about, because the failure modes of not having any of those guardrail metrics, or of having everything as a guardrail metric, are both bad. In both cases, you'll end up in a pretty bad situation.
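A minimal sketch of that guardrail idea — made-up data, column names, and thresholds, not Stripe's actual pipeline — comparing a candidate model to production on the top-line metric while checking per-country block rates, so that "blocking all of France" fails the release:

```python
import pandas as pd

def check_guardrails(scores, old_threshold=0.5, new_threshold=0.5,
                     max_block_rate_increase=0.05):
    """`scores` has one row per transaction scored by both models, with
    columns: country, old_score, new_score, is_fraud (bool)."""
    df = scores.copy()
    df["old_block"] = df["old_score"] >= old_threshold
    df["new_block"] = df["new_score"] >= new_threshold

    # Top-line metric: how much fraud each model catches (recall).
    fraud = df[df["is_fraud"]]
    topline = {
        "old_recall": fraud["old_block"].mean(),
        "new_recall": fraud["new_block"].mean(),
    }

    # Guardrail: the block rate on legitimate transactions must not jump by
    # more than `max_block_rate_increase` for any single country.
    legit = df[~df["is_fraud"]]
    per_country = legit.groupby("country")[["old_block", "new_block"]].mean()
    per_country["delta"] = per_country["new_block"] - per_country["old_block"]
    violations = per_country[per_country["delta"] > max_block_rate_increase]

    ok = topline["new_recall"] >= topline["old_recall"] and violations.empty
    return ok, {"topline": topline, "violations": violations}

# ok, report = check_guardrails(scores_df)  # ship only if ok is True
```

Which slices and thresholds go into a check like this is exactly the product conversation being described; the code only encodes whatever was decided there.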
Starting point is 00:40:14 Does it happen in cases where you're having this conversation with product, and some of the metrics you mentioned — which the model doesn't perform well on — become a blocker of sorts to roll out that model? So when that happens, how do you go about figuring out, okay, we should retrain this model or do something else to move forward? Like, how does one go about that? Yeah, I mean, that's fun, because there's no easy answer — that doesn't exist either, to my knowledge. So, I'll say more: you're not stuck. But okay, let's say you have one of those guardrail metrics and it's like, yeah, we end up blocking 50 percent of France. All right, well, let's try to not do that. And the question is, well, how do you do that?
Starting point is 00:40:49 So one, you know, you've done the first step of measuring it. So that's good. You can either have that broken out in your test metrics, or you can have explicit tests. There are more and more testing libraries, which is kind of cool, that say: for this model, here are a couple of examples — let's say with France as the country of origin — and let's verify that it doesn't just block them.
Starting point is 00:41:08 But so you've detected it. And the question that you're asking is, how do you fix it? And there's not really a super easy way to give an ML model a human preference. You know what I'm saying? Like, hey, you're doing great. There's just this one area where you're being really silly. And we'd love for you to not do that.
Starting point is 00:41:24 And the way that you do it is very experimental, where you'll say, okay, what we're going to do — for example, a common technique — is we're going to take all of the examples that are from France in our data set, and we're going to either upweight them, which essentially tells the model they're really important, or literally duplicate them. Literally say, all right, we're sampling
Starting point is 00:41:47 one person from each country, but for France, it's going to be 50. Then we're going to train our model, and we're just going to measure whether it's still as good in general and whether it stops being silly for France.
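A minimal sketch of those two data-side options — upweighting versus literally duplicating the slice — with made-up data (most libraries expose the first one as a per-sample weight):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data; `country` marks the slice the model mistreats.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
country = rng.choice(["FR", "US", "DE"], size=1000, p=[0.1, 0.6, 0.3])

# Option 1: upweight the slice -- tell the model French examples matter more.
sample_weight = np.where(country == "FR", 5.0, 1.0)
model = LogisticRegression().fit(X, y, sample_weight=sample_weight)

# Option 2: literally duplicate (oversample) the slice instead of weighting.
fr_idx = np.where(country == "FR")[0]
X_dup = np.concatenate([X, np.repeat(X[fr_idx], 4, axis=0)])
y_dup = np.concatenate([y, np.repeat(y[fr_idx], 4)])
model_dup = LogisticRegression().fit(X_dup, y_dup)

# Either way, you retrain and re-check the per-slice guardrail metrics
# before deciding whether the new model is actually better.
```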
Starting point is 00:42:12 The other alternative, of course, is that you could say, well, for people from France, we'll just overrule the model and say, no, no, they're good. But that is kind of a road that is extremely dangerous, because you do this for one model release, and then you release your new model, and you're like, oh, now it's people that are 18 or under. And then before you know it, you just have a horrible if-else, and it's terrible. Going back to the switch statements all over the place. The switch case, yep. That's right. Definitely — I feel like I see that at work as well, where, right, if you want to actually change something, there are two places: either you change the model or you change the data,
Starting point is 00:42:35 and a lot of times it's easier to change the data, especially if your model is more complicated. Like, if it's neural nets, it's going to be much, much harder. But if it's really simple — logistic regression, or maybe even some rules — then you can just tack that on. Has that been the case for you as well? Just easier to fiddle with stuff on the data side? Yeah, I think "easier" is right. Because you could argue — and I'm sure we might have people in the comments that would — that you could change your model objective so that you have an additional term that says — oh, and, like, a term could be,
Starting point is 00:43:12 it doesn't have to be about France. If your concern is like a certain country is getting impacted, you could say like, we're going to measure accuracy, but also we're going to have a regularization, or well, let's just like say like an additional term that lowers the model's performance score if there's a difference in performance between countries. And I would say that there's really interesting papers in that direction and some applications, but it's definitely harder currently to get that right.
Starting point is 00:43:35 I think that might not always be true. Actually, I'm pretty bullish on the idea that in the future, hopefully the near future, that will be more of a thing — that you can just kind of specify your concerns. But for now, yeah, playing with the data is definitely the easiest way to do that. But it definitely feels unsatisfying as well, right? It feels very hacky. On one hand, if you're fixing the model, you're like, oh, I'm a proper scientist, versus if you're just fixing the data, it's like, oh man, this is literally QA. But yeah — which is very important. At the same time, yeah. And I was going to say, at the same time, I don't know — it is a data scientist.
Starting point is 00:44:15 It's not a model scientist. And so there's a reason why that is, and it's because it's very much garbage in, garbage out. And I found in my career — before Insight, at Insight, and at Stripe — that you can get very large gains from just looking at the data. And it kind of makes sense, because your model is in many ways a black box that optimizes for a thing — you tell it, this is the input, just get the output right. And that's great, and you can improve that black box so it runs slightly better — you make the engine of your car better, that's great. But the data is literally what you're teaching your model. And so no matter how good your model is, you know,
Starting point is 00:44:56 if you haven't looked at your data in a while, it's very likely that you have a bunch of garbage in there — like a bunch of, let's say, log events that are test logs, and nobody wanted to filter them out, because whatever. But then you remove those, and you see huge performance gains. Or, you know, recently I had a project where literally just changing how we define what
Starting point is 00:45:16 label we use for a given transaction also gave, like, very large gains. And it's sort of like, in many ways, if you enjoy working with data, I find that it could be just a very powerful thing to do
Starting point is 00:45:29 because it's usually more informative about the actual business application. You kind of get to see like, oh, what is the outcome that we're trying to model? And you get a lot of performance gains usually. And I really love that quote, you're a data scientist,
Starting point is 00:45:44 not a model scientist. I think that's very, very well said. So, there are a couple of things which I want to dig into in what you said. One aspect is the validation itself. So if I'm thinking about a new version of some software, one way to validate whether it's performing well or not is — and again, this doesn't have a machine learning model in it. It's just a web server. We're releasing a new version.
Starting point is 00:46:08 One common way to validate is dark canaries, for instance, where you have one instance of your app on a new version that's taking traffic but not responding back. And you have logs and metrics to capture whether it's performing
Starting point is 00:46:24 as well as the old one or not, or likely you want it to perform better. You're also looking for performance regressions and things like that. This requires a lot of infrastructure to exist to validate. I assume, or I'm curious actually, what kind of infrastructure does one need to do this on the ML side? So like you have this model in production you want to validate like does a team capture all traffic replay it through tests do you have live testing going on like what are some of the things that teams do to get the signal early in the process yeah um you call this dark canary is that what you said yeah dark canary which is uh you you have a version of a product
Starting point is 00:47:03 Just one instance or a few instances. They're just not responding back — they're just taking traffic. Yeah. No, it's funny, because there's a very similar — I imagine straight-up stolen — idea in ML, which is the same thing.
Starting point is 00:47:18 But usually I've seen it called shadow, which is amazing to me because it feels like a slightly nerdier, more World of Warcraft version. It is a better term. I agree. Yes. But yeah, so you could do that. But before we get there,
Starting point is 00:47:35 okay, so I would say that, you know, the first way that you do it, again, is like, hopefully, when your production model is scoring, you're kind of like logging
Starting point is 00:47:44 both its scores and the values of the features and like anything that you could log. Because then what you can do is that you have those logs and when you train a new model, you can say, cool, let's take the logs of the last three days. You know, we haven't trained on that
Starting point is 00:47:55 and let's evaluate and see how the model does. And that gives you sort of an early estimate of performance. It's pretty good. But there's a bunch of things that can go wrong, especially if your model is using new features and you have to redefine them. You can kind of leak data from the future. There also could
Starting point is 00:48:10 be some differences between your online and offline scoring system that can make that an imperfect evaluation. You would do exactly what you said, which is you would do shadow where you basically send every request, you fork it. You send it to your main code path
Starting point is 00:48:27 that usually also has a tighter SLA. And then you also duplicate it and you send it to Shadow. And Shadow scores it and you log that somewhere. And at the end of the day, you compare production and Shadow and you say, okay, this is how they stack up. And that's a pretty common pattern. And in fact, I'm a huge proponent of it. I think that a lot of machine learning is about data, as we mentioned,
Starting point is 00:48:56 and getting consistency between the first log system I talked about and your online data — making the offline data represent online data — is so hard. It is very hard to get right. And so it's much easier, especially if you don't have that many resources, to just build the infra to do shadow, which I actually argue, at least in the case of machine learning, isn't that much. Because you literally just need another server that can expose an endpoint and that just hosts your model, and you just log.
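A minimal sketch of that shadow pattern (hypothetical model objects with a predict_proba method and a plain file for logging, not Stripe's infrastructure): serve the production decision, score the shadow model on the same request, and log both for offline comparison.

```python
import json
import time

def score_with_shadow(features, prod_model, shadow_model, log_file):
    """Serve the production decision; score the shadow model on the same
    request and log both so they can be compared offline later."""
    prod_score = prod_model.predict_proba([features])[0][1]

    # A shadow failure must never affect the live decision.
    try:
        shadow_score = shadow_model.predict_proba([features])[0][1]
    except Exception:
        shadow_score = None

    log_file.write(json.dumps({
        "ts": time.time(),
        "features": features,
        "prod_score": prod_score,
        "shadow_score": shadow_score,
    }) + "\n")

    return prod_score >= 0.5  # block/allow comes from production only
```

In a real system the shadow call would typically sit off the hot path so it can't hurt latency, but the shape is the same: fork the request, log both scores, and compare them later against the outcomes you eventually observe.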
Starting point is 00:49:24 And, like, that is a great way to measure model performance. And so I think, I don't know. I actually only remember that you mentioned that there was a similar process in normal engineering. I don't remember what your question was, but I'm a fan of Shadow. That's a great answer. I think I remember reading a post about it,
Starting point is 00:49:44 and I got very inspired and I was trying to figure out how to do it. And then I think the setup at our place is a little bit trickier. But I remember like, oh crap, I got to set this up, set this up and then have to connect this to this. And I was like, all right, maybe let's wait a quarter or two.
Starting point is 00:50:04 So for you guys, is that something that you guys maintain yourselves in terms of that entire shadow mode infrastructure? Or is there a different infra team or tooling team that helps? So we have an ML platform team that builds great tooling. And part of the tooling that they build is sort of like this infrastructure to call models. And so that's something that we can leverage. And then we can sort of like have our main production model and then have another model in shadow.
Starting point is 00:50:36 And for us, it's not much work on our end, which is really nice. So yeah, it's a lucky position to be in for sure. Nice, nice. The other aspect of running machine learning models in production is also, like when something goes wrong,
Starting point is 00:50:54 you want to be able to debug it. And again, if I'm comparing it to software, it's like, well, attach a debugger to it, worst case, and see what it is doing. How do you do that with a machine learning model? Like, what are some of the techniques that help debug a model in production? Oh, man.
Starting point is 00:51:12 Well, yeah, because I was going to say, it's already hard when it's not in production, right? So if you give a data scientist a model, I mean, to take our previous examples, right? If you give a data scientist a model, and you tell them, like, well, this model somehow has learned that all French people are fraudsters. Why? I've looked at the data and it doesn't seem clear that that's a pattern.
Starting point is 00:51:37 Machine learning is not magic, so if your model has learned that, there's something in your data set that leads to that behavior, maybe some set of correlations. Or honestly, you know, it could just be a data issue, right? Maybe, like, somehow that column didn't get filled in and it just got misinterpreted. But there's no easy path from a sort of high-level observation of what your model is doing to a resolution, other than inspecting the data. Again, with some models, you can look at the model itself.
Starting point is 00:52:11 There's some explainability to it. But it's rare that it'll answer questions as nuanced as this one. Usually explainability will give you, let's say, globally, this feature is important, like country is important, which is good, but it's not exactly what you're looking for. There are some methods for local explainability, but they also come with a bunch of caveats, and, you know, it's not debugging. I guess what I'm saying is, it's a separate problem, right? Because when you say debugging, you're saying,
Starting point is 00:52:38 there's a problem, I know this is wrong, fix it. Whereas explainability is a lot of, like, oh, this model is doing something really weird, interesting, you know, let's study it for a while. Which is not at all what you want when your model is breaking in production. I think it's such an interesting question. The way I look at it is that it's mostly about monitoring, and even that is hard, where it's like, your model, how do I say this, I'll give an example from Stripe, which is we have a model
Starting point is 00:53:13 that tells us whether transactions look fraudulent or legitimate. So that model scores a bunch of transactions, and then the question is, how would you know if that model is performing poorly due to anything, like maybe some data pipeline breaks and it's not getting the correct features anymore? And the answer is, it's really hard. Because while you can tell that, at a certain point in time, the model is doing something it shouldn't do, like
Starting point is 00:53:38 compared to another model, when you're in production, like, how do you tell the difference between our model is broken and, you know, one merchant that's using Stripe just opened up in a new country that happens to have an extremely high fraud rate? It'll probably look very similar for that merchant. You're like, oh, this is really weird, you know, we should alert on this, but your model is kind of doing the right thing. And so I think, before you even get to debugging, getting alerting right is really hard, and it's something where most alerting and monitoring systems I've used struggle really hard
Starting point is 00:54:16 to not have too many false positives, because trends are just crazy and defining what's normal is hard. I guess I've been speaking for a while, but one idea there that I find promising, that I haven't battle-tested as much as I want but that I think is interesting, is you can try to compare your current model in production
Starting point is 00:54:36 to your model in shadow, and do that not just as you're trying to deploy a new model. You could try to do that and say, are these models reacting similarly? And if one model is going crazy but not the other one, then maybe you should alert on that. But then as far as debugging it, you still have the problem of what do you do about it? Usually you roll back.
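One way that production-versus-shadow comparison could look as a daily batch check. The file names, thresholds, and the alert() hook are made up for the sketch; the only real dependency is scipy's wasserstein_distance, used here as a crude measure of how far two score distributions have moved apart.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

today = pd.read_parquet("scores_today.parquet")          # prod_score, shadow_score columns
baseline = pd.read_parquet("scores_last_week.parquet")   # same schema, trailing window

# How far has each model drifted from its own recent history?
prod_shift = wasserstein_distance(today["prod_score"], baseline["prod_score"])
shadow_shift = wasserstein_distance(today["shadow_score"], baseline["shadow_score"])

# Both models moving together often just means the world changed (a new merchant,
# a new country). One model moving alone is the more suspicious signal.
if prod_shift > 0.10 and shadow_shift < 0.05:
    alert("production model drifting while shadow stays stable")  # hypothetical hook
```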
Starting point is 00:54:58 No, it's interesting, because there's the data coming in. That could be the issue, as you described. And then it could also be the model itself. So then how do you distinguish the two? Because the data coming in is maybe almost like a, well, I guess both are kind of like an A-B testing thing where you also need to set up a sort of timeframe, right?
Starting point is 00:55:19 Because as you said, it's not a point in time. You actually need to worry about what are just statistical outliers versus what's actually broken. Yeah, exactly. You mentioned alerting. So do you also take care of alerting when you deploy a model?
Starting point is 00:55:39 I know ML teams at different companies are structured differently, but at Stripe for your team, let's say you did all of the validation that had to happen before you, say, enabled this thing in production. When you do that, let's say there was a new constraint you need to add or you need to ensure it works fine. Is alerting a part of that process that you work on
Starting point is 00:56:03 or is it a separate team that takes care of that? Um, yeah, I'd be curious how other companies do this as well, but for Stripe, the way it would work is, I think it depends at which level of alerting. So if something's happening to the features, where maybe the features aren't being populated or something and it's like an infra failure, then we would usually not design that alerting system or carry the pager for it. If it's something around maybe model throughput, that seems infra related as well. But if it's something around the ways
Starting point is 00:56:42 that the model itself is producing predictions, then that is something that our team owns. And I think that kind of makes sense, because it's something that's extremely tied to the product, right? Like, really, the product team can tell you, because, I guess, you don't only alert on anomalies, you alert on undesirable behavior, going beyond, oh, is the model performing well or not? You could say, well, we expect that this model would catch this many fraudsters a day, and we've seen that it's much lower.
Starting point is 00:57:14 And so we kind of know what that limit is, whereas an infra team might not. And so while there's kind of room to provide services so that teams can construct these alerts easily, which is the kind of tooling we're lucky enough to have at the infra level, I think it kind of makes sense that for the,
Starting point is 00:57:36 like, is your model really doing something crazy, part, it should be on the product team, because that is just a different definition for each product team, right? Yeah, that makes sense. In terms of just team structure, who are the partner teams that you work with most frequently? You mentioned product is one aspect.
Starting point is 00:57:54 I would imagine infra team would be another, but can you tell us more about how the team is structured? Yeah, yeah, yeah. So yeah, obviously, close collaboration with products. I mean, I wasn't even thinking of it as a different team. So, you know, like a product team with some embedded ML folks on it. But in some ways, it can be thought of as a different team as well. But, yeah, very close collaboration with product.
Starting point is 00:58:19 Then, you know, other than just collaborating, I think, similar to other engineers, you have foundation teams that manage servers, infrastructure, developer tools. That's, I guess, much less of a collaboration and more like we're really thankful for their services and we use them to save so much time. We collaborate with platform machine learning teams, or machine learning platform teams, I should say. And there are kind of two aspects to that. One is serving models. And we have some blog posts around sort of, you know, our model training and serving framework,
Starting point is 00:58:59 where, like many companies, we have sort of an in-house way to define a model, you know, train it, and then deploy it, and be kind of confident that that model is idempotent and will – or sorry, I mean immutable. And will kind of do the same thing online that it did when you trained it offline. And then there's the feature computation side of it, where we collaborate closely with the team that produces systems that help us define features that, again, will be the same offline and online, where we won't have some, like, time traveling where we're seeing data from the future when we're training models. And they also own the feature store aspect of it, where we get features very quickly during transactions.
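A toy illustration of the time-traveling problem those feature systems guard against: when you build training data, each feature has to be computed as of the moment of the transaction, not as of today. The table and column names here are hypothetical; the point is the point-in-time join.

```python
import pandas as pd

# Hypothetical event log: one row per charge, with timestamps and labels.
charges = pd.read_parquet("charges.parquet")                # charge_id, card_id, charged_at, is_fraud
feature_log = pd.read_parquet("card_feature_log.parquet")   # card_id, as_of, n_declines_7d

# merge_asof needs both frames sorted by their time keys.
charges = charges.sort_values("charged_at")
feature_log = feature_log.sort_values("as_of")

# Point-in-time join: for each charge, take the latest feature value computed
# strictly BEFORE the charge happened. Using today's value would leak the future.
training_rows = pd.merge_asof(
    charges,
    feature_log,
    left_on="charged_at",
    right_on="as_of",
    by="card_id",
    allow_exact_matches=False,  # only features computed before the charge
)
```

With allow_exact_matches=False, each charge only ever sees feature values computed strictly before it happened, which is the offline mirror of what an online feature store would have served at that moment.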
Starting point is 00:59:54 I see. That makes sense. One thing that you mentioned, it kind of reminded me of something we were discussing before we started recording, which was versioning of the models, or rather the code itself that trained the model. If I'm thinking about, again, the regular software development lifecycle, there comes a point where you decide to deprecate a system because it has served its purpose. And you're like, yeah, we got to build a new one. Does that ever happen with ML models? Where you say, oh, this thing has been running really well for the last X number of years.
Starting point is 01:00:18 And now it's time to replace it. And if so, well, like you said, in some cases, either you don't have the code that trained that model or the data that trained that model. So like, how do you think about that? Yeah, I mean, ML is crazy for that reason. Again, like for more context,
Starting point is 01:00:36 yeah, I guess what we're talking about is, it's very possible, you know, if you have a good model hosting service, that all you need is to train the model and, you know, serialize it, and then the model serving service just takes care of it and it can take in requests for pretty much forever. And so if you did that four years ago, and your training code base has changed, and you trained it on a version of scikit-learn that's
Starting point is 01:00:59 deprecated and you know like maybe it doesn't even work on your current infrastructure, you can still have this model that works well and know that it's impossible for you to retrain it. And as much as possible, I think Stripe tries pretty hard for that never to happen. And so when you train a model, the idea is that as part of that, you want to serialize and keep like,
Starting point is 01:01:23 what's the data set, train and test, what are all of the features, what's the version, you know, like the git hash that you trained it on. And so, you know, maybe there could still be factors that could make you unable to retrain that model, but at least you can reconstruct almost exactly what it was. But I think it's a hard pattern to get right. And I'll say this. I think the main failure mode is that if I'm an ML engineer and I'm on a project, and let's say I'm building this new model for this new use case, it's never been done before,
Starting point is 01:01:58 I build my model, it's good, I ship it, and I move on to something else. And it's possible that retraining this model every month, or maybe every year, would be very valuable. But it's like, I'm not going to do it. You know, I have other things to do, or usually it'll just fall below the cut line of the projects that most people have. And most models don't have that baked in. And so, you know, it's obviously an opportunity at the infra level where you could say, okay, well, maybe it's something that we could set up where, if a model is too old, we say, hey, we need to retrain it, because we're afraid that we're going to lose the ability to do so. But then you go back to what we were talking about, which is that it's not trivial to automate. It's like, you retrain it and then you block all of France again.
Starting point is 01:02:44 And then if there's not an engineer to look at it, you're gonna auto-ship it, and that's gonna be terrible. And so, I don't know, I honestly think that's why this world is fascinating to me. It feels like we almost don't have the right vocabulary for it yet. Like, we are, you know, kind of mad scientists that create these magic black boxes that really do good things for us. We're like, oh, this black box, it's great, let's put it on a server and keep it there and hopefully it keeps being kind to us over the years. And I think there's some aspect of seeing not only, you know, model lineage and model development, but kind of almost like the tree of life for models,
Starting point is 01:03:25 or how you evolve from one model to the next, which is just not... I don't know, I haven't read anything great about it, and I think every team is kind of figuring it out for itself. It's fascinating and challenging.
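A minimal sketch of the kind of bookkeeping described above: serializing the model together with the dataset references, feature list, git hash, and training date, so that years later you can at least reconstruct how it was built, and notice when it is getting dangerously old. All paths and field names are hypothetical.

```python
import json
import subprocess
from datetime import datetime, timezone

import joblib

def save_model_with_lineage(model, feature_names, train_path, test_path, out_dir):
    """Write the model plus enough metadata to reconstruct (or at least date) it."""
    joblib.dump(model, f"{out_dir}/model.joblib")
    metadata = {
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "git_hash": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
        "features": feature_names,
        "train_dataset": train_path,
        "test_dataset": test_path,
    }
    with open(f"{out_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

def is_stale(metadata_path: str, max_age_days: int = 365) -> bool:
    """Flag models old enough that retraining them may already be getting hard."""
    with open(metadata_path) as f:
        trained_at = datetime.fromisoformat(json.load(f)["trained_at"])
    return (datetime.now(timezone.utc) - trained_at).days > max_age_days
```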
Starting point is 01:03:45 Yeah, on one hand, I feel like it's just a matter of time for this whole CI/CD thing to maybe kind of permeate to other areas like ML. But on the other hand, exactly what you were talking about, where you have to have product take a look whenever you want to deploy something, or when you do block half of France, like, coming up with those criteria, I feel like it is a lot trickier to automate, even if your tooling keeps on getting better. But yeah, I guess we'll see. Yeah. I mean, the other interesting part is that when something is working well,
Starting point is 01:04:10 we don't tend to look at it. At least I don't. Because when something's working, it's like, yeah, don't touch it. It's doing its thing. And unintentionally, we kind of abandon it. I mean, abandon is a strong word, but yeah, we kind of abandon that specific thing, in this case a model, or in other cases software in general. Yeah. Oh, that's a really good point. In fact, if you think about it, you have
Starting point is 01:04:33 asymmetrical outcomes, right? Where it's like, you have this model and you could retrain it and maybe you'd be, like, 0.5% better. And maybe that's great, maybe that's a million dollars a year. But there's also, like, a 1% chance that you break everything, and, you know, is that really worth it? Even if it's not at the company level, to you, as the person that's going to do it, are you going to make that decision of taking that gamble? Most people are pretty risk averse, and so they'll say, to your point, yeah, you know, it's working well enough, we don't need that extra million dollars. Yeah, yeah. Well, we're getting towards the end of the conversation. And I would be doing this podcast a disservice if we didn't mention that you wrote a successful book, Building Machine Learning
Starting point is 01:05:15 Powered Applications. We'll link it in our show notes. And we highly recommend our listeners go check out the book. Can you tell our listeners a little bit about what the book is and what kind of audience it would be most relevant to at this point? Yeah, thanks so much. Yeah, I'll just do a 25-minute overview, if that's okay. Go for it. Perfect. No, I think it talks a lot about what we've talked about here.
Starting point is 01:05:46 I think specifically it talks a lot about that first part, which is how, from my experience as a data scientist and leading projects at Insight, I've seen a lot of good and bad ways to quickly build an ML project for a product application. So it's really for that zero to one. You have a product idea. Maybe you have some experience with ML. for a product application. So it's really for that, like, zero to one. You know, you have a product idea. Maybe you have some experience with ML.
Starting point is 01:06:08 Maybe you don't. You have some... It requires a little bit of Python experience, or at least there's some Python code to read. And you want to kind of bridge that gap between, like, product idea or, you know, just, like, having done some data science on the side and kind of making it useful.
Starting point is 01:06:24 And the goal is, yeah, to focus more on the process of machine learning and less about the theory of it. There are a lot of excellent books about the theory or how you use the frameworks. And I wanted to write one that was about that experimental mindset of, so you want to ship an ML product, what do you do next? And so, yeah, that was the goal of the book, and so
Starting point is 01:06:46 far, from the feedback I've gotten, that seems to be the target demographic that really benefits from it the most. So if that's something that resonates with anyone, I would suggest it. There's also, I should say, a free chapter available online that maybe I could send to you, and you can add it to the
Starting point is 01:07:01 show notes for people that want to check it out. Yeah, for sure. What would be a good website for people to go check out this book? Oh, if you just go to mlpowered.com, you will have that book. That's machine learning, mlpowered.com. And if you go to that slash book, that's where the free chapter is. Awesome. We'll certainly link that in our show notes and we highly encourage our listeners to check it out. I have a slightly different question about the book.
Starting point is 01:07:31 I'm actually a lot more curious about the process that you kind of went through to write the book itself. And what I mean by that is there might be some engineers out there who have some ideas or want to put out a book. How does one start? In this case, O'Reilly is the publisher of the book. For people who want to either find a publisher or just think about like, hey, I have this idea, should this be a book? And if it should be, what do I do next? Can you talk a little bit about what that process looked like for you?
Starting point is 01:08:03 Oh, talk about a short, you know, one minute answer here that I can give you. Okay. So just to summarize, the questions were, how should you know if you can write a book? How would you go about writing a book? How would you know if it's a useful book to write? Is that right? Yes. Yeah.
Starting point is 01:08:21 Easy peasy. But I will say that O'Reilly is a great publisher, and the simplest answer is they have a submission template: if you want to write a book, you can fill out the template and send it to them and tell them,
Starting point is 01:08:37 I want to write a book about this. And it has three or four pages of really good questions that you definitely should ask yourself before you write a book: who's my audience? Like, why don't they know about this? Is there another book? Are there no resources?
Starting point is 01:08:52 If there are no resources, is it because no one's interested in it? You know, kind of like market sizing, but also, why are you the right person to talk about this? I think that's a great process even if you don't end up submitting it, just kind of working through it yourself. The other thing I'll say is, for me, the way I got started is I started by writing blog posts. And I recommend that to anyone that wants to write a book. Because writing a book is a lot of work. It takes a very long time. People warned me that it would take a long time. And it took much longer than I thought it would. Whereas writing a blog post is already hard, actually.
Starting point is 01:09:24 But it sort of, like, could tell you, first, do you like this? If you hate writing a blog post, writing a book might not be for you. But also, it can help you gauge your audience size. And I guess that's the last point for me, which is I started by writing blog posts because I developed strong opinions about machine learning from doing all these projects. And after a few blog posts, I wrote a blog post that was extremely popular, around sort of how to do NLP projects. I think that blog post now has like half a million reads, and that particular blog post was the reason that O'Reilly reached out, because they're like, oh, you like to write, you're okay at it, and you have an audience, so it's probably
Starting point is 01:10:00 a good idea for you to write a book. So I think that's, like, a natural thing to do: try that for a while, see if you like it, and if you do, go forth with the book. Oh, that's great advice, thanks for sharing that. And I know we're getting towards the end, and there is one question which we ask almost everyone on the show: what is the last tool you discovered and really liked? You know what's really tragic is that I remember listening to the most recent episode of your podcast, hearing the question and thinking, Hmm,
Starting point is 01:10:31 I should think of a great answer for this. And then I forgot about it. Oh man. The last tool. I'm gonna cheat, because it's not the last tool I used, but it did come back up recently. It's something I'd used, it's actually used in the book, but I hadn't thought about it for a while, and it came up at work because it happened to be something that really helped. Great Expectations is a library that essentially tries to make tests for machine learning models,
Starting point is 01:11:12 which is not an easy thing to do because, again, it's not clear how you test them. But I truly think that that's one of the promising areas going forward. And the reason I really liked it is that it forces you to like, you know, you build your machine learning model and it's kind of like, you know, test-driven development advocates
Starting point is 01:11:31 would say it forces you to kind of maybe be a little bit more thoughtful. You're like, wait a minute. Like, what do I actually want to guarantee? You know, to give you an example, if it's a system that catches fraud, it's like, what's an example of a transaction that's not fraud? How do you write that? Like, here's an example of this transaction, this transaction is a flawless transaction, no fraud. And kind of working through that process is both really useful and really productive.
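For flavor, a tiny example of the kind of expectation being described, using the older pandas-flavored Great Expectations API. The library's interface has changed a lot across versions, so treat this as a sketch of the idea rather than current usage, and the column names are made up.

```python
import great_expectations as ge
import pandas as pd

# A hypothetical known-good transaction: this is what "not fraud" looks like.
transactions = ge.from_pandas(pd.DataFrame([
    {"amount": 42.00, "currency": "usd", "card_country": "US", "is_fraud": 0},
]))

# Spell out what you actually want to guarantee about the data your model sees.
transactions.expect_column_values_to_not_be_null("amount")
transactions.expect_column_values_to_be_between("amount", min_value=0, max_value=100_000)
transactions.expect_column_values_to_be_in_set("currency", ["usd", "eur", "gbp"])

# Validation reports which expectations held, which is what you'd wire into a pipeline.
print(transactions.validate())
```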
Starting point is 01:11:53 So for anyone that's thinking of doing machine learning, or, you know, does it already, Great Expectations is an awesome library. Yeah. Well, you shared a lot in this conversation. Is there anything else you would like to share with our listeners? Yes. My deepest thanks and appreciation. This was awesome. Thanks for hosting this podcast. It's so fun.
Starting point is 01:12:16 Oh, yeah. Thank you so much for taking the time. Like, this was awesome. I have learned a lot talking to you on this show, and I'm sure many of our listeners will as well. Thank you so much for coming on the show, Manu. Really appreciate it. My pleasure. Thanks, Ananyu. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
