Software Misadventures - Emmanuel Ameisen - On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more - #11
Episode Date: June 11, 2021. Having led 100+ ML projects at Insight and built ML systems at Stripe scale, Emmanuel joins the show to chat about how to build useful ML products and what happens next when the model is in production. Throughout the conversation, Manu shares stories and advice on topics like the common mistakes people make when starting a new ML project, what's similar and different about the lifecycle of ML systems compared to traditional software, and writing a technical book.
Transcript
in the data is literally like what you're teaching your model.
And so no matter how good your model is,
you know, if like you haven't looked at your data in a while,
it's very likely that you have like a bunch of garbage in there,
like a bunch of, let's say, log events
that are test logs and they're just like,
you know, nobody wanted to filter them out because whatever.
But then you remove those and you see huge performance gains.
Or, you know, recently I had a project where, like, literally just changing how we define what label we use gave very large gains. In many ways, if you enjoy working with data, I find that it could be just a very powerful thing to do, because it's usually more informative about the actual business application. You kind of get to see, like, oh, you know, what is the outcome that we're trying to model? And you get a lot of performance gains.
Welcome to the Software Misadventures podcast,
where we sit down with software and DevOps experts to hear their stories from the trenches
about how software breaks in production.
We are your hosts, Ronak, Austin, and Guang.
We've seen firsthand how stressful it is
when something breaks in production,
but it's the best opportunity
to learn about a system more deeply.
When most of us started in this field,
we didn't really know what to expect
and wish there were more resources
on how veteran engineers overcame
the daunting task of debugging complex systems.
In these conversations,
we discuss the principles and practical tips
to build resilient software,
as well as advice to grow as technical leaders.
Hello everyone, this is Guang.
Our guest for this episode is Emmanuel Ameisen.
Emmanuel is a machine learning engineer at Stripe working on fraud prevention.
And before Stripe, he led more than 100 ML projects at Insight,
helping fellows from academia or engineering transition into ML,
which is actually where I met Manu.
It was super nice catching up with him and getting his stories and pro tips on things like the common mistakes people make when starting a new ML project,
what's similar and different about the lifecycle of ML systems
compared to traditional software
and writing a book.
Please enjoy this very educational and fun conversation with Emmanuel Ameisen.
Hey, Emmanuel.
It's great to have you with us today.
Welcome to the show.
Hello.
Hello.
How's it going?
Good, good, good.
So, Emmanuel, we met at Insight Data Science a couple years back,
which is also how I met my co-hosts, Austin and Ronak.
How did you end up at Insight?
I ended up at Insight initially as a fellow.
I was a data scientist for a couple years,
and I got really interested in deep learning and newer approaches to ML.
And so I joined Insight as a fellow because I was like, it'd be a really nice way to change jobs.
And I ended up liking Insight so much that instead of going towards an ML role,
I stayed there for a couple years.
Nice, nice.
So I guess for people that are not familiar with Insight,
maybe a little bit about what does Insight do and what you do when you're there?
Yeah, so Insight is a professional education company.
So the idea is you'll have people that either have PhDs or postdocs or that are engineers or that were data scientists like me that want to transition to mainly roles in data, like data engineers and data scientists.
And they'll come to Insight.
And then it's a project-based learning approach.
And so you do a project for about a month.
And then you use that project as a portfolio piece to sort of like go to prospective employers
and get interviews and then, you know, transition into your new career.
And so what I did there is I led the artificial intelligence program. Initially it was a lot about kind of deep learning research and, you know, kind of trying to apply cutting-edge research, and then it pivoted to being some of that plus a big focus on machine learning engineering, which was sort of what many companies really needed at the time, and kind of there was no traditional path into it. It was a bit of a hybrid role in between data science and engineering.
And so spent a couple of years leading that program, which was super fun.
And ended up through that seeing over 100 different ML projects.
A lot of those projects that we did were in partnership with companies.
So it was really fun to see just a vast array of companies doing ML
and helping them out.
That's an interesting transition, because I imagine NeurIPS was getting a lot of activity, and I think a lot of people in the field talked about modeling, the more research aspect.
Was it a difficult decision trying to switch more towards like,
you know, the ML, like the engineering aspect of it,
like even during the program itself?
That's super interesting. Yeah.
Kind of for two reasons.
It's risky, right? Because it's sort of going against the trend.
Yeah. Well, so actually what was happening is that there were these, I remember very clearly this being sort of like two waves, where kind of the companies we were talking to got really excited about the sort of research that you'd see at NeurIPS.
And they'd be like, oh, we need to hire people that know how to do this.
We're going to integrate this and, you know, every single part of our product is going to be great. And then, kind of basically lagging by maybe a year or two, you had the rest of the internet, maybe all of the Medium posts, and everybody got really excited as well.
But by the time that was happening, the companies we were talking to
were saying like, well, you know, like this is kind of great,
but we actually have a bunch of researchers and they're producing
this research and it's really good research, but it's incredibly hard
to do anything other than publish it.
Like we've tried to integrate it,
but the researchers,
we've partnered them with software engineers,
but getting them to work together
to speak the same language
is actually really challenging
because they have different backgrounds
and making product work is really challenging.
So there was a huge need.
I remember sitting down and saying, like, wow, we've had 20 conversations in the past month with different companies saying the exact same thing, which is: we actually are good on the ideas, we just need engineers. And so that side, in one way, became pretty clear. It was hard because it was a pivot, but the part that was the hardest, I think, was that then that meant that we were misaligned with the hype cycle. Where, like, as everybody was hearing about, oh my god, AI, deep learning, all this has to be great, they would apply to Insight, and then, you know, they'd come to interview and say, like, well, what I really want to do is, you know, this very specific computer vision deep learning. And we would have just come off of, like, again, 20 calls with companies that were like, we will never hire someone like this, we have too many of them, you know. Or not never, of course, but, you know, that really wasn't their need.
And so there was a bit of a mismatch.
Like, we were kind of a two-sided marketplace, and there was a bit of a mismatch between what roles people thought were out there and then what employers actually were looking for.
And so that was maybe the trickiest part there was, was kind of like just getting the messaging
out, to be honest, like, here's what people are actually looking for.
And so we did a lot of work around blog posts and putting white papers out
and trying to explain, like, actually,
this is probably a promising career direction if you're interested in it.
Do you think that hype has settled now?
Or do you think that is still the case where, like,
a lot of people who aren't in the field or in the weeds think, oh,
look at this amazing research on AI and deep learning models.
But actually, when you talk to the companies, they're like,
oh, we need someone who does more on the engineering side with machine learning,
but not so much on the research side.
Do you think that hype has settled or that it has balanced out over time?
So hard to tell.
I feel like I'm not the right person to ask in a sense that
anybody that works at a Silicon Valley tech company has such a biased view of what actually is probably the real distribution of use cases.
But I would say that even here, which is probably pretty advanced compared to the market, it feels like there's still, in my opinion, a bit too much hype.
Which is challenging because there's a lot of really cool applications
of genuinely new technologies.
It's not like it's all hype.
There's a lot of incredible deployments, like around computer vision, but also just really great advances around NLP and language understanding, and big companies have shipped really cool things. But it still feels like if you asked, I feel like, the average graduate in that field to tell you which proportion of jobs are actually doing that sort of cutting-edge modeling, I think they would almost certainly overestimate it still. At least that's my bet.
Yeah, I think I would agree with you there based on what I've seen, but again, I am also not the right person to ask.
It's getting better, though. There was a time where it was genuinely like 100% of people wanted to train deep learning models, and that was, you know, two percent of jobs, and that was really hard.
Yeah, yeah. So Manu, you actually led the session that I participated in back in 2018, 2019. But so what impressed me was sort of how diverse the backgrounds of the people in my cohort were. Some were more academic, they come from, you know, CS or PhDs in physics, but then you also have people that are, you know, software engineers that want to get more into ML. So I'm going to lead with a very bad question, but I wanted to get your thoughts around this,
which is to say, would it be easier for software engineers to pick up ML skills, or would it be easier for AI researchers or PhDs to pick up software skills?
I mean, I feel like it's actually a really good question because I feel like
at least it's a question that comes up really often, and it was definitely something that we asked ourselves, you know, at a program that tries to recruit people so that they can then be hired by companies. It's like, you know, which kind of backgrounds would be the most successful, which is a natural question to ask. Initially, I had the impression that it was easier for people coming from an experimental background
to learn the engineering skills.
And that's sort of because, you know, for a lot of the ML work,
the experimental side is more of a mindset.
And it can sometimes be harder to teach a mindset than it is to teach tools.
You know, where it's like the mindset is like you formulate a hypothesis.
And then you say, okay, my hypothesis is this.
And then I'm going to test it out.
And then you design carefully an experiment.
And you run your experiment.
And then you analyze the results very carefully.
And you say, okay, based on the results, this is my next step.
That's sort of like a thing that you can do sometimes in engineering,
but engineering is much more deterministic,
where you make a concrete design and then you implement it.
And you take a person like this and you teach them Python
and they can kind of do a good job.
I think as the field has progressed,
my opinion has changed quite a bit, actually,
because as ML gets deployed in production, a lot more of the scope of what people end up doing is everything around ML. There was a paper by Google around ML systems, and in that paper they have a graph, and it's like ML systems, and there's, you know, I don't know, maybe 25 boxes that represent, like, monitoring and training and whatever, and there's one very, very small box that's like training a good model. And that's, you know, probably, I think if you've worked in that domain, you've seen that that's a small proportion of the task. And so even if you're a really good engineer, and maybe it'll take you longer to get the experimentation mindset, it's certainly possible to learn it. There's so much useful work, right? For engineers that are interested in ML, strategically, it might be an easier learning path where it's like,
you'll find a role,
you'll get on a team and then you'll just like kind of like learn by osmosis.
And so I feel like nowadays it might've flipped where it's easier for
engineers,
but of course,
you know, there's no hard and fast rules. It's kind of anybody can do it if they're motivated enough, but it's changed, I think. Nice. And so I remember when I was trying
to kind of start my ML project, it's very much, you know, trying to, I think it's almost kind of
like a startup. You're trying to build a product.
You're trying to show it to potential companies at the end to kind of pique their interest.
And so maybe it's not exactly customers buying a product, but still you want to have something that's sort of cool, that's shiny, but also hopefully add value, right?
So that people see. But, like, you know, the show is about misadventures, and I'm kind of curious, what are some of the common mistakes that people make when they kind of go about starting their ML project?
Yeah, I was trying to think about this ahead of chatting with you both. I think there's a few, and I've given up on trying to categorize them all. But I think a really common one
that used to happen all the time at Insight
that kind of happens everywhere
and like I've definitely done it as well
is that the fun part of machine learning
or one of the fun parts can be like
you have your data set,
you have your training data,
you have your test data
and then like you're just trying your models,
you're building new features
and you have like this number
which is like maybe your accuracy or your loss or whatever.
There's a number you care about.
That's your performance metric.
And you're at the casino.
You're playing slot machines.
You're like, all right, I'm going to try this.
Oh, 0.82.
All right.
Oh, let me try this.
Oh, 0.85.
OK, great.
And you just keep going and you do your thing.
And truly sometimes, I feel it was like a gambling addiction, where you'd see fellows and, you know, they'd say, well, I haven't really figured out what my product is, or if it's useful, or if anybody cares about it, or really even any of the flow around what this would be. But I found a data set of a million text documents, and my accuracy is, like, 0.99, and I spent the last seven days on it. And you'd be like, but why? What are you trying to do? And they'd be like, I don't really know, but look at this score, it is all the way peaked, I've solved machine learning. I think that was, and it's still, one of the biggest challenges of ML: when you start, oftentimes you have to have this kind of delicate dance where you go between, like, what are we trying to do?
Like if you're trying to like, I don't know, you know, like predict something new that you're going to show to users.
You probably want to say like, okay, well, if we're going to predict this new category, you're like, how accurate do we need to be?
Okay, we need to be this accurate.
Let's try.
And then you fail your first try.
And then you like maybe change the product a little bit and go back and forth
and have this little dance. But the biggest failure mode is you give a data scientist
a data set, they're gone for a week and they'll come back with a really high number that is
usually absolutely useless.
And I think that was one of the most common things, that we had to bring people back from the edge. Like, hey, hey, hey, get away from the laptop.
You don't need one more training run.
Let's take a step back and think about this for a second.
Going deeper on that, was it usually maybe, if you think about it as a product,
you don't actually need that sort of level of performance?
Or is it just plainly wrong? Like, you know, the test set is polluted. How do you go about convincing the fellows, hey, even though you spent seven days on this, you should look at this other thing that would probably do, you know, more important things for your project?
Yeah, I mean, it can come in a variety of flavors. I'll give you a kind of example that happened multiple times. So we would do consulting projects for companies, and the idea is they'd come with a problem and, you know, present it to the fellow, and the fellow would help out. And a common shape of that would be: a company comes in and says, well, you know, we've done this model for this thing and we've gotten 80% whatever precision, we'd love it if your fellow could get that number up. And multiple times, you know, the fellow would do some work and get the number to, like, 99%, almost immediately, like in a couple days. And you'd be like, oh, you know, have I solved it? Am I done with this project?
And multiple times, as you alluded to,
we looked at the test set,
and I remember one case where it was predicting some sort of outcome from patient data,
and it's like the same patients
were in the training set in the test set.
And the patient ID was a feature,
and so if you just use the patient ID,
you get 100%.
And in fact, it was kind of surprising that the company didn't have 100% initially.
I don't really know what they were doing.
But you have more subtle ways of that.
It doesn't have to be patient ID, but you can have data leakage.
And so if you don't ask yourself, is this too good to be true?
Why is this performing so well?
Your model is essentially, even though it's got 100%, is useless.
Like, if you were to ship it in a medical device,
it'd be, you know, like, a dangerous and bad thing to do.
And so it's worth thinking about it.
That's one.
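To make the leakage fix concrete, here is a minimal sketch, assuming a pandas DataFrame with hypothetical patient_id and outcome columns, of splitting by patient so the same patient never lands in both train and test and the identifier never becomes a feature:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("patients.csv")  # hypothetical file with patient_id, outcome, and feature columns

# Every row for a given patient goes entirely into train or entirely into test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))

# The identifier is dropped so the model can never memorize it as a feature.
feature_cols = [c for c in df.columns if c not in ("patient_id", "outcome")]
X_train, y_train = df.iloc[train_idx][feature_cols], df.iloc[train_idx]["outcome"]
X_test, y_test = df.iloc[test_idx][feature_cols], df.iloc[test_idx]["outcome"]
```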
But to go back to your question,
like, another common failure mode, I would say,
is, like, let's say you're trying to build for a search,
like, you're trying to build some, like, search model
where I type a query,
and you help me, you know, with what my query is going to be and you do like Google autocomplete.
One approach is like you literally try to guess the sequence of characters that I'm going to output.
And you can do that. And in fact, I think like at some point it makes sense to do that. But it's pretty hard. There's like a wide space of what I could write.
But, you know, you could spend weeks and
weeks like just iterating on the sequence to sequence models, like some complex models,
like predict the sequence of characters. Whereas in fact, maybe you could take a much simpler
approach where instead, like I start typing my characters, and you just suggest like one of five
categories, like maybe you're searching about books, maybe it's about mattresses. And like,
that is much easier. And that's something you could do in a couple of days.
But it might be 90% as useful.
And so sometimes it's just about thinking back and saying, my first results were bad.
Could I change the product?
Could I change the modeling approach so that it's much easier?
And that's definitely a very common failure mode, where you spend way too much time on
the machine learning, whereas you could save yourself a lot of time
by changing your product slightly.
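As an illustration of that reframing, a toy sketch (with made-up training examples) that treats autocomplete as a small classification problem over query prefixes rather than character-by-character generation might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: partial queries mapped to a handful of intent categories.
train_prefixes = ["harry pot", "lord of the ri", "memory foam", "king size matt", "dune", "twin xl"]
train_categories = ["books", "books", "mattresses", "mattresses", "books", "mattresses"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams suit short prefixes
    LogisticRegression(max_iter=1000),
)
clf.fit(train_prefixes, train_categories)
print(clf.predict(["mattre"]))  # expected to lean towards "mattresses"
```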
I think these learnings from leading these projects,
I think it's really cool
because I feel they proxy a lot of times
where big companies, they want to leverage ML.
And it's usually someone maybe from the product side
that are like, hey, is this something that we can do to make our product better?
So then
you get someone
that's new either to the dataset
or to the
deployment environment and then trying to build this
thing out end-to-end.
I'm kind of curious,
do you have any tips or how you think
about how to help non-ML
people that want to use ML to be more effective,
I think in terms of like what they should expect, right.
Cause a lot of times maybe, you know, some of the,
I think the sanity checks that you mentioned are very much like, Hey,
let's, you know, make sure that this thing,
maybe we can solve it by rules first before we actually try ML.
But yeah, I wanted to get your thoughts around that.
Yeah.
I mean, that was one of the patterns that we kept seeing
that companies would invest in a lot of ML engineers
and they just struggled to reap the rewards.
I think the rules that you talked about is a big one, where you kind of want to call out ideas that are... there's machine learning and there's magic.
And some projects don't need machine learning.
They need magic. Because if one of you goes to my website, I just kind of automatically know what kind of devices you like and everything about you.
And I automatically just show you a page.
And then I automatically know how much money you have.
And I can give you the exact price that will give me as much money.
But also you are 100% likely to buy it.
And that's just not going to happen.
That's something that maybe over years and years and years you'll get an incremental system that does something like this, but it's just not possible.
And so I think rules or at least writing out a set of heuristics
is a very good first step for anyone that doesn't even know ML.
For the vast majority of ML systems
you can at least get something off the ground with rules.
This is not always
true. There's some things with like, kind of like advanced computer vision, where, you know,
maybe that's, that's less true. But for many, like, at least, concrete problems, which are
usually on tabular data, you know, certainly things like some of the things that we do,
it's like fraud detection, or, you know, like predicting clicks, predicting, like recommending
videos, you can get pretty far with like heuristics of, well, if you've liked 10 videos of this category,
maybe we give you another one.
And that doesn't mean that you have to necessarily build that system first,
but it can be a good heuristic to know
whether you're asking for too much of your ML system.
If you can't express it that way,
it might be that you just want something magic.
I think that's the first thing is scoping well.
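A minimal sketch of the kind of heuristic baseline being described, assuming a hypothetical catalog keyed by category, could be as simple as:

```python
from collections import Counter

def recommend(liked_categories, catalog_by_category, top_n=5):
    """Recommend from the category the user has engaged with most, before any ML."""
    if not liked_categories:
        return []  # cold start: fall back to something else
    favorite, _ = Counter(liked_categories).most_common(1)[0]
    return catalog_by_category.get(favorite, [])[:top_n]

# Toy usage: someone who liked ten cooking videos gets more cooking videos.
print(recommend(["cooking"] * 10 + ["travel"], {"cooking": ["video_a", "video_b"]}))
```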
I think the only other pattern I've seen
is just that iteration speed is really the key.
It's like the projects that are the riskiest
are the ones where you want to build the world
and you have this kind of holistic, here's everything that we need in order to fully solve this problem. And that's usually challenging, because at the point where you're doing your first demo solution, you don't have enough information about your product or what you're trying to do to make the correct design decisions.
Usually what you want, again, is an experimental mindset of, oh, we think that people will
like this new feature where we recommend videos on YouTube.
We think this will be nice.
Let's just try to throw something there and see what the uptake of it is like.
And then once you've thrown your very simple model,
it's taken you a couple of weeks, then you can iterate really fast.
And that usually works well because if it was a bad idea,
you find out in a couple of weeks,
as opposed to doing a nine-month project that then fails
and then kind of scares people away from ML.
I see. Speaking of iterating fast, the Insight project is very fast, right? Like four or five weeks, you gotta give something up. And I feel like I will be doing myself a disservice if I don't ask you, what was the most hilarious misadventure that you've seen when you were leading these projects? And you can't use mine, obviously, that would be, you know, unprofessional.
All right, well, I only had one example.
I think, like, listen, I don't want to pick on anyone in particular, but I do think that there's, like, the kind of funniest ones, to me at least, where, you know, again,
like Insights Model was iterate fast,
start with something simple.
And sometimes, you know, fellows decided to not do that.
And they take, you know, let's say a classification problem,
like where they would, like a common one was classifying support requests.
Like somebody writes in as a ticket and they say they have a problem
and you want to efficiently route it to say like, you know, is this about,
I don't know if it was like an ISP like connection or is this about like billing?
Is this about something else?
And they would start and say, all right, well,
the first thing I'm going to do is I'm going to start with BERT. And they do something super complicated.
But they get their pipeline and they're like, boom, BERT, 85% accuracy. And then usually the
feedback we'd give them is like, it's great. We're happy you got it. But in order to contextualize
this for someone, you probably want to just compare it to a simple baseline or some simple
thing you do yourself.
Quick thing for our listeners.
Can you just explain quickly BERT?
What is it?
Oh, BERT is an advanced natural language processing model
where essentially it was pre-trained by Google
on a very, very large corpus of data.
And it performs really well
when you take this very large model and kind of fine
tune it on your data set, usually get some pretty good results pretty fast. But it is still pretty
heavy and pretty unwieldy. And at least, you know, as of this podcast, usually not the fastest thing
that you could do, but that might change. But so they'd start with this kind of like advanced
method, and also a pretty heavy-duty method. And then we'd say, all right, well, do you want to try a baseline? Maybe you could try some simple machine learning model, or maybe you could just write, you know, an if-else, like a switch function, just say, how would you do it? And then, you know, they would get 95%, and it just, like, blows their BERT away. And, I mean, that's really a brutal situation to be in, because then, you know, you're like, well, I guess either you have to lie about how you got to this result and say, no, no, I started with a simple model and then I tried something complicated but it didn't work out. But, like, it's really not a good look if you're this person that's like, well, you know, I used this massive weapon and then I realized that a fly swatter would have been enough. Those kinds of mistakes were usually pretty brutal, and honestly pretty disheartening to see. I think it's also demoralizing, because you're kind of like, oh, why did I do this? So that's why we insisted on that so much, is that we've seen that pattern really a lot of times.
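For context, the if-else baseline being described can be this small; the categories and keywords below are invented for illustration:

```python
def route_ticket(text: str) -> str:
    # Categories and keyword lists are made up; a real set would come from the product.
    t = text.lower()
    if any(word in t for word in ("wifi", "router", "connection", "outage", "slow")):
        return "connectivity"
    if any(word in t for word in ("invoice", "charge", "refund", "billing", "payment")):
        return "billing"
    return "other"

print(route_ticket("I was charged twice on my last invoice"))  # -> "billing"
```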
It's interesting that, like, at Insight for instance, when fellows go through this process, it's unfortunate and, as you mentioned, disheartening.
But again, the cost is not that high.
In many cases, they could change the project or pivot or do something else.
Do you think it's also applicable to many teams and companies who are trying to do this? Where, at least for someone like me who doesn't work on machine learning, I don't think about machine learning as a solution to a problem that I want to solve first. I'm like, how can I do it brute force, or through rules, for instance, like the switch cases. That's what I think about. But do you feel like sometimes teams fall into this trap as well,
where they use machine learning
to solve a problem where it's not the
best solution or it's not the best tool to solve
that problem.
If that happens, how could one go about
figuring out whether ML
is the right approach for something?
Or whether for something you just need something much simpler that satisfies the spec?
Yeah.
I think if you manage to write a flowchart that solves that, you'll make millions, because I think that's a really hard problem. There's a paper, again I think it's by Google, that I think is titled, if I remember correctly, Machine Learning: The High Interest Credit Card of Technical Debt.
Oh, I think I've seen that. Yeah, yeah, I've read that.
It's a very catchy title.
It's a catchy title, and it's an excellent
paper, because, like, machine
learning systems, the issue with them is, like,
they can be super valuable again.
I don't think we need to make that argument here,
but, like, you know, ML systems that are deployed
sometimes can, like,
make or break a company or certainly
make millions and millions of
dollars, and
usually a good ML system
will outperform
rules or switch case or heuristics,
almost always, but there's
a huge upfront cost to getting it right
or oftentimes a large upfront cost, and
a large cost of maintenance,
which we can get into that after
But, like, keeping your models running well and behaving well is really hard. I guess you just had a previous podcast episode about that, so you know all about it. But I think that the heuristic, in a sense, for a person to know whether machine learning is a good fit or whether you should do rules...
It's kind of like, it depends a lot
on your company environment initially.
There's some companies where, you know,
you have infrastructure teams
and machine learning platform teams
that expose to you really nice primitives
when you can say, oh, like,
we already have this feature store
where we have a bunch of features, you know,
I can use, like, you know, various attributes that I can use for a model,
and then I can just kind of even maybe have a UI where I can click and train this model and I can see if it's good,
and it doesn't cost me anything to deploy it,
and maybe I can have some data scientist look at it
and tell me if I missed something, but it's pretty self-serve.
I think that's the dream, and some companies are there.
If you're there,
then I actually would encourage,
in those companies,
I think the success stories we hear
is everybody can just experiment
and then most people
have at least a few good ideas
and so it's great.
That's what you want.
If that's not the case,
then I think you want to be,
basically do an ROI calculation.
For teams that currently have models in production, how much time does it take them?
On their roadmap, how much of it is just keep our model alive or refresh our model?
And how much is that in terms of salary costs?
And is your application worth at least that many dollars? I think that would be the number one thing. And if the answer is like, there's no team at the company that has ML models in production,
then like, you should almost certainly not do it if you're not like an ML engineer or like,
or somebody that has experience with it. So I think it depends a lot on the company.
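A back-of-envelope version of that ROI check, with entirely made-up numbers, might look like:

```python
engineers_on_maintenance = 2      # hypothetical headcount spending time on model upkeep
fraction_of_time = 0.3            # rough share of their roadmap that is "keep the model alive"
fully_loaded_cost = 250_000       # illustrative yearly cost per engineer, in dollars

yearly_maintenance = engineers_on_maintenance * fraction_of_time * fully_loaded_cost
print(f"The application should be worth well over ${yearly_maintenance:,.0f} per year")  # $150,000
```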
No, that's a really good answer. I mean, that's really good advice for people thinking about
machine learning in general and thinking of those tradeoffs and costs because, well, it's not free for sure.
And I definitely want to get into that lifecycle of machine learning models in general.
Like once you deploy it, what happens next? So, you saw a lot of machine learning teams structured in many different ways, who also worked in many different ways.
And then you have been at Stripe for a little over a year now, I think.
Almost two now.
Oh, almost two years. So what made you choose Stripe? And can you talk a little bit about
the team you're on and what you do?
Yeah, for sure. So, I mean, I guess first of all, before I say why I chose Stripe, why I left Insight is also relevant. So at Insight I was leading these projects and it was super fun, and I really got to see the world of ML, and I'm extremely thankful for it, because it's very rare that you can be in a role where, at the same time, you're working on helping an NLP project, and then trying to do a cutting-edge, you know, reinforcement learning research thing, and then you're also doing production ML engineering for companies. It was super fun, but I missed being an IC myself. And so after Insight, I really wanted to get back to doing machine learning myself
because I really missed it. And so then I looked at various companies and I think there's,
I mean, I don't know how much of that we should get into in this conversation, but there's
a lot of different ways to structure ML companies and to have different employees on an ML team have different roles,
and how you do that partnership, that's a whole realm in and of itself,
and I think Stripe does that pretty well.
But one of the main reasons that attracted me to Stripe is that I felt like
one of the biggest challenges of ML is actually, I mean,
the zero-to-one aspect we talked about, there's many failure
modes, but it's actually the sort of like, oh, you've built a model and now it's serving, you
know, millions, billions of users, like however many, like now it's popular, now it's used, like
now what? Like, do you just keep that model there forever? How do you know if it starts doing
something horribly wrong? Like, there's, you know, so many examples of models, for example, exhibiting bias in search results. Like, how do you even detect that ahead of time? How would you be aware of it? If you retrain your model and it kind of looked better on your test set, your number went up, should you ship it? How do you know? That aspect of the engineering
around MLOps
people call it MLOps, but it's almost just
quality assurance for
models, right?
MLOps definitely sounds much better than QA for models.
I know, I know.
But yeah, it's good branding
I guess. That's why I wasn't in charge of coming up with the name.
But yeah, to me it actually sounded way more fascinating, and maybe because I had been a lot in the zero-to-one world, I really wanted to go to a company like Stripe, which sort of has, you know, a really good engineering reputation and really strong engineers, and is also at a crucial point in our customers', our merchants', workflows, right? It's not like we're this tool where, if we break, you know, whatever, we'll see what happens. Stripe processes their payments, and so it's an extremely high bar. Yeah, I work on the fraud team at Stripe. And so the models I ship are the ones that decide whether
a payment is fraudulent or not, and blocks it or not. And so, you know, that's kind of a,
like a very compelling reason, let's say, to get that right. And to make that like a really
important part of your workflow. And so, yeah, I was just fascinated with that and wanted to both learn more about it at a company that, you know, has seen immense growth and has a really good engineering reputation, and try to contribute.
Because like when I was looking around for resources,
there wasn't much,
it's not something that seems figured out.
And so it was attractive for that reason.
I see, that makes sense.
I mean, I think the fraud team that you mentioned,
it's like the cost of getting a model
or an incorrect model deployed in production is extremely high in that case. So the engineering
rigor you need around the quality assurance is much needed. Talking about that, so you've touched
a little on that exploration phase of going from zero to one. Can you touch on the life cycle
of a machine learning model
once it's deployed in production?
Like when I think about a software
or when I think about a web application, for instance,
that is doing, let's just say serving cat videos,
in my head, I can think about, okay,
this is how you would go about running it in production,
keeping it alive.
This is how the storage would look like
and everything that goes into running it in production.
For some of the aspects that you mentioned,
let's just figure out,
once you train a model,
how do you decide that
this should be deployed in production?
Let's just start there
and then we can build on top of that.
Yeah.
I mean, it should be easy, right?
If it's better, you deploy it.
Or rather, what I meant to ask is,
how do you know if it's better?
When you train a model and there is one running in production,
how do you know the one you just trained is better than that one?
Like, how do you validate that?
Yeah, just to be clear, I said it should be easy.
It is definitely not.
I got the sarcasm.
Ron, I can take that invitation.
No, for people not in machine learning.
I'll play that hack.
Yeah, there you go.
No, it's a good question because it's hard.
I think there's a few things.
So I guess the easy part is all the way at the start, where you have your model in production, and hopefully you can use it to score some set of transactions offline, let's say, for Stripe, because we're talking about fraud. So you score offline, and you score the same data set with your new model, and then you have some performance metrics. So for Stripe, it might be like, how much fraud do we catch while, you know, maintaining the same level of false positives?
And you say, oh, well, it's higher, so, you know, it's better.
But then, like, what usually happens, and this kind of, like, multiplies if your use case is more complicated, if your user base is more diverse, is that you have the top-line performance metric,
and that's good,
but that does not tell you
whether a model is good enough to deploy
for a few reasons.
One example is,
let's say I train this new model,
and it does catch more fraud than the old model,
and it doesn't have, you know, many more false positives in aggregate. It's the same level.
But it turns out that I've blocked,
here, I'll use France as an example
because I'm French.
I've blocked every person in France.
You know, and somehow, like,
I've lowered the block rate in every other country.
So, like, in aggregate, you don't see it.
But I've just blocked all of France.
And, you know, and yet the model performs better
in aggregate.
So, like, if you were just looking at the aggregate metric, you would ship it.
And so I think, like, usually the terminology here is, you'll have guardrail metrics, where, you know, there's a set of those, and you might also have sort of bias guardrails. So you're trying to make sure that, yeah, maybe your new model is better at improving the click-through rate, you know, people click on the results more often, but also that it doesn't start, you know, misbehaving, or, like, when you type something, promoting suggestive content that it should not, or that sort of stuff. What's really hard about those is that the set of potential things that could go wrong is almost infinite. It's like writing out all of those, right? It's like, oh well, it doesn't block all of France, also it doesn't happen to block, you know, everybody that's over 25, or whatever. There are about an infinite number of ways it could fail, and so you kind of just have to think of a representative sample of them
and then carry on.
And so, you know, sorry, go ahead.
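A rough sketch of what a guardrail check like the France example could look like, assuming scored transactions in a pandas DataFrame with hypothetical score, is_fraud, and country columns:

```python
import pandas as pd

def recall_at_threshold(df, threshold):
    # Share of actual fraud that gets flagged at a given score threshold.
    flagged = df["score"] >= threshold
    return (flagged & df["is_fraud"]).sum() / max(df["is_fraud"].sum(), 1)

def guardrail_report(scored: pd.DataFrame, threshold: float):
    overall = recall_at_threshold(scored, threshold)
    per_country = scored.groupby("country").apply(lambda g: recall_at_threshold(g, threshold))
    # Flag slices that regress far below the aggregate number (the cutoff here is arbitrary).
    suspicious = per_country[per_country < 0.5 * overall]
    return overall, per_country, suspicious
```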
And I feel like for that, a lot of times,
and I feel like that's what I find interesting about ML
is that it does require a deep partnership with the product
because, right, like, how do you go about, like, doing those slices?
A lot of times it's very specific to the domain that you're in, right? Like maybe age or, you know, geographic information, that's pretty general, but once you go deeper into, like, what kind of transaction it is... Has that been your experience? Actually, sorry, maybe go ahead with your original thought.
No, no, 100%. Yeah, and I'll even add, there's even one more, which is that
once you start adding enough metrics, just through the laws of statistics alone, every model that you want to release will be worse on at least, you know, a couple of those. And then, like, you have to have a product conversation. Like, okay, well, if this model is better for everyone, you know, but people that happen to live in Sunnyvale get results that are, like, 1% worse, are we okay with that? You know, and, like, usually you'll have sort of, like,
these criteria will say, well, this is how much worse it can be for, like, one of the slices we
care about, because we trust that sort of, like, over multiple model releases, like, you know,
the same, it's sort of, like, we won't pick the same card multiple times, and so on average,
it'll be better for everyone. But, yeah, you 100% need to have that product conversation
and to define which slices you care about
because the failure modes of not having any of those guardrail metrics or of having everything as a guardrail metric are both bad.
In both cases, you'll end up in a pretty bad situation.
Does it happen in cases where you're having this conversation with product
and some of the metrics, like you mentioned,
which the model doesn't perform well on,
that becomes a blocker of sorts to roll out that model?
So when that happens, how do you go about figuring out,
okay, we should retrain this model or do something else to move forward? Like, how does one go about that?
Yeah, I mean, that's so fun, because an easy answer for that doesn't exist either, to my knowledge. So, I'll say more, you're not stuck. But, like, okay, so let's say, you know, you have one of those guardrail metrics, and it's like, yeah, we end up blocking 50 percent of France. Like, all right, well, let's try to not do that.
And the question is like, well, how do you do that?
So one, you know, you've done the first step of, like, measuring it.
So that's good.
So you can either have, like, you know, that broken out in your test metrics,
or you can have explicit tests.
There's more and more test libraries, which is kind of cool, that, like, say,
like, for this model, here's, like, a couple examples.
And, like, there's, let's say, France as the country of origin.
And let's verify that it doesn't just block them.
But so you've detected it.
And the question that you're asking is, how do you fix it?
And there's not really a super easy way
to give an ML model a human preference.
You know what I'm saying?
Like, hey, you're doing great.
There's just this one area where you're being really silly.
And we'd love for you to not do that.
And the way that you do it is very experimental,
where you'll say, okay, what we're going to do, for example,
a common technique is we're going to take all of the examples that are from France in our data set,
and we're going to either upweigh them,
which essentially tells the model they're really important,
or literally duplicate them.
Literally say, all right, we're sampling
one person from each country, but for France, it's going to be 50. Then we're going to train our model
and we're just going to measure whether it's still as good in general and whether
it stops being silly for France. The other alternative, of course, is that you
could say, well, for people from France, we'll just overrule the model
and say, no, no, they're good.
But that is kind of a road that is extremely dangerous
because you do this for one model release,
and then you release your new model,
and you're like, oh, now it's people that are 18 or under.
And then before you know it, you just have a horrible if-else,
and it's terrible.
Going back to the Switch statements all over the place.
The Switch case, yep.
That's right.
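A minimal sketch of the reweighting idea described above, using scikit-learn's sample_weight support and a hypothetical "FR" country code for the slice being upweighted:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_with_slice_upweight(X, y, countries, slice_value="FR", weight=5.0):
    # Give every example in the problematic slice extra weight, leave the rest at 1.
    sample_weight = np.where(np.asarray(countries) == slice_value, weight, 1.0)
    model = GradientBoostingClassifier()
    model.fit(X, y, sample_weight=sample_weight)
    # After retraining, the guardrail metrics should be re-checked before shipping.
    return model
```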
Definitely, I feel like I see that at work as well, where, right, if you want to actually change something, there's two places: either you change the model or you change the data. And a lot of times it's easier to change the data, especially if your model is more complicated, right? Like, if it's neural nets, it's almost, like, yeah, it's going to be much, much harder. But if it's really simple, logistic regression, or maybe even some rules, then,
you know, you can just, like, tack that on. Has that been the case for you as well? Just easier
to fiddle stuff on the data side?
Yeah, I think easier is right, right? Because you could argue, and I'm sure, like, we might have people in the comments that would, right? You could say, like, you could change your model objective
so that you have an additional term that says,
oh, and if like, you know, like a term could be,
it doesn't have to be about France.
If your concern is like a certain country is getting impacted,
you could say like, we're going to measure accuracy,
but also we're going to have a regularization,
or well, let's just like say like an additional term
that lowers the model's performance score
if there's a difference in performance between countries.
And I would say that there's really interesting papers in that direction and some applications, but it's definitely harder currently to get that right.
I think that might not always be true.
Actually, I'm pretty bullish on the idea that in the future, hopefully the near future, that will be more of a thing where you can just kind of specify your concerns. But for now, yeah, playing with the data is definitely the easiest way to do that.
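A toy sketch of that "extra term in the objective" idea, here applied at model-selection time rather than inside training, with an illustrative lam parameter controlling how much per-country disparity is penalized:

```python
import numpy as np

def penalized_score(y_true, y_pred, countries, lam=1.0):
    # Overall accuracy minus a penalty for the spread in per-country accuracy.
    y_true, y_pred, countries = map(np.asarray, (y_true, y_pred, countries))
    correct = y_true == y_pred
    overall = correct.mean()
    per_country = [correct[countries == c].mean() for c in np.unique(countries)]
    disparity = max(per_country) - min(per_country)  # gap between best- and worst-served country
    return overall - lam * disparity
```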
But it definitely feels unsatisfying as well, right? It feels very hacky. On one hand, if you're, like, fixing the model, you're like, oh, I'm a proper scientist, versus
if you're just fixing the data, it's like, oh man, this is literally QA.
But yeah, which is very important. At the same time, yeah, yeah.
And I was going to say, at the same time,
I don't know, it is a data scientist.
It's not a model scientist.
And so there's a reason why that is,
and it's because it's very much like garbage in, garbage out.
And I found in my career, before Insight, at Insight, and at Stripe, that you can get very large gains from just looking at the data. And it kind of makes sense, because your model is in many ways a black box that optimizes for a thing, and you tell it, like, this is the input, just get the output right. And that's great, and you can improve that black box so it runs slightly better, and, you know, you make the engine of your car better, that's great. But the data is literally what you're teaching your model. And so no matter how good your model is, you know, if you haven't looked at your data in a while, it's very likely that you have a bunch of garbage in there, like a bunch of, let's say, log events that are test logs, and they're just like, you know, nobody wanted to filter them out, because
whatever. But then you remove those,
and you see huge performance gains.
Or, you know, recently
I had a project where, like, literally just
changing how we define, like, what
label
we use for a given transaction
also gave, like, very large gains. And it's sort of like,
in many ways,
if you enjoy working with data,
I find that it could be
just a very powerful thing to do
because it's usually more informative
about the actual business application.
You kind of get to see like,
oh, what is the outcome
that we're trying to model?
And you get a lot of performance gains usually.
And I really love that quote,
you're a data scientist,
not a model scientist.
I think that's very, very well said.
So one thing, there are a couple of things which I want to dig into in what you said.
One aspect is the validation itself.
So if I'm thinking about a new version of a software, one way to validate whether it's
performing well or not is, and again, this doesn't have machine learning model in it.
It's just a web server.
We're releasing a new version.
One common way to validate
is to dark canaries, for instance,
where you have one instance
of your app on a new version
that's taking traffic
but not responding back.
And you have logs and metrics to capture
whether it's performing
as well as the old one or not, or likely you want it to perform better.
You're also looking for performance regressions and things like that.
This requires a lot of infrastructure to exist to validate.
I assume, or I'm curious actually, what kind of infrastructure does one need to do this on the ML side?
So, like, you have this model in production you want to validate. Like, does a team capture all traffic and replay it through tests? Do you have live testing going on? Like, what are some of the things that teams do to get the signal early in the process?
Yeah, you call this dark canary, is that what you said?
Yeah, dark canary, which is, you have a version of a product, just one instance or a few instances.
They're just not responding back.
They're just taking traffic.
Yeah.
Yeah.
No, it's funny, because there's a very similar, I mean, I imagine a very much stolen, idea in ML, which is the same thing.
But usually I've seen it called shadow, which is amazing to me because it feels like a slightly nerdier,
more World of Warcraft version.
It is a better term.
I agree.
Yes.
But yeah,
so you could do that.
But before we get there,
okay, so I would say that,
you know,
the first way that you do it,
again, is like,
hopefully,
when your production model
is scoring,
you're kind of like logging
both its scores
and the values of the features
and like anything that you could log.
Because then what you can do is that you have those logs
and when you train a new model,
you can say, cool,
let's take the logs of the last three days.
You know, we haven't trained on that
and let's evaluate and see how the model does.
And that gives you sort of like
an early estimate of performance.
It's pretty good.
But there's a bunch of things that can go wrong, especially if, like, your model is using new features and you have to redefine them. You can kind of leak
data from the future. There also could
be some
differences between your online and
offline scoring system that can make that
an imperfect evaluation.
You would do exactly what you said, which is you would do shadow
where
you basically send every request,
you fork it. You send it to your main code path
that usually also has a tighter SLA.
And then you also duplicate it and you send it to Shadow.
And Shadow scores it and you log that somewhere.
And at the end of the day, you compare production and Shadow
and you say, okay, this is how they stack up.
And that's a pretty common pattern.
And in fact, I'm a huge proponent of it.
I think that a lot of machine learning is about data, as we mentioned,
and getting consistency between the first log system I talked about.
Making that represent online data is so hard.
And it is, like, very hard to get right.
And so it's much easier, especially if you, like, don't have that many resources, to just
build the, like, infra to do shadow, which I actually argue, like, at least in the case
of machine learning, isn't that much.
Because you literally just need another server that can expose an endpoint and that just,
like, hosts your model and, like, you just log.
And, like, that is a great way to measure model performance.
And so I think, I don't know.
I actually only remember that you mentioned
that there was a similar process in normal engineering.
I don't remember what your question was,
but I'm a fan of Shadow.
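A minimal sketch of the shadow pattern as described, assuming both models expose a scikit-learn-style predict_proba; only the production score ever drives the decision, and both scores get logged for offline comparison:

```python
import json
import logging

logger = logging.getLogger("shadow_scores")

def score_request(features, prod_model, shadow_model):
    # Production score always gets computed and always drives the decision.
    prod_score = float(prod_model.predict_proba([features])[0][1])
    try:
        shadow_score = float(shadow_model.predict_proba([features])[0][1])
    except Exception:
        shadow_score = None  # a shadow failure must never affect serving
    # Log both scores so production and shadow can be compared at the end of the day.
    logger.info(json.dumps({"prod_score": prod_score, "shadow_score": shadow_score}))
    return prod_score
```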
That's a great answer.
I think I remember reading a post about it,
and I got very inspired
and I was trying to figure out how to do it.
And then I think the setup at our place
is a little bit trickier.
But I remember like, oh crap,
I got to set this up, set this up
and then have to connect this to this.
And I was like, all right, maybe let's wait a quarter or two.
So for you guys, is that something that you guys maintain yourselves
in terms of that entire shadow mode infrastructure?
Or is there a different infra team or tooling team that helps?
So we have an ML platform team that builds great tooling.
And part of the tooling that they build is sort of like this infrastructure to call models.
And so that's something that we can leverage.
And then we can sort of like have our main production model
and then have another model in shadow.
And for us, it's not much work on our end,
which is really nice.
So yeah, it's a lucky position to be in for sure.
Nice, nice.
The other aspect of running machine learning models
in production is also,
like when something goes wrong,
you want to be able to debug it.
And again, if I'm comparing it to software,
it's like, well, attach a debugger to it,
worst case, and see what it is doing.
How do you do that with a machine learning model?
Like, what are some of the techniques
that help debug a model in production?
Oh, man.
Well, yeah, because I was going to say,
it's already hard when it's not in production, right?
So if you give a data scientist a model,
I mean, to take our previous examples, right?
If you give a data scientist a model,
and you tell them, like, well, this model somehow has learned that all French people are fraudsters.
Why?
I've looked at the data and it doesn't seem clear that that's a pattern.
Machine learning is not magic, so if your model has learned that,
there's something in your data set that leads to that behavior
and maybe some set of correlations. Or it could just honestly be, you know, a data issue, right? Maybe somebody somehow forgot to fill in that column and it just got misinterpreted. But there's no easy path from an observation, from a sort of high-level observation of what your model is doing, to a resolution, other than inspecting the data.
Again, you can often look at, in some models,
you can look at the model itself.
There's some explainability to it.
But it's rare that it'll answer questions as nuanced as this one.
Usually explainability will give you, let's say,
globally, this feature is important,
like, country is important, which is good, but it's not exactly what you're looking for. There's some methods for local explainability, but they also come with a bunch of caveats. And, you know, it's not debugging. I guess what I'm saying is it's a separate problem, right? Because when you say debugging, you're saying, there's a problem, I know this is wrong, fix it. Whereas explainability is a lot of, oh, this model is doing something really weird, interesting, you know, let's study it for a while. Which is not at all what you want when your model is breaking in production. I think that there's, it's such an interesting question. The way I look at it is that it's mostly about monitoring, and even that is hard, where it's like, your model,
how do I say this
I'll give an example from Stripe
which is we have a model
that tells us whether transactions look
fraudulent or legitimate
and so that model
it scores a bunch of transactions
and then the question is how would you know
if that model is performing poorly due to anything, like maybe some data pipeline breaks and it's not getting the correct features anymore. And the answer is, it's really hard. Because while you can tell that at a certain point in time the model is doing something it shouldn't do, compared to another model, when you're in production, how do you tell the difference between "our model is broken" and "one merchant that's using Stripe just opened up in a new country that happens to have an extremely high fraud rate"? It'll probably look very similar for that merchant. There's a bunch of cases where you're like, oh, this is really weird, we should alert on this, but your model is kind of doing the right thing. And so I think before you even get to debugging, getting alerting right is really hard. And it's something where most alerting and monitoring systems I've had struggle really hard to not have too many false positives, because trends are crazy and defining what's normal is hard.
I guess I've been speaking for a while,
but one idea there that I find promising
that I haven't battle-tested as much as I want,
but that I think is interesting,
is you can try to compare your current model in production
to your model in shadow,
and do that not just as you're trying to deploy a new model.
You could try to do that and say,
are these models reacting similarly?
And if one model is going crazy but not the other one,
then maybe you should alert on that.
But then as far as debugging it,
then you still have the problem of what do you do about it?
Usually you roll back.
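As a rough illustration of that comparison idea, here is a small sketch that checks whether the production and shadow models are scoring the same traffic similarly. The use of a two-sample KS test and the thresholds are assumptions for the example, not a production-grade monitor.

# Sketch: compare recent score distributions from the production and shadow
# models on the same traffic; a large divergence hints that one is misbehaving.
import numpy as np
from scipy.stats import ks_2samp

def models_diverging(prod_scores, shadow_scores, p_value_threshold=0.01):
    # Two-sample Kolmogorov-Smirnov test on the two sets of scores.
    statistic, p_value = ks_2samp(prod_scores, shadow_scores)
    return p_value < p_value_threshold, statistic

# Synthetic scores standing in for, say, an hour of traffic.
rng = np.random.default_rng(0)
prod_scores = rng.beta(2, 8, size=5000)    # mostly low fraud scores
shadow_scores = rng.beta(2, 8, size=5000)  # a healthy shadow looks similar
diverging, stat = models_diverging(prod_scores, shadow_scores)
print(diverging, round(stat, 3))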
No, it's interesting because there's the data coming in.
That could be the issue, as you described.
And then it could also be the model itself.
So then how do you distinguish the two?
Because the data coming in is maybe almost like a,
well, I guess both are kind of like an A/B testing thing
where you also need to set up a sort of timeframe, right?
Because as you said, it's not a point in time.
You actually need to worry about what are just statistical outliers
versus what's actually broken.
Yeah, exactly.
You mentioned alerting.
So do you also take care of alerting
when you deploy a model?
I know ML teams at different companies
are structured differently,
but at Stripe for your team,
let's say you did all of the validation that had to happen
before you, say, enabled this thing in production.
When you do that, let's say there was a new constraint you need to add
or you need to ensure it works fine.
Is alerting a part of that process that you work on
or is it a separate team that takes care of that?
Um, yeah, I'm curious how other companies do this as well. But so for Stripe, the way it would work is, I think it depends at which level of alerting. So if something's
happening to the features, where maybe the features aren't being populated or something and it's an infra failure,
then we would usually not design that alerting system
or carry the pager for it.
If it's something around maybe model throughput,
that seems infra related as well.
But if it's something around the ways
that the model itself is producing predictions,
then that is something that our team owns. And I think that kind of makes sense,
because it's something that's extremely tied to the product, right? Like, really, the product team
can tell you, because I guess you don't only alert on anomalies,
you alert on undesirable behavior, going beyond, oh, is the model performing well or not?
You could say, well, we expect that this model
would have this many fraudsters a day,
and we've seen that it's much lower.
And so we kind of know what that limit is,
whereas an infra team might not.
And so while there's room to provide services so that teams can construct these alerts easily, which is the type of tooling we're lucky enough to have at the infra level, I think it kind of makes sense that the question of, is your model really doing something crazy, should be owned by the product team, because that's just a different definition for each product team, right?
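To make that concrete, a product-level alert can be as simple as checking the model's daily flag counts against a range the product team has defined. The numbers below are hypothetical.

# Sketch: alert when the number of transactions flagged as fraud in a day
# falls outside the range the product team considers normal.
def check_daily_flag_volume(flagged_count, expected_min, expected_max):
    if flagged_count < expected_min:
        return f"ALERT: only {flagged_count} flags today, expected at least {expected_min}"
    if flagged_count > expected_max:
        return f"ALERT: {flagged_count} flags today, expected at most {expected_max}"
    return None

# Hypothetical numbers: the product team expects roughly 800 to 1,500 flags a day.
alert = check_daily_flag_volume(flagged_count=120, expected_min=800, expected_max=1500)
if alert:
    print(alert)  # in practice this would page the owning team, not print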
Yeah, that makes sense.
In terms of just team structure,
who are the partner teams that you work with most frequently?
You mentioned product is one aspect.
I would imagine infra team would be another,
but can you tell us more about how the team is structured?
Yeah, yeah, yeah.
So yeah, obviously, close collaboration with products.
I mean, I wasn't even thinking of it as a different team.
So, you know, like a product team with some embedded ML folks on it.
But in some ways, it can be thought of as a different team as well.
But, yeah, very close collaboration with product.
Then, you know, other than just collaborating, I think, similar to other engineers, you have foundation teams that manage servers, infrastructure, developer tools.
That's, I guess, much less of a collaboration and more like we're really thankful for their services and we use them to save so much time.
We collaborate with platform machine learning teams,
or machine learning platform teams, I should say.
And there are kind of two aspects to that,
which is serving models.
And we have some blog posts around sort of, you know,
our model training and serving framework,
where like many companies,
we have sort of like an in-house way to define a model,
train it, and then deploy it, and be kind of confident that that model is idempotent, or sorry, I mean immutable,
and will kind of do the same thing online that it did when you trained it offline. And then there's the feature computation side of it, where we collaborate closely with that team, which produces systems that help us define features that, again, will be the same offline
and online, where we won't have some time traveling where we're seeing data from
the future when we're training models. And they also own the feature store aspect of it, where we get features very quickly during transactions.
I see. That makes sense.
One thing that you mentioned, it kind of reminded me of something we were discussing before we started recording,
which was versioning of the models, or rather the code itself that trained the model.
If I'm thinking about, again, the regular software development lifecycle,
there comes a point where you decide to deprecate a system
because it has served its purpose.
And you're like, yeah, we got to build a new one.
Does that ever happen with ML models?
Where you say, oh, this thing has been running really well
for the last X number of years.
And now it's time to replace it.
And if so, well, like you said, in some cases,
either you don't have the code
that trained that model
or the data that trained that model.
So like, how do you think about that?
Yeah, I mean, ML is crazy for that reason.
Again, like for more context,
yeah, I guess like what we're talking about is
it's very possible, if you have a good model hosting service, that all you need is to train the model, serialize it, and then the model serving service just takes care of it, and it can take in requests for pretty much forever. And so if you've done that four years ago, and your training code base has changed, and you trained it on a version of scikit-learn that's deprecated, and maybe it doesn't even work on your current infrastructure, you can still have this model that works well
and know that it's impossible for you to retrain it.
And as much as possible, I think Stripe tries pretty hard
for that never to happen.
And so when you train a model,
the idea is that as part of that,
you want to serialize and keep: what's the data set, train and test, what are all of the features, what's the version, like the git hash that you trained it on. And so, maybe there could still be factors that could make you unable to retrain that model, but at least you can reconstruct almost exactly what it was. But I think it's a hard pattern to get right.
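A minimal version of that pattern might look like the sketch below, which saves the model next to the metadata needed to approximately reconstruct it. The file layout, joblib, and the field names are one possible choice, not Stripe's actual framework.

# Sketch: persist a trained model together with everything needed to
# reconstruct (approximately) how it was trained.
import json
import subprocess
from pathlib import Path

import joblib
import sklearn

def save_model_with_metadata(model, feature_names, train_path, test_path, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out / "model.joblib")

    metadata = {
        "features": list(feature_names),
        "train_dataset": str(train_path),
        "test_dataset": str(test_path),
        # Record the exact code version and library version used for training.
        "git_hash": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "library_versions": {"scikit-learn": sklearn.__version__},
    }
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))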
And I'll say this.
I think the main failure mode is that if I'm an ML engineer
and I'm on a project, and let's say I'm building this new model
for this new use case, it's never been done before,
I build my model, it's good, I ship it,
and I move on to something else.
And it's possible that retraining this model every month, or maybe every year, would be very valuable.
But it's like, I'm not going to do it.
I have other things to do, or usually it'll just fall below the cut line of the projects that most people have.
And most models don't have that baked in. And so it's obviously an opportunity at the infra level, where you could say, okay, maybe it's something that we could set up where, if a model is too old, we say, hey, we need to retrain it, because we're afraid that we're going to lose the ability to do so.
But then you go back to what we were talking about, which is it's not trivial to automate.
It's like, you retrain it and then you block all of France again.
And then if there's not an engineer to look at it, you're going to auto-ship it, and that's going to be terrible.
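One way to keep a human in the loop is sketched below. The retrain, evaluate, and open_review_task helpers are hypothetical placeholders; the point is simply that nothing is shipped automatically.

# Sketch: flag stale models and retrain them on a schedule, but never
# auto-ship; an engineer reviews the comparison before anything is deployed.
from datetime import datetime, timedelta, timezone

MAX_MODEL_AGE = timedelta(days=180)

def maybe_retrain(model_record, retrain, evaluate, open_review_task):
    # model_record is assumed to carry a timezone-aware training timestamp.
    age = datetime.now(timezone.utc) - model_record["trained_at"]
    if age < MAX_MODEL_AGE:
        return

    candidate = retrain(model_record["id"])        # hypothetical helper
    report = evaluate(candidate, model_record)     # hypothetical helper

    # Crucially: open a review task instead of deploying automatically, so a
    # person can catch "we just blocked all of France" failures.
    open_review_task(model_record["id"], report)   # hypothetical helper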
And so, I don't know, I honestly think that's why this world is fascinating to me. It feels like we almost don't have the right vocabulary for it yet. We are kind of mad scientists that create these magic black boxes that really do good things for us, and we're like, oh, this black box, it's great, let's put it on a server and keep it there and hopefully it keeps being kind to us over the years. And I think there's some aspect of seeing not only model lineage and model development, but almost the tree of life for models,
or how you evolve from one model to the next,
which is just not...
I don't know, I haven't read anything great about it,
and I think every team is kind of figuring it out for itself.
It's fascinating and challenging.
Yeah, on one hand, I feel like it's just a matter of time
for this whole CI/CD thing
to maybe kind of permeate to other
areas like ML.
But on the other hand, exactly what you were talking about, where you have to have product
take a look whenever you want to deploy something, when you do block half of France,
like coming up with those criteria.
I feel like it is a lot trickier to automate, even if your tooling keeps on getting better.
But yeah, I guess we'll see.
Yeah.
I mean, the other interesting part is that when something is working well,
we don't tend to look at it.
At least I don't.
Because when something's working, it's like, yeah, don't touch it.
It's doing its thing.
And unintentionally, we kind of abandon it.
I mean, it's a strong word, but yeah, we kind of abandon that
specific model, in this case, or in other cases software in general.
Yeah. Oh, that's a really good point. In fact, if you think about it, you have
asymmetrical outcomes, right? Where it's like you have this model and you could retrain it and maybe
you'd be like 0.5% better. And maybe that's great. Maybe that's a million dollars a year,
but there's also, like, a 1% chance that you break everything. Is that really worth it? And even if it's not at the company level, to you, as the person that's going to do it, are you going to make that decision of taking that gamble? Most people are pretty risk averse, and so they'll say, to your point, yeah, it's working well enough, we don't need that extra million dollars.
Yeah, yeah. Well, we're getting towards the end of the conversation, and I would be doing this podcast a disservice if we didn't mention that you wrote a successful book, Building Machine Learning Powered Applications. We'll link it in our show notes. And we highly recommend our listeners go
check out the book. Can you tell our listeners a little bit about what the book is
and what kind of audience it would be most relevant to at this point?
Yeah, thanks so much.
Yeah, I'll just do a 25-minute overview, if that's okay.
Go for it.
Perfect.
No, I think it talks a lot about what we've talked about here.
I think specifically it talks a lot about that first part,
which is how, from my experience as a data scientist
and leading projects at Insight,
I've seen a lot of good and bad ways
to quickly build an ML project for a product application.
So it's really for that, like, zero to one.
You know, you have a product idea.
Maybe you have some experience with ML.
Maybe you don't.
You have some...
It requires a little bit of Python experience,
or at least there's some Python code to read.
And you want to kind of bridge that gap
between, like, product idea or, you know,
just, like, having done some data science on the side
and kind of making it useful.
And the goal is, yeah, to focus more on the process of machine learning
and less about the theory of it.
There's a lot of excellent books about the theory
or how you use the frameworks.
And I wanted to write one that was about that experimental mindset of, like,
so you want to ship an ML product.
What do you do next?
And so, yeah, that was the goal of the book, and so
far, from the feedback
I've gotten, that seems to be the target demographic
that really benefits
from it the most. So if that's
something that resonates with anyone, I would
suggest it. There's also, I should say, a free
chapter available online that
maybe I could send to you, and you can add it to the
show notes for people that want to check it out.
Yeah, for sure.
What would be a good website for people to go check out this book?
Oh, if you just go to mlpowered.com, you will have that book.
That's machine learning, mlpowered.com. And if you go to mlpowered.com/book, that's where the free chapter is.
Awesome.
We'll certainly link that in our show notes and we highly encourage our listeners to check it out.
I have a slightly different question about the book.
I'm actually a lot more curious about the process that you kind of went through to write the book itself.
And what I mean by that is there might be some engineers out there who have some ideas or want to put out a book.
How does one start?
In this case, O'Reilly is the publisher of the book.
For people who want to either find a publisher or just think about like,
hey, I have this idea, should this be a book?
And if it should be, what do I do next?
Can you talk a little bit about what that process looked like for you?
Oh, talk about a short, you know, one minute answer here that I can give you.
Okay.
So just to summarize, the questions were, how should you know if you can write a book?
How would you go about writing a book?
How would you know if it's a useful book to write?
Is that right?
Yes.
Yeah.
Easy peasy.
But I will say that O'Reilly is a great publisher,
and the simplest answer is they have a submission template. If you want to write a book,
you can fill it out, send it to them, and tell them,
I want to write a book about this.
And it has three or four pages of really good questions
that you definitely should ask yourself
before you write a book.
Who's my audience?
Like, why don't they know about this?
Is there another book?
Are there no resources?
If there are no resources, is it because no one's interested in it?
You know, kind of like, just kind of market sizing, but also like, why are you the right person to talk about this?
I think that's a great process, even if you don't end up submitting it to just kind of
working through it yourself.
The other thing I'll say is, for me, the way I got started is I started by writing blog posts.
And I recommend that to anyone that wants to write a book. Because writing a book is a lot of work.
It takes a very long time. People warned me that it would take a long time. And it took
much longer than I thought it would. Whereas writing a blog post is already hard, actually.
But it can sort of tell you, first, do you like this? If you hate writing a blog post, writing a
book might not be for you. But also, it can help you gauge your audience size. And I guess
that's the last point for me, which is I started by writing blog posts because I developed
strong opinions about machine learning from doing all these projects. And after a few
blog posts, I wrote a blog post that was extremely popular that
was around sort of how to do NLP projects. I think that blog post now has like half a million reads. And that particular blog post was the reason that O'Reilly reached out, because they were like, oh, you like to write, you're okay at it, and you have an audience, so it's probably a good idea for you to write a book. So I think that's why it's kind of a natural thing to do: try that for a while, see if you like it, and if you do, go forth with the book.
Oh, that's great advice. Thanks for sharing that. And I know we're getting towards the end, and there is one question which we ask almost everyone on the show. What is the last tool you discovered and really liked?
You know, what's really tragic is that I remember listening to the most recent episode of
your podcast,
hearing the question and thinking,
Hmm,
I should think of a great answer for this.
And then I forgot about it.
Oh man.
The last tool.
Oh man, the last tool. I'm going to cheat, because it's not the last tool I used, but it did come back up recently. It's something I'd used before, it's actually used in the book, but I hadn't thought about it for a while, and it came up at work because it happened to be something that really helped.
Great Expectations is a library that essentially tries to make tests
for machine learning models,
which is not an easy thing to do
because, again, it's not clear how you test them.
But I truly think that that's one of the promising areas
going forward.
And the reason I really liked it
is that it forces you to like,
you know, you build your machine learning model
and it's kind of like, you know, test-driven development advocates
would say it forces you to kind of maybe be a little bit more thoughtful.
You're like, wait a minute.
Like, what do I actually want to guarantee?
You know, like to give you an example, if it's a fraud,
a system that catches fraud, it's like,
what's an example of a transaction that's not fraud?
Like, how do you write that of like, here's an example of this transaction.'s not fraud like how do you write that of like here's an example of this transaction this transaction is a flawless transaction no fraud
you know and like kind of like working through that process is is both really useful and really
productive so for anyone that's thinking of doing machine learning or or you know does it already
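For anyone curious what those checks look like in code, here is a minimal sketch using the older pandas-flavored Great Expectations API. The column names and allowed values are invented for illustration, and newer versions of the library use a different API.

# Sketch: encode "what a legitimate transaction should look like" as explicit
# expectations on the input data, using the classic pandas-style GE API.
import pandas as pd
import great_expectations as ge

transactions = pd.DataFrame(
    {"amount": [12.50, 80.00, 3.99], "currency": ["usd", "eur", "usd"]}
)
dataset = ge.from_pandas(transactions)

dataset.expect_column_values_to_not_be_null("amount")
dataset.expect_column_values_to_be_between("amount", min_value=0)
dataset.expect_column_values_to_be_in_set("currency", ["usd", "eur", "gbp"])

results = dataset.validate()
print(results["success"])  # False if any expectation is violated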
So for anyone that's thinking of doing machine learning, or who does it already: Great Expectations is an awesome library.
Yeah. Well, you shared a lot in this conversation.
Is there anything else you would like to share with our listeners?
Yes.
My deepest thanks and appreciation.
This was awesome.
Thanks for hosting this podcast.
It's so fun.
Oh, yeah.
Thank you so much for taking the time.
Like, this was awesome. I have learned a lot talking to you on this show, and I'm sure many of our listeners will as well. Thank you so much for coming on the show, Manu. Really appreciate it.
My pleasure. Thanks, Ananyu.
Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com. You can also write to us at
hello at softwaremisadventures.com.
We would love to hear from you. Until next time, take care.