The Changelog: Software Development, Open Source - Estimating systems with napkin math (Interview)
Episode Date: September 11, 2020. We're joined by Simon Eskildsen, Principal Engineer at Shopify, talking about how he uses a concept called napkin math, where you use first-principles thinking to estimate systems without writing any code. By the end of the show we were estimating pretty much everything using napkin math.
Transcript
You know, if you can't do the napkin math, it's probably also too early to go and actually implement the system.
Like, I call this programming through the wall, when you just keep writing code and it's like, oh, I'm almost there, and then you just write code a little bit harder, right? When in a lot of cases, you just want to step back and think about the system and learn a little bit more about it. But I don't mean to say here that all tech problems can be solved by just sitting with, you know, a piece of paper and a pen and doing all of this, right? In a lot of cases, you just need more information from actually writing some code. And you can often get stuck in a rut of just analysis paralysis. But I think that napkin math plays a bigger role than, and could play a bigger role than it does now, for a lot of projects.
Bandwidth for Changelog is provided by Fastly.
Learn more at fastly.com.
We move fast and fix things here at Changelog
because of Rollbar.
Check them out at rollbar.com
and we're hosted on Linode cloud servers.
Head to linode.com slash changelog.
What up friends?
You might not be aware,
but we've been partnering with Linode since 2016.
That's a long time ago.
Way back when we first launched our open source platform that you now see at changelog.com,
Linode was there to help us, and we are so grateful.
Fast forward several years now, and Linode is still in our corner,
behind the scenes helping us to ensure we're running on the very best cloud
infrastructure out there. We trust Linode. They keep it fast and they keep it simple. Check them out at linode.com slash changelog.
Welcome back, everyone. This is the Changelog, a podcast featuring the hackers, the leaders, and the innovators in the world of software. I'm Adam Stachowiak, Editor-in-Chief here at Changelog. On today's show, we're talking with Simon Eskildsen, Principal Engineer at Shopify, about how he uses a concept called napkin math, where you use first-principles thinking to estimate systems without writing any code.

So we have Simon here, Principal Engineer at Shopify. Simon, welcome to the Changelog.
Thank you so much.

We're happy to have you. You're doing some interesting stuff in the world of learning and advancing as a developer, and you have some really cool napkin math stuff to tell us about.
But a lot of this has come out of your experience
working at Shopify through crazy amounts of growth.
Why don't you tell us your Shopify journey?
Yeah, sure.
So I joined Shopify in about 2013.
And back then we were still at around 100 or maybe 1,000 requests per second.
And now we're flying somewhere around 100,000 or more requests per second.
And I've been really, really fortunate to be part of that journey on the infrastructure side
and just seeing us through every level of that, migrating from our own on-prem to the cloud,
moving all of our shops between shards, sharding in the first place, running out of multiple regions, architecting for multiple continents, and a lot of the kind
of foundational architecture that underpins the Shopify that we run today.
And yeah, out of that, I have spent a lot of time having to learn about all of these
different things.
I don't come from an academic background.
I had some catching up to do, which meant, you know, trying to read the TCP book the
first time you encountered these kinds of problems.
I'm catching up as much as possible.
And I think that's a pretty healthy mindset to maintain for as long as possible.
Yeah.
Well, you may have overcompensated because now you're out there teaching other people these things, which is a cool shift.
Shopify is such a success story.
It's pretty amazing. I think, Adam, you and I were talking about it the other day.
I said maybe the most valuable company in Canada at this point
or one of the top ones by market cap.
And maybe the Ruby on Rails monolith's biggest success story
in pure capitalistic terms.
Just an amazing, amazing growth, amazing company.
One that we've been watching for many years
and it's probably been cool to be there on the inside
and putting out the fires as it grows.
One thing that you were a part of,
which we aren't going to talk about too much today,
but we're going to talk about a lot in an upcoming episode,
was a recent rewrite or re-implementation
of the storefront, no longer a monolith.
Do you want to just give us like a 30 seconds on that
so we can tease it up for a bigger
show later?

Absolutely. So we built a team last year to completely redesign how we do the storefront, that is, the serving of the stores that you see when you browse Shopify. We've learned a lot scaling that over the past 15 years or so. And fundamentally, it just has some different requirements for how it needs to be run: running across multiple data centers,
how it does caching, and all of these different things.
And Maxim will talk much more about all of these details
and why we didn't dare.
It's still Ruby, but it's now extracted out of the main monolith
that still powers the APIs and all the business logic.
That had to be it.
You seem like a learn-by-doing kind of person.
That's very much learn-by-doing.
It's like you've got to do something for many years, and technologies progress over time, and developers change the way they come into the market, being less experienced or more experienced, and now you're at a position where you do what you're doing by learning by doing. You seem very much like that, where you've been at Shopify a very, very long time, and it's part of how you think, it seems. From what I understand, you didn't go to college; you went to Shopify
and you've very much progressed there. So as Jared said, you're leading in many ways.
And so rather than having that academic background, you kind of have this background of
being in the trenches, so to speak, you know, having to read the book to get through it
rather than having taken a course to get your CS degree,
for example. Yeah. And I think what was really helpful for me was that when I was in high school,
I got exposed to this world of competitive programming, where every nation essentially
has a national team. And because Denmark, where I grew up is a very small country,
it's not super difficult to make it to the national team compared to somewhere like the
United States. But that really got me exposed to another level of engineering because before that,
I had mostly been exposed to JavaScript and Ruby on Rails and PHP. But seeing suddenly through
these algorithms and how something like Google might work by understanding a bit more of the
computer science was a fascinating journey. And then joining Shopify and getting to work on the systems that I'd seen and read about through high school was just a
dream come true.

Competitive programming, Jared. Have you heard of that before? Is this a first?
I've heard of it. I've never participated. I'd be afraid to do so. Tell us about it.
Yeah, sure. So essentially, what it is, is that it's a bit like an exam room, right? So you sit
down for five hours, you have a computer, and you have an editor,
and you have a C++ compiler. And that's pretty much it. On your table, you will find usually
about three problems. And the problems might be, they range a lot in what they might be. So a
problem, for example, that I encountered at one point was that you were told that on your machine, there are these five
directories. One directory has images that are of impressionist paintings. This one has images that
are cubism and so on. So you had all these directories. And then you had to write your
own program that with this little training data set would be able to take an image that it had
never seen before and classify it into one of these five categories.
This is an impressionist painting, this is a cubist painting, and so on.
So that might be a task, which is a very abstract one.
And the way that you might solve that is that you might look at the average color difference between adjacent pixels, because you can see that that is something that changes a lot based on the different kinds of paintings, right? If you have something like cubism, where you have these big areas that are yellow and green and so on, the average color difference is a lot smaller than in something more impressionistic, right?
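As a rough illustration of that contest heuristic, here's a minimal Python sketch; the toy arrays below are stand-ins I made up for real paintings, and a real solution would tune its threshold against the training directories.

```python
import numpy as np

def avg_neighbor_diff(img):
    # Mean absolute color difference between horizontally adjacent pixels.
    return np.abs(np.diff(img.astype(int), axis=1)).mean()

# Toy stand-ins: cubism-like flat color blocks vs. impressionist-like texture.
blocks = np.random.randint(0, 256, (8, 8, 3))
cubist = blocks.repeat(16, axis=0).repeat(16, axis=1)      # flat 128x128 areas
impressionist = np.random.randint(0, 256, (128, 128, 3))   # busy "brushwork"

print(avg_neighbor_diff(cubist))         # small: neighbors mostly identical
print(avg_neighbor_diff(impressionist))  # large: neighbors differ a lot
```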
So that's more of a free-form, ad hoc problem. But they might also be a lot more algorithmic in nature. So one, for example,
that I remember is that if you imagine kind of a grid, right, of intersections, imagine kind of a
Manhattan sort of situation where you have almost a perfect grid. And you then know that at each
intersection, there's a coffee shop. And each coffee shop has a Wi-Fi signal that has a certain radius, and it also has a certain bandwidth. And then it says: assuming that you can connect to multiple Wi-Fis at the same time and download on all of them, what is the best intersection for you to be in to get maximum bandwidth? So that's a more
algorithmic problem. So then what you will do is you write a solution to these, you upload your
C++ program to what's called a judge, which is the only thing you have access to, something running on the internal network. And then you get back a score somewhere between one and 100 points. You usually get a hundred points if you've solved the problem to perfection, but also very fast. So often you can solve these problems with a really simple but very slow algorithm and get some points, let's say 30 points, but then if you come up with a much faster algorithm, you might be able to get up to a hundred points. And so you just sit there for five grueling hours and try to work through these problems, trying to balance, you know,
should I spend more time on this problem? Do I have it? Do I not have it? And it's definitely
some of the tougher hours in my life sitting in these exam rooms, but also great fun.
This reminds me of, like, Survivor or Naked and Afraid for coders.

It's like, hope you weren't naked. You got the essentials. You're in the wilderness, quote-unquote the wilderness.

Get out.

Like an escape room.
So compare the pressure, the stress,
the sweat dripping down your brow of competitive programming versus
a typical Black Thursday at Shopify
where you're at peak
requests per second and maybe a server
goes down or something inevitably goes wrong?
Is that equal amounts of stress
or is that way more stress because of other people's money?
How do those compare out?
Well, it's very different, right?
Because on Black Friday...
Did I say Black Thursday?
You said Black Thursday.
Yeah, it's all good.
Because the Thursday is when you make the final changes, right?
And we have kind of a very elaborate plan on what risk you can take several weeks in.
You know, we're not going to upgrade the MySQL version the week of Black Friday.
In fact, we're probably not even going to do that in November because of the things that can surface.
Batten down the hatches and hold on for, yeah.
Yeah, pretty much.
Keep things the same.
Don't change things much.
But it's also a really good internal deadline, right? In October, to make sure that things get in, because suddenly everything multiplies, right? Everything multiplies. And it's really the final exam. I think it's more the kind of exam where you've built a lot and now it's a bit out of your hands. On Black Friday you can respond to things, but there's no more building allowed, right? So it's a very different kind of pressure, where on Black Friday you have to sit there and monitor and make sure everything is fine. We sit in this long room, well, we used to before we all went remote, but we would all sit in this room, and there's monitors all the way around showing our dashboards with how all the systems are performing. And we just sit there all day and explicitly do not try to do anything hard, because if something does come up, we all need to have maximum energy. We have an LTE router there in case something goes wrong, and we sit there and monitor, and we have teams doing that around the clock. And sometimes, you know, small things will happen and we'll see a little pimple in one of the graphs, and we'll all look in and make sure everything is okay. But it's a different kind of pressure, because in the competitive contest, right, you sit there and all of your training has come up to that point, and you cannot learn any more algorithms, no more data structures. You have to perform in that very moment. And Black Friday is a little bit the same, but you have to sit back a little bit and just trust that you've
done the work required.

How will this change then with being remote? What are your anticipations for November coming up, and the way the world is now, in terms of being distributed and not in the same room?

...a lot. So I think we've been prepared because of that. And what Black Friday is going to look like this year? Who knows, right? Maybe it's going to look a lot smaller than previous years, because the steady state has gone up so much. Maybe it's going to be a lot bigger, for opposite reasons. It's just very, very hard to predict. But we're, of course, preparing for the worst, or the best, depending on who you ask.

Right. So undoubtedly, you've come up with a lot of
different engineered approaches, tips and tricks,
and weird solutions.
Share maybe one or two of your exploits,
not exploits like zero days,
but the things you've done at Shopify over the years.
You've mentioned pre-show, you're doing some stuff with MySQL.
When you have systems at scale,
you have to do things that other engineers
don't have to think about because you hit up against
the edges of certain technologies, and surely you've done that over the years.

Yeah, I think maybe two come to mind through history here. One would be probably the biggest project that I did the earliest, which was our podding project. So essentially, we've done sharding, taking our database and splitting it into multiple, and that was done around the same time I joined, and not by me, but by other teams. But we wanted to extend that even further so that we could have
essentially these groups of shops that live together and are isolated together. And those
shops should be able to run in multiple data centers. Because before that, we would have one
data center, all of the shops would be active in that data center. If something went horribly wrong,
we would fail all of them over to the other data center. But we wanted to isolate these shops in a way where we could run them out of multiple data centers at the same time.
That was a lot of engineering effort to make sure that there's nothing relying on the fact that everything is in one data center at the time.
So that was one of the biggest projects that I did a few years ago with a really, really good team.
This year, one of the things that I'm most excited about that we've worked on was that a lot of my focus has been around capacity planning and resiliency. So essentially
finding out that when a system becomes slow, how do you make sure that it doesn't jam up the entire
system? It's a lot worse when a database is slow than if it's down, because it can clog up the
systems and cause queues in all these different places and cause much more cascading failures.
And one of the things that we've had great success with here is this technique called
load shedding. The idea behind load shedding is essentially that when a system is overloaded or
close to being overloaded, you want to start prioritizing what type of traffic you okay and send through to the system. So if we have a store that is having a lot of
malicious traffic or some kind of sophisticated
DDoS, we want to make sure that we start to drop that traffic before we start to drop
the traffic at other merchants to protect the platform.
So we've done a lot of work in that.
And we've done a lot of work at that at the edge so that the load balancer can prioritize
traffic to make sure that our merchants have as much uptime as possible.
But we wanted to go even further and start providing that at the database level. One of the things that to me is very disappointing about the database realm today
is that a lot of companies are SaaS companies, right? They're multi-tenant companies, and they run all these tenants on one or a few databases. But one of these tenants might have
a disproportionate impact on that database. They might have an API client
that is doing a lot more requests than anyone else. Or they might have a customer that has a
million orders because of some peculiarity in the way that they work. So you have all of this
cardinality and all of this uniqueness to the merchants or your customer. And that's not just
a Shopify problem. That's a SaaS problem. Because what you get with SaaS is these cost efficiencies of running on the same infrastructure, but I don't think the infrastructure has really caught up to that. So in a database today, you know, you don't create a schema for every tenant. MySQL would scream at you if you tried to do that when you have enough tenants, because it's just not made for that. So it doesn't really give you any primitives to be able to do that.
And by default, the way you design your database is really not set up for multi-tenancy at all. So to go back
maybe to this example of a single tenant overwhelming the system, MySQL or Postgres or
any, there's no database that has a good mechanism for prioritizing traffic between these merchants.
So what we have been looking at was that we found out that in the MySQL protocol,
you can send an arbitrary string back with the query. So we thought, what if along with the
query results, you know, this could be a bunch of customers, a bunch of orders, whatever it might be,
what if we also sent back to the application, how many resources that query took? How many
nanoseconds on the CPU, how much I/O time, how much memory
was loaded into the page cache in MySQL, how expensive was this query, the kinds of things
that you would see in a slow query log.
And so you might think, why is that useful?
Like you're going to look at that information.
Well, imagine an API throttle that is not some arbitrary number taken out of a hat of
you can do 100 requests per second, but rather the API throttle was actually
based on how the database is doing and how heavy the queries that your API calls are causing
actually are, right? Doing API throttling with something like GraphQL on an external API
is incredibly difficult to do correctly. And you're almost always going to either underestimate
the query complexity or overestimate it. But if we build systems that have multi-tenancy and databases that have multi-tenancy built in to that caliber where they can feed it back to the API throttling, that helps a lot.
But you can then also feed it to your load shedding mechanism.
So you can see, oh, this tenant is being really bad at the database, even though they're only doing very, very few queries.
So I think that's a really, really important thing for more databases to adopt,
and we've been working on a patch to MySQL to expose this.
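To make that idea concrete, here's a hypothetical sketch of what a cost-based throttle could look like on the application side. All names and numbers are invented for illustration; this is not Shopify's implementation, just the shape of the idea: each tenant gets a budget of database time, charged with whatever cost the database reports back per query.

```python
import time
from collections import defaultdict

BUDGET_MS_PER_SEC = 50.0  # assumed: each tenant may use 50 ms of DB time/sec

spent = defaultdict(float)      # tenant -> DB milliseconds used this window
window_start = time.monotonic()

def record_query_cost(tenant: str, reported_ms: float) -> None:
    """Charge a tenant the cost the database reported for their query."""
    global window_start
    if time.monotonic() - window_start >= 1.0:  # roll the 1-second window
        spent.clear()
        window_start = time.monotonic()
    spent[tenant] += reported_ms

def allow(tenant: str) -> bool:
    """Shed this tenant's next request once they're over budget."""
    return spent[tenant] < BUDGET_MS_PER_SEC

# After each query, feed back the cost the database returned, e.g.:
# record_query_cost("shop-123", cost_reported_by_mysql)
```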
That's interesting.
Do you have any observability problems, or is it Heisenberg's principle, where by the actual observing of the slow queries, if the response is like, here are the metrics around this query, and it comes back with the query, are you not adding load to the already slow query as well? Is it just meaningless?
Not really. In the benchmarks we looked at, the overhead is maybe one or two percent. It's really not bad at all. And that's a very constant factor, right? You're doing a little bit of bookkeeping to see how much time that thread
in MySQL is spending on CPU, but you're not adding anything significant. It's usually just one context
switch. So that's the kind of thing that has to happen upstream though. So are you running like
a fork of MySQL or are you trying to,
is this still experimental phase
or how's that shake out?
Yeah, so we're maintaining an internal fork.
This is not in production on all the shards yet.
There's a lot that you have to do in due diligence
before you roll out your own patch
with a bunch of C.
But this is something that we're starting
to roll out more heavily.
And then we want not just to expose this information
to upstream places
so that we can do data analysis on it in the warehouse and we can do the API throttling based on it.
But now we can also build a shedder, like a load shedder, inside of MySQL to prioritize traffic, so it chooses the queries that are most valuable rather than just the ones that are the first to overwhelm the system.

What's up friends? When was the last time you considered how much time your team is spending building and maintaining internal tooling? And I bet if you looked at the way your team spends its time, you're probably building and maintaining those tools way more often than you thought, and you probably shouldn't have to do that.
I mean, there is such a thing as Retool. Have you heard about Retool yet? Well, companies like DoorDash, Brex, Plaid, and even Amazon use Retool to build internal tools super fast. And the idea is that almost all internal tools look the same. They're made up of tables, dropdowns, buttons, text inputs, search, and all this is very similar. And Retool gives you a point, click, drag-and-drop interface that makes it super simple to build internal UIs like that in hours, not days. So stop wasting your time and use Retool.
Check them out at retool.com slash changelog.
Again, retool.com slash changelog.

So Simon, we talked about how you came into Shopify: no college degree, definitely education, but needing to learn a lot on the go. And you were so intentional and disciplined around this,
you came up with different methodologies for learning. And you built that into a system.
And in fact, the first time that we came across you, it was on the Superorganizers Substack. There's a really nice article out there called How to Make Yourself Into a Learning Machine,
which is all about you and the system that you came up with. And out of that comes a lot of stuff,
but most notably and most recently,
you have this idea of back-of-the-napkin math
or quick math for understanding systems from first principles,
which you're out there talking about.
It's really interesting and allows people to really quickly
and simply, perhaps simply, we'll talk about that,
figure out a thing about a system, like how it should be performing or how much it should
cost or how much throughput this should have without having to say, I'll get back to you,
right?
Or spend six hours crunching numbers.
So maybe start by telling us about your desire to learn in this intentional way that you
are learning and all the stuff that you're learning.
I mean, you're reading books and you're basically making sure you remember what you read, to simplify it.
But there's a lot of interesting things in the details
and then how napkin math came into the equation.
Yeah, sure.
So it's funny that we actually opened a bit
with competitive programming unintentionally
because that's where this practice comes from.
When you're doing competitive programming,
a lot of your time is spent implementing a solution. Doing a competition, you know, it's going to take you probably about 30 minutes to an hour, depending on the complexity of it.
There's a lot of off by ones.
There is not a lot of help from an editor or a linter or anything like that.
So you really have to know how well your program is going to perform beforehand.
How fast is it going to be and how many points is that going to grant you?
And fortunately, doing these competitions, it's a very controlled environment. So you know that if you only have to see n once, like an O(n) algorithm, then you're probably going to perform pretty well. If the input is, you know, 10,000, and you have an n-squared algorithm, you're starting to get in trouble doing something in less than a second. So there, the napkin math was really, really easy and it was very encouraged; anything you read about competitive programming is going to talk a lot about the strategizing of how many points that's going to get you. And I kind of left that behind a little bit when I went into Shopify. There wasn't really a lot of work where we would need that.
There's not a lot of algorithms in day-to-day programming for most programmers. But over time,
as more and more of my time has been spent reviewing how systems are
going to perform and doing tech reviews and designing systems more so than implementing
parts of them, I basically took up this practice again of you might find yourself in a meeting
and you have a conversation with the other people in the meeting and someone says,
well, maybe we could do this. And someone else says, well, that's probably too slow.
And then someone else says, well, why don't we try it, and then we'll meet in a week or two and see how it's doing. And then, you know, you go off from the meeting, and the person works on that for a week or two. You come back into the meeting room, and the person says, ah, it was too slow. And the person advocating for it in the first place says, what, you implemented it wrong, I'll come help you, give me a week or two and then we'll get back to it, right? And you can see how this story kind of unfolds, and then you spend a month or two going back and forth on this.
But I think with a little bit of practice, you can estimate the performance of systems ahead of time.
And you can start to develop some expectations about how the system should perform, right?
Is it reasonable to continue to have this written in Ruby or Python instead of C++?
Is it reasonable to use
this database for this kind of operation? Can we build this on top of MySQL or do
we fundamentally need a different data structure? I very very firmly believe
that you should be developing your understanding from the bottom up. For
example, right now I'm working on search, and I don't know anything about search. But the first thing I do is go start to learn how an inverted index works. How would I implement that? How does Lucene implement it? How do you do a top-k, like get the top k best documents for a query? What does that look like? How does it do that efficiently? How is it laid out on disk? What heuristics does it use? And then build up from there. Because then my question is not, oh, does Elasticsearch provide an API for this?
I think about, hey, fundamentally,
can an inverted index perform this operation?
What would it look like?
How long would it take?
How would it do in MySQL compared to here?
Oh, an inverted index is not just good
at doing full-text search.
It's also good at just merging arbitrary sets,
which then leads you to find other applications.
So that's something that I found really valuable is that you can now go into the meeting that I
described before and be like, hey, hang on, let's draw these scenarios and then do some
back of the envelope calculations. So an example might be that someone might say
scanning a gig of memory on every single request. That's way too slow. There's no way we can do that,
right? But then what you see is that if you sit down
and you write a program in C,
you allocate a bunch of memory,
and then you go through it and maybe add up the numbers
so the optimizer doesn't optimize it out,
you see that, whoa, you can actually read a gig of memory
in about 100 milliseconds.
So maybe that's not so crazy
if you can also do a little bit of caching, right?
So suddenly these things that weren't even solutions before become solutions, become
plausible, right?
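A minimal way to check that claim yourself, assuming NumPy is a fair stand-in for the C program Simon describes (a hand-tuned C loop will be somewhat faster, but it's the same order of magnitude):

```python
import time
import numpy as np

buf = np.ones(2**30, dtype=np.uint8)   # allocate ~1 GiB of memory

start = time.perf_counter()
total = int(buf.sum())                 # force a full scan of the gigabyte
elapsed = time.perf_counter() - start

print(f"sum={total}, scanned 1 GiB in {elapsed * 1000:.0f} ms "
      f"({1 / elapsed:.1f} GiB/s)")
```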
My favorite thing about this is that I run this newsletter called the Napkin Math Newsletter.
If you Google Napkin Math Newsletter and Simon, you should find it.
And essentially what this is, is that this is my kind of monthly exercise in napkin math.
So I pose myself these problems. A problem that I might pose to myself is: how many transactions can MySQL fundamentally do every single second, right?
Is it 1,000? Is it 10,000? And so I sat down and tried to construct kind of a simple model of how MySQL applies a transaction, right? So I start to, kind of from the bottom up, think about this. So then it's like, okay, you have to parse the SQL query. That's probably pretty fast. Then you have to sort of figure out what's in this insert; there's a bunch of data, also pretty fast. So how do we commit this so it's durable, right? There's this whole ACID guarantee that we have to hold up, that if the server shuts down, it either needs to be committed or not. So what does it need to do to do that? Well, it needs to take that transaction, that insert, and put it at the end of a file.
And then it needs to tell the file system, hey, commit this, send it to the hard drive,
and don't tell me that it's committed before you're sure that it's committed to the hard drive.
Right. Do what you said you're going to do.

Yeah, exactly. And that operation is called fsync.
So then the hypothesis forms, right?
The napkin hypothesis forms that, well, the number of transactions you should be able to do in MySQL every single second must be equal to how many fsyncs you can do per second, right?
Unless there's a bottleneck somewhere else.
Because that is the biggest number in a single transaction, right?
So you whittle it down to like, what does one look like?
And then you add up whatever the time is for one; in this case fsync outweighs every other thing, which pretty much rounds to zero. And so that's why you say the number of fsyncs is the limit, because it's just massively larger than any other time that there is. You know, you take that, apply it to like an HTTP request, say, well, the network time is massively bigger than any other thing. Just throwing that out there. So you're figuring out what it is for one, and in this case, it happens to be that fsync is pretty much what matters.
Exactly.
So yeah, you look at the numbers, you're like, how long does it take to send a query to the
database?
Oh, probably less than a couple hundred microseconds. How long does it take to parse the query? Well, that's like, you know, a couple hundred bytes. That's like less than five microseconds. I'm just throwing out some numbers here, but all of these numbers can be found on github.com/sirupsen/napkin-math.
And then you see that,
oh, there's an fsync operation here. And fsync is benchmarked at about one millisecond.
In the whole scheme of things,
that's actually a fairly long time.
And that seems to be the bottleneck for MySQL because the network and so on is typically not the bottleneck.
So, yeah, you take one millisecond and you divide it into a second and you see, OK, that's got to be a thousand transactions a second.
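Here's a rough sketch of how you could measure that fsync number on your own machine; the path and record size are arbitrary choices of mine, and the result depends entirely on the drive.

```python
import os
import time

# Append a small "transaction" and fsync after every write, like a
# naive durable database would. Note: don't run this on tmpfs.
fd = os.open("/tmp/fsync_test.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
record = b"x" * 128

n = 1000
start = time.perf_counter()
for _ in range(n):
    os.write(fd, record)
    os.fsync(fd)  # don't return until the data is on durable storage
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{n / elapsed:.0f} fsyncs/second (~{elapsed / n * 1000:.2f} ms each)")
```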
Right. And so what I did for this edition of the Napkin Math newsletter where I investigated this was that I went and I actually tried to do as many transactions per second as I could.
And I found that I could do about five to seven thousand transactions per second. That's way higher than the thousand that I'd estimated from first principles. So now we have what I call the first-principle gap, right? We've constructed a simple bottom-up understanding of how the system works, and we have a real result, and there's a gap between them; they don't line up. Sometimes, you know, if it's like twelve hundred and a thousand, it's probably fine, it's rounding errors. But this is a significant enough gap that there's something there. MySQL is probably cheating somewhere.

Yeah.

My understanding of the system must be wrong. And as it turns out, it was. So I started looking into it, I wrote some BPF probes to try to figure out what MySQL
was doing and reading some of the source code and some blog posts. And what it turns out that
MySQL does is that it does batching, right? If you have five transactions that come in in the
same millisecond, it's going to apply them as part of the same fsync. So instead of doing an fsync for every single transaction, it's better that it tries to group those commits together. And that's literally what it's called: a group commit.
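Putting napkin numbers on the batching (my arithmetic, not from the episode): if the drive sustains about 1,000 fsyncs per second and each group commit carries a handful of transactions, the gap closes.

```python
fsyncs_per_sec = 1_000    # ~1 ms per fsync, from the earlier estimate
tx_per_group_commit = 6   # assume ~6 transactions land in each batch

print(fsyncs_per_sec * tx_per_group_commit, "transactions/sec")
# 6000 -- right in the five-to-seven-thousand range that was measured
```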
There's lots of examples of these kinds of discrepancies. An example from another context that I really like was when Elon Musk wanted to build SpaceX. He went, I think, to the Russians, and he's like, how much does a rocket cost? They're like, $120 million. And he's like, that's ridiculous. And then he multiplies, you know, aluminum cost and titanium cost and so on, probably 10 tons of this and a hundred tons of that, with the spot prices on the London Metal Exchange, and says, okay, well, all the materials for a rocket cost $7 million. So that difference, that $113 million.
Yeah, what is that?
Between the raw costs.
What's in there?
That's inefficiencies, right?
We should be able to do better.
And in fact, he was able to do better.
Was he right about that?
Yeah, he was right.
He was right, yeah.
Gosh, that's smart.
Well, that's what happens when we assume, though, right? Like you mentioned, this sort of root-cause understanding of a system, you never do that. You assumed that, you know, one MySQL write was equal to one fsync, at least that's my understanding of what you're saying here. And so you went into this problem with an assumption that was incorrect, and once you learned more, which is good for a developer, to learn more about a system, you can then have more understanding of it and now go beyond just simply this limitation and start to understand that gap, as you mentioned, the first-principle gap.
Yeah.
And this was an example where my understanding didn't line up, but oftentimes the napkin
result is much better than the real world result, right?
So something I was doing in a recent newsletter was that I was trying to figure out how fast
we could serve a simple free text query. And Lucene, which is kind of the standard for doing free text searches
and inverted indexes. Lucene is about I think 20 times slower than my napkin math for this. So I
reach out to one of the maintainers and I'm like, can you explain this? Right? Is there an
opportunity here to optimize Lucene? Or is my understanding off, right? And I've found both
scenarios, right? Where, well, there's something we can optimize in the system or there's something
wrong with my understanding of it. But typically it's my understanding that's wrong, but sometimes
there's a very real inefficiency. Like someone has just written the code incorrectly or it's
not written in an optimum way. But going back to your original example here, you mentioned the
meeting, right? It sounded like you were kind of battling against this inefficiency of time.
You'd mentioned roughly a month being wasted or at least exploratory to discover this.
Whereas if you took, I don't know, those two engineers' time, you know, writing, investigating, arguing, taking lunches, whatever it took to come to this understanding of the system; how much is that time, you know, versus the time it takes to investigate and do some sort of napkin math example, to have an estimation, I suppose, of how it might perform?
Well, the napkin math can often be done in a few minutes, right?
Usually the bottleneck is not doing the napkin math.
The bottleneck is understanding the system, right?
Enough that you can model it out in napkin math.
So if we're trying to come up with a more concrete example, right?
It might be something like, let's say, for example, that we have a Redis in production. And the production team that runs this Redis instance says, okay, we've hit the max
throughput, this Redis is doing 10,000 requests per second, and we need to shard Redis, right?
And sharding Redis, that's a big undertaking, right? Now you have to change all the application
code to be on multiple Redises. If you're doing anything that operates on multiple keys, you have to make sure they're on the same Redis. It's a big undertaking, right? Now these developers have to spend months and months sharding this Redis. Well, the napkin math person, you know, the annoying person in the meeting with the napkin math hat on, might say, hang on, 10,000 requests per second? That's nothing. Machines are fast. And they might say, okay, well,
reading about 64 bytes of memory
takes about five nanoseconds, right?
If you divide these things together,
like theoretically, you should be able to do
hundreds of thousands of requests per second
per Redis instance.
So what's going on here?
Why are they reporting 10,000 requests per second
when the theoretical upper bound is hundreds of thousands of requests per second? Again, here's the discrepancy.
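The napkin version of that upper bound, with round numbers in the spirit of the repo; the per-request overhead is an assumption I'm adding to keep the estimate honest.

```python
read_64_bytes_s = 5e-9     # ~5 ns to read a 64-byte value from memory
per_request_s = 1e-5       # assume ~10 us of parsing/syscall work per GET

requests_per_sec = 1 / (read_64_bytes_s + per_request_s)
print(f"~{requests_per_sec:,.0f} requests/sec on one core")  # ~100,000
```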
Is it my understanding of the system that's wrong? Is Redis doing a lot more than just
reading memory and serving it over the network? Or is there an opportunity here? Is there
something that's wrong with the system? So in this case of Redis, something that I've
seen before is that Redis will get a lot slower if you have a lot of connections to it. So if you have a language that's not particularly fast,
and it's spending a lot of time talking to Redis, you might have thousands of servers that are connecting to that Redis, causing tens of thousands of connections. Now Redis is not spending all of its time serving all of these queries. It's spending a lot of time just in epoll calls,
basically figuring out which connection is active now. So then you might find, oh, maybe instead of doing all the sharding effort,
we can put a proxy in front of Redis, like Envoy or something else,
to reduce the number of connections on Redis.
And suddenly, we don't have to shard it.
We just have to put a proxy there.
And these developers might have just saved like three or four months of sharding work, right?
And all of the risk that taking something like that on entails, right?
Like now you have these keys on different servers
and you're almost certainly going to mess that up.
So that might be an example where napkin math
really helps guide your decisions
because it just questions like,
is this really the maximum throughput?
Well, you've done a lot of people a service
by putting the numbers out there on that repo
that you referenced.
You have things like the latency
and the throughput of system calls, hashing, context switching, TCP echo servers, all the things. Because that's where I'd probably get stuck: I would understand the system to a certain degree from first principles (I do want to ask where you start; we'll get back to that), but once I have that understanding, I'm like, I don't know how long this thing generally takes. And probably these are a Google or two away, but you can always find the one result that misleads you completely and ruins your napkin math. So it's pretty cool; people want to do this. We'll link out to the repo where a lot of these numbers are out there. There's also a lot of question marks, like how long does a mutex take, please contribute. So there's some places to contribute there.
But let's go back to that very first example.
Maybe the search example.
So he's like, I'm going to go read about these indexes and how they work.
Well, how did you know that search works that way?
How did you know that that's a place to start?
Because you've got to find the bottom to build up from there.
And sometimes that can be a big effort as well,
just knowing where do I look?
It's a really good question.

It's also similar to yak shaving.

Sort of. It can be.
You can really yak shave on this kind of investigation.
Right. Instead of using Lucene, you built your own Lucene in the process.
Yeah, exactly. But yeah, where do you know where to start?
Yeah, I think one thing that I definitely just want to point out before we go further
is that this napkin math is not my idea.
This is not an original idea at all, right? People have been doing this since the beginning of time. This is how we find out if a business is a good idea, right? We're sitting in a diner, writing on a napkin: if I sell this many widgets for this price, am I going to make money or not?

For sure.

And in kind of the computer science realm, Jeff Dean, who's, you know, the legend engineer at Google who stands behind a lot of the engineering that a lot of us build on top of, I think he had a slideshow where he sort of mentioned it as, like, oh, by the way, this is something that you might want to do, and then posted some numbers that have been going around. I decided to create my own because it's fun, like sitting there and disassembling things to make sure that it's as fast as it can get, and writing the Rust program to do that, but also because I was missing more than just the latency. I wanted the throughput, like how much can you process in a millisecond? How long does
it take to process a gigabyte? So to go back to your question of how do you develop this first
principle understanding? I think in a lot of cases, you can ask an expert, right? A lot of places
might have a, for example, if you're modeling something like MySQL, you're going to have in a
lot of cases, at least at larger companies,
someone at some DBA who can tell you how that B-tree is laid out on disk. And that's going to
be a really, really enlightening conversation for you because you can't do the napkin math
unless you understand the system. So for something like an inverted index, well, you can read about
how an inverted index works. There's a book on Lucene called Lucene in Action. And I essentially
just started reading that book. And then you sort of like develop just a stronger and stronger model of how this works.
You read kind of the, there's some documentation for Lucene and how it's implemented.
And then you start seeing, okay, well, like it's sort of implemented like this.
And so if you have, you know, you need to find a term that is mentioned in a million documents, and you also want to check that against another term that's also in a million documents, well, then you probably have to read one million document IDs plus another one million document IDs.
Each one of those identifiers is like, you know, let's say 64 bits.
So now you have like 2 million 64-bit integers.
And then you can roughly figure out how long is that going to take to read and join those
two together and do a search across both of them.
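Putting rough numbers on that read (mine, using a ballpark sequential memory bandwidth):

```python
ids = 2 * 1_000_000       # two posting lists of one million document IDs
total_bytes = ids * 8     # 64-bit IDs -> 16 MB to stream through
bandwidth = 10e9          # assume ~10 GB/s sequential memory reads

print(f"{total_bytes / 1e6:.0f} MB -> ~{total_bytes / bandwidth * 1e3:.1f} ms")
# 16 MB -> ~1.6 ms: intersecting two million-document terms is napkin-cheap
```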
And then you also get into things that's like,
oh, it's actually possible to read faster than the 64-bit integers,
because you learn randomly that Lucene actually does really good compression
and in a lot of cases can get these down to about 8 bits per document ID stored.
And this is counterintuitive to a lot of people that reading compressed data
in a lot of cases is actually faster than reading uncompressed
data because your machine is bottlenecked by the memory bandwidth that you can get. So between you
getting pages from memory, you have a lot of CPU cycles where you're not doing anything. So if you
can get more in each memory page and then decompress it, uncompress it in the spare cycles until you
get the next piece of memory, you can often read faster than otherwise. But of course, this gets into the nuances of like, now we're beyond the napkin math. Now we're no
longer concerned about just getting this right within an order of magnitude. Now we're really
trying to squeeze this out. But yeah, essentially, you just have to start reading the literature,
which is usually a good practice, I think. But yeah, you can end up in a yak shave, right? This whole yak shave of, like, reading a paper comparing different compression techniques for storing integers in something like an inverted index. Well, that's probably a yak shave that I didn't need to take, but it turned out to be really interesting.

Yeah, there's a nice side effect of knowledge, right? You're getting the knowledge as a side effect. Because there's two ways of going about it that I see. You're tasked with this thing: well, let's evaluate what search solution we're going to do, whether we write our own, use this, whatever it is.
Well, the first thing is, which I do
oftentimes, well, how long would it take
for me just to try it? Like a feasibility
kind of spike.
Well, I know that in your case with the meeting
you have a month lag time because you got
lunches and stuff. Apparently they're going to lunch
a lot, Adam. But, you know, maybe I can
do that in two weeks.
Well, he mentioned two weeks each and I figured they would do lunches
and talks.
What are these
developers doing?
Walks to like
vacations,
you know,
long weekends,
time at the cottage.
Yeah, exactly.
So I don't know
what they're doing
all that time.
But, you know,
sometimes you can
spike out a thing
in a couple hours
and get your answer.
But you don't have
the nice,
you have the answer
of is this feasible
or is this a good idea,
but you don't necessarily
have the side effect of, I still don't know how it works, I just know that it worked out in the math, right? But the napkin math way of going about it, if you don't understand the system from first principles, you can't really just grab a napkin and get your numbers. You've got to go get the knowledge, and maybe that takes two hours and then some, but you ended up with: now I understand how search works.

You know, if you can't do the napkin math, it's probably also too early to go and actually implement the system. I call this programming through the wall, when you just keep writing code and it's like, oh, I'm almost there, and then you just write code a little bit harder, right? When in a lot of cases you just want to step back and think about the system and learn a little bit more about it. But I don't mean to say here that all tech problems can be solved by just sitting with, like, you know, a piece of paper and a pen and doing all of this, right? In a lot of cases, you just need more information from actually writing some code. And you can often get stuck in a rut of just analysis paralysis. But I think that napkin math plays a bigger role than, and could play a bigger role than it does now, for a lot of projects.

Right. It's a tool for your toolbox.
What it seems like is you're encouraging this exploration though, so that you don't go and waste
the two weeks to go and implement an example and then two more on the argument or, you know,
in that scenario in particular, like you're encouraging one other option to take rather than
a Redis rewrite that
might take months and months and months on an assumption, when you could have just put Envoy in front of it, you know, proxy it, and solve the problem. To encourage that exploration, I think, is what's kind of key here. Knowing more about the system is always going to be a good thing. It may be a yak shave in some cases, or it may not be, but it's going to deepen your understanding, and you're encouraging exploration.

Totally. And I mean, that's also how I use the
newsletter, right? Is that there are these problems that are ruminating in my mind that
I'm very interested in. Like recently, I was interested in how do you synchronize data really
efficiently between a mobile client and a server? How do you do that like really, really well?
And so I just decided that I was going to make a napkin math problem about it, right?
And then just started thinking about how could this work and then diving out and adding complexity
as I found out that the simplest solutions wouldn't work.
And that exercise is, I think, is really, really valuable.
But it certainly takes time.
So is the way the newsletter works is you send out the problem and then you follow up
with the math solution or do you send it all out at once?
Like, is it interactive?
Do I get a chance to do my own math?
And then we reconvene with your answers?
Or how does that work?
Yeah, that's what I did for the first.
I've been writing a newsletter for about a year now.
And for the first maybe nine or so, I did exactly that.
I sent out like, hey, here's the problem.
Here's the scenario.
You know, your co-worker is saying the Redis is slow.
Is it really slow?
What's the theoretical max throughput?
Blah, blah, blah, blah.
But what I found was that a lot of people just didn't do it. There's a couple of
signals that just said that people just didn't do it. So people read it like a blog post,
but where the context was a month delayed. So I've switched format now to just make it more
of an article. But I really hope that people are doing this behind the scenes. And something that
I also didn't mention, but something where napkin math is incredibly useful is financial estimates,
right? Like, how much money is it going to take to store all of this data? How much money is it going to take to run this streaming processing job all the time? How much extra money is it going to take to run, you know, another 50 Redises? And I have all those numbers in the napkin math GitHub repository as well, kind of rounded to numbers that are easy to do some math with.
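For instance, a storage-cost estimate might look like this; the price is a rounded figure I'm assuming for illustration, so substitute your own provider's.

```python
data_tb = 200                # how much data we want to store
price_per_gb_month = 0.02    # assume ~2 cents per GB-month of object storage

monthly = data_tb * 1_000 * price_per_gb_month
print(f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")  # ~$4,000/month
```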
What's up friends? Are you looking for a way to instantly troubleshoot your applications and services running in production on Kubernetes? Well, Pixie gives you a magical API to get instant debug data, and the best part is this doesn't involve code changes and there are no manual UIs, and this all lives inside Kubernetes. Pixie is an API which lives inside your platform. It harvests all the data you need, and it exposes a bunch of interfaces that you can paint to get
the data that you need. And it's essentially like a decentralized Splunk. It's a programmable edge intelligence
platform, and it captures metrics, traces, logs, and events, all this without any code changes.
And the team behind Pixie is working hard to bring it to market for broad use by the end of 2020.
But guess what? Because you listened to this show, I'm here to tell you how you can get your
hands on the beta today by going to pixielabs.ai.
Links are in the show notes, so check them out to click through to the beta and their Slack community.
Once again, pixielabs.ai.
And look forward to a Pixie day coming soon.

All right, Simon.
So say that somebody is sold on the idea of adding this tool to their belt of tools they
can reach for when it's time to solve a problem or do a feasibility research.
And they're like, let's just do some quick napkin math.
But I've never done this before in the context of systems.
Maybe you've done it with your budget or some expenses or a business idea, but haven't done
it well.
And I don't really trust my ability
to come up with an answer
that I'm going to have much confidence in.
You have some techniques that you apply
and some tips for getting started.
Do you want to walk us through those?
Yeah, absolutely.
So these are in the GitHub repo as well.
So the first one is to not overcomplicate it.
We had this example before of a Redis instance, right?
And what are the things that actually matter, right?
Let's actually, let's take a database query instead, right? When you're committing a database
query to disk, the latency that's going to matter is committing the query to disk. Parsing the SQL
statement is not really going to matter. Like maybe you add on a couple hundred nanoseconds,
but in the grand scheme of things, it's just not going to matter. So just don't put those things
in there and just focus on
the biggest, slowest bottlenecks in the system that you're trying to model. So that will be the
first thing. And my kind of my rule of thumb is that if you have more than six assumptions, like
more than six additions in your napkin math, you probably need to simplify things a little bit.
That's usually a bad sign. The other thing too, is when I do napkin math, I usually try to keep
the units.
So this is things like, for example, the kilobytes or terabytes and things like that, just keeping those there, or, you know, terabytes per second or requests per second. Keeping the units is really, really handy, because then you make sure that you don't get a wrong number. So it's just kind of a checksumming. And often I don't actually do this on a
napkin. I just do it in Wolfram Alpha because Wolfram Alpha is very, very good at handling
units. It's very good at handling conversions between different units. So kilobytes to terabytes
the other way around. And so usually I just type in things with the units into Wolfram Alpha,
and then it gives me the right result. And if the units look weird, then I know that I did
something wrong. Plus it helps you conceptualize it better.
Like if you're thinking in megabytes and you type in megabytes, it just conceptually is right there versus having to do the conversion yourself and then having to convert it back
when you think about it.
Exactly.
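A toy example of carrying the units through (my numbers): if the units don't cancel to what you expect, the calculation is wrong somewhere.

```python
# How long to stream 500 GB at 100 MB/s? Keep the units in the names:
dataset_gb = 500
throughput_mb_per_s = 100

seconds = (dataset_gb * 1_000) / throughput_mb_per_s  # GB -> MB, MB/(MB/s) = s
print(f"{seconds:.0f} s (~{seconds / 60:.0f} minutes)")  # 5000 s, ~83 minutes
```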
And then the third one is to calculate with the exponents.
So often if you end up having something like, you know, 3.924 times 10 to the like eighth power or something,
like just lose everything after the decimal. It just doesn't matter in the grand scheme of things. With napkin math, you're just trying to get within an order of magnitude of the actual performance of the system. And as long as you're within that order of magnitude, you've probably done it roughly right. That's one that I always make sure of.
And this also means that it's just much easier to do
if you are doing it in a meeting room on a whiteboard
that you just have to multiply the coefficients and add the exponents together
instead of trying to do like multiplication of fractions
and things like that.
That's just not fun.
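A quick illustration of the exponent trick, with numbers I'm making up: multiply the coefficients, add the exponents, and round aggressively.

```python
# 4e8 items at 5e-9 seconds each: coefficients 4 * 5 = 20,
# exponents 10**(8 + -9) = 10**-1, so roughly 20 * 0.1 = 2 seconds.
print(4e8 * 5e-9)  # 2.0
```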
You're going to embarrass yourself in front of your coworkers.
You'll be umming a lot.
Yeah.
Got that one wrong.
You'll get your phone and pull up Wolfram Alpha, Alfred.
Nevermind.
You'll be, yeah, you'll be stuttering.
Because that's the beauty, like the reason why napkin math,
like when you're by yourself, you have a calculator available, right?
And you can write that out and put that in the GitHub repo.
It doesn't mean that you shouldn't try to still keep the units and not overcomplicate things,
because otherwise your co-workers are just going to approve the PR because it looks complicated, right?
And also just keep yourself to high school math, right?
You don't need anything fancy.
And then I think the last one that I have on the list here is arguably the most important,
which is to do what's called Fermi decomposition of the problem. And this sounds really fancy, but it's really not. It's just
decomposition with a fancy name. And the reason why it's called Fermi decomposition is because
there's this physics professor called Fermi, Enrico Fermi, I think is his full name. He was
Italian. He worked on the Manhattan Project. And he was a very revered physicist among
his coworkers because he had this knack for estimating things. So for example, when they did
the famous first detonation of an atom bomb, in New Mexico, he dropped a couple of shreds of paper
from the air. And based on how far they moved after the blast, he estimated to pretty good precision how strong that atom bomb was, which was very remarkable at the time.
Because actually doing the calculations for that is probably beyond any of our math skills and took them weeks to do.
But he had an estimate immediately. And you have to remember that this was at a time where people were literally afraid
that they were going to blow open the ozone layer because they just did not know how powerful this
was going to be. So the fact that they had that right there and then, and he did an estimate that
was so close, was remarkable at the time. He's very famous for this. And probably the most famous example of a Fermi decomposition is answering these kinds of questions. I'm imagining that he sort of went around the Manhattan Project and, over lunch, asked his co-workers these ridiculous questions. How many piano tuners are there in Chicago is the really famous one. And it's like, how are you going to answer that, right? Who cares? And how are you going to answer that?
You break it down. That's how you do it.
Yeah, exactly. So you break it down.
So you go like, okay, probably we should know roughly how many people there are in Chicago,
right?
And again, this is napkin math.
We just have to be within an order of magnitude, then it will all work out.
So there's maybe like, I don't know, 9, 10 million kind of in the metropolitan area of
Chicago.
So that's like an estimate that I think that most people could probably get there somewhere
between 5 and 10 million.
And then you think, well, okay, how many people are there per household? So maybe like two people
per household on average in that area. And then you start to think how many households might have
a piano. Do you guys have a piano?
No.
Do you have a piano?
No piano here. Well, I have a keyboard. Well, I guess it's a piano. It's not like a grand piano. It's more of like a keyboard piano.
Does it need a tuner?
Nope. Digital.
It's not a piano then.
So we might say maybe like one in 20 households have a piano, right?
I was going to go one in a hundred.
You were going to go one in a hundred.
Okay.
That's high.
Or low.
So we could go one in 20, one in a hundred, one in 50.
Yeah.
You know, I think there's a lot of homes with pianos where they just can't get rid of them
because getting rid of them is the worst.
And then we might estimate how often these things are tuned, right? You know, the estimate that I used when I was doing a presentation on this was about once a year. That seemed really high, that one in 20 households would tune it once a year; maybe it's more like once every few years. And then you might think, okay, so then we have to think about how much a piano tuner can do. So tuning a piano, probably including driving within the Chicago metropolitan area, would maybe take about two hours. And then we assume that a piano tuner works eight hours a day, maybe about 50 weeks a year, or however many weeks a year Americans work. This is in America, so 50 weeks a year.
Fair.
And then we can start to compose these numbers. And we say, okay, we look at how many pianos there are and so on.
And we say, there's probably about 200,000 piano tunings per year in Chicago.
How many can each piano tuner do?
So maybe about 1,000 if you use those numbers from before.
And then we get somewhere around 200 piano tuners in Chicago, right?
So that's the rough estimate.
And this technique is called Fermi decomposition.
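For anyone who wants to audit the arithmetic, here is the same decomposition as a minimal Python sketch; every input is one of the rough guesses from the conversation, deliberately only order-of-magnitude accurate:

```python
# Fermi decomposition: piano tuners in Chicago.
# Every number is a deliberate order-of-magnitude guess.

population = 9.5e6               # metro Chicago, somewhere in 5-10M
people_per_household = 2
piano_share = 1 / 20             # 1 in 20 households has a piano
tunings_per_piano_per_year = 1

households = population / people_per_household            # ~4.75M
pianos = households * piano_share                         # ~240K
tunings_per_year = pianos * tunings_per_piano_per_year    # ~240K

# Supply side: tunings one tuner can do in a year.
hours_per_tuning = 2             # including driving
hours_per_day = 8
days_per_week = 5
weeks_per_year = 50
tunings_per_tuner = (hours_per_day / hours_per_tuning) * days_per_week * weeks_per_year  # 1,000

print(round(tunings_per_year / tunings_per_tuner))  # ~238 tuners, same order as the ~200 above
```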
How many actual tuners are there?
And again, it's not meant to be on the money.
It's meant to be within an order of magnitude
because one thing I'd throw in there might be a curveball
is you assume that the supply is equal to demand.
Yes.
Right? Because there may be more people capable of tuning a piano, even though they may not do it professionally, and therefore the supply may actually be disproportionate to the demand.
Yeah.
Now you're going from napkin to somewhere else, right?
Right.
More granular.
Theoretical, yeah.
We might have done this napkin math because we wanted to figure out, is there any opportunity here, right?
This could be someone trying to do product market fit or whatever, right?
And they're looking in the phone book, and they're seeing, like, one piano tuner. They call him up and try to book with him to figure out how busy he is. And it's like, oh, this piano tuner can only be booked three months out. And they call someone else, and she says, oh yeah, I'm booked all year. And then this person is like, oh, there's a big opportunity here, because there's a mismatch between what's in the phone book and what I estimated. But yeah, and then you might do a little bit more analysis after that.
that. Which we're getting to a good point, which is, what's the point?
Right.
What question are you asking?
You know, what's the point of the napkin math?
What's the whole point?
Right.
It's not to get to an accurate number.
In particular, it's to determine a good next step, right?
Exactly.
Right.
It's to answer a different question, right?
The question you're asking is not the one you care about.
Because if you cared about that question, you would ask it in a much more granular way.
Like you would say, well, what about churches
and community centers?
They're likely to have pianos.
We should add those in, right?
But we don't care about the actual piano tuners.
We care about some other question we're trying to answer,
which is like, is there an opportunity in Chicago
to open up a piano tuning business?
The question that you're trying to answer
with napkin math is, is there something there, right? Exactly. And then, you know, I think about decisions in kind
of a decision tree, right? And you have these branches. And your job as a decision maker is
to figure out how far down these different branches you need to go. And to chop the ones
off that don't look fruitful as fast as you can, right? So an example of using this right might be something like you
receive your bill from your cloud provider, and the bill is $100,000. Right. And you're like,
it seems pretty high, right? Right. You're like deep into red, and you're trying to figure out
is this reasonable or not? Right. And so you might say, okay, you look at it, and you say,
I'm doing about 10,000 requests per second.
You're doing this on a whiteboard, right?
You're like in crisis mode because you're deep into red and you're doing this with your co-workers.
Okay, friends, we're doing 10,000 RPS.
Each one is 100 milliseconds, right?
So if this is single threaded, we divide those two numbers and we see that we need about a thousand CPUs to serve all of this traffic, if all of that is CPU time. Okay, so if we know that a CPU, one CPU costs $10 a month,
then we multiply 10 by a thousand,
and we know that to serve all this traffic
should be about $10,000 a month, right?
So then, now we have an estimate here, right? Our bill was 100,000 a month, and our main application should cost roughly 10,000. What's going on here? You know, we have this gap again, right? And so you might add in, like, database costs and so on, but they're just not adding up. And then you start going into it, and you find out that one of your co-workers left, you know, 200 machines running that they were training some machine learning model on, and they forgot to turn it off. And that happened. And then you have an
RCA and you figure out that you need to have something that monitors how many machines are
running, or whatever, right. But these are the kinds of things where again, the question you're
trying to answer is, is there something here, right? Or if these numbers added up to 70 or 80,000, it's like, okay, this must just be what it costs; we need to optimize it.
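That whiteboard calculation fits in a few lines. A sketch with the round numbers from the conversation, where the $10-per-CPU-month price is the assumed figure used above:

```python
# Sanity-checking a cloud bill from first principles.
requests_per_second = 10_000
cpu_seconds_per_request = 0.100   # 100 ms of CPU per request
cost_per_cpu_month = 10           # dollars; assumed round number

cpus_needed = requests_per_second * cpu_seconds_per_request   # 1,000 CPUs
expected_bill = cpus_needed * cost_per_cpu_month              # $10,000/month

actual_bill = 100_000
print(actual_bill / expected_bill)  # 10.0 -- a 10x gap, so something is off
```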
Yeah, that's really powerful stuff. I also think when you're doing feasibility, again,
if we go back to the opportunity to start a business or have a business that actually,
you know, the dog hunts, you know, you're comparing your potential revenue versus your
potential costs. And so the cost calculations, if you're going to be cloud-based, a lot of times
exactly what you're doing, you're estimating how much this is going to require us.
What are we going to be paying out a month to Amazon or to Microsoft or to Google?
And is that going to actually scale alongside the revenue that we come in?
So you just have your napkin math on the revenue side and your napkin math on the expense side, and start to make some decisions on: is this completely upside down?
Is it tight?
Is it obviously an opportunity?
And then once you start having those answers, then you can say, well, it's obviously an opportunity. Let's get more specific, right? Let's fill in those gaps, take out the napkin and put in the calculator, you know, the more specific spreadsheets, and drill in. But if it's completely backwards, let's not waste our time with the details; it's not going to work.
How many times have you done this, Simon?
I don't know. I've lost count.
Hundreds? Thousands? Give us an order. Just back of the napkin. Give me a napkin math underestimation.
Okay. I've been alive for this many years. You know, I don't write that much code anymore,
but then I would be like, how many PRs do I do a week? How many of them require napkin math,
right? But I really find that it's just useful when I'm reviewing code; I also think about this, right? It's not necessarily that I'm sitting down drawing something, right? But I'm like, okay, this person is getting this much
throughput on this, or they're doing these kinds of calls on this critical path. Like, is this
going to work or not, right? We talked about that MySQL extension earlier, right? Okay, it's doing this many syscalls. And we know, based on the napkin math, that a syscall might take about, let's see here, 500 to 1,000 nanoseconds, depending. We're doing this many, and we know how much overhead we can roughly introduce per query. And we say, okay, we need to reduce the number of system calls we have here, because we have a very, very tight budget, right? So we might look at things like that.
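A sketch of that budget check; the per-syscall cost is the napkin figure from the conversation, while the syscall count and the overhead budget are made-up numbers for illustration:

```python
# Syscall overhead budget for a query path.
syscall_cost_ns = 750         # napkin figure: roughly 500-1,000 ns each
syscalls_per_query = 40       # hypothetical count observed in the extension
budget_ns = 10_000            # hypothetical: 10 us of allowed overhead per query

overhead_ns = syscalls_per_query * syscall_cost_ns   # 30,000 ns = 30 us
print(overhead_ns > budget_ns)  # True -- over budget, so cut syscalls
```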
And over time, you also start to build an intuition, right? I'm sure that both of you have encountered people who have mastery over some domain, and they just look at what you're doing and say, like, yeah, this is not going to work. And you're like, what do you mean, it's not going to work? Yeah, they just know immediately, like, nope, doesn't make sense.
There's that famous story, right, of a firefighter who took
his team into a building. They were trying to, you know, get people out and so on, and I think they emptied the building. And then they were standing in the lobby of this building, and suddenly the guy who was in charge of this operation said, everyone out. And people were like, what do you mean, everyone out? But you follow orders in these kinds of circumstances, and they all ran out. And about a minute or two after they were out of the building, the floor collapsed.
So how did he know that, right?
Well, he built some kind of mastery, right?
And mastery is built by just deliberate practice over time.
And at some point, you might not even need to really reference these numbers anymore
because you start to have a pretty good feeling for what's fast and what's not fast.
The point of me asking that question was
to really get to how many times has this saved your bacon, so to speak? The reason why our audience
might care deeply about this or pick up this practice is because, you know, one, you're
introducing this idea to us, even though you're not the inventor of the idea. But the reason why
you do it has been because it's paid dividends in your career.
Yeah, I mean, it's hard to say, because I don't have the timeline in front of me or the parallel reality where I didn't have this,
right? It's hard to give a number for the biggest place where I end up applying it again and again. I would say I do this at least once a week, applied to something at work, right? Where it has some impact; what that impact is, it's not always clear. But the impact almost always is, hey, the simple solution is going to work, like, it's fast enough. Because engineers, if they have an idea for how to make something fast, they usually will build it, even if it takes longer, and they will justify why that's the best thing to do. But then you realize that reading, you know, a megabyte of memory, even on every single request, is probably not your bottleneck, right? That only takes 100 microseconds, which is not really that long if your requests are taking 100 milliseconds, right?
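The arithmetic behind that claim, with an assumed round figure of about 10 GB/s for sequential memory reads:

```python
# Is reading 1 MB of memory per request a bottleneck?
memory_read_bytes_per_s = 10e9   # assumed ~10 GB/s sequential read
bytes_read = 1e6                 # 1 MB per request

read_time_s = bytes_read / memory_read_bytes_per_s   # 1e-4 s = 100 us
request_time_s = 0.100                               # a 100 ms request

print(read_time_s / request_time_s)  # 0.001 -- 0.1% of the request, negligible
```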
So yeah, it's hard to answer the question directly.
Yeah.
You mentioned this newsletter you have.
Where can people subscribe to that
so they can follow along
as you do more of these napkin problems
and you share them?
Yeah, so you can go to,
it's linked from the GitHub repository,
github.com slash sirupsen slash napkin dash math.
You can go to my website, which is sirupsen.com slash napkin, and you can subscribe there.
If you Google just, you know, Simon napkin math, I think it will probably come up as well.
It's kind of a niche market.
Yeah.
And yeah, then every month you should receive some kind of deep dive.
And my coworkers joke with me because they know exactly what project I'm working on
based on the napkin math newsletter. So I'm very much doing this on things that I'm actually
working on, on real problems. So, you know, by next month I'll have done this at least 12 times, because these are real problems. And a lot of these I send to co-workers because they ask a question, and then I go deep on it in one of these newsletters. So it's very real.
Yeah.
Well, you got one new subscriber.
I'm subscribing right when we hang up this call.
One last question for you.
Do you have a specific brand of napkin that you suggest?
I actually do not own any napkins at all.
So I've never done this on a napkin.
It's terrible.
I do all of this on an iPad.
I thought maybe you were just working for Big Napkin. You're just out there shilling napkins.
Yeah.
I should have some napkins in the
background here. You should. Set this up a little bit
better. You should come up with a little
Simon branded napkin. You could sell
those on your website. That's true.
Come on, merchandise. That's true. You know, if I'd done that,
maybe I'd make some money after this aired.
Yeah. Because unfortunately, all of this is
free. I'm not earning a dime on this.
Huge missed opportunity.
Do you know anyone in Big Napkin?
We do now.
They're going to reach out to this and be like, we can sell you some napkins.
We can help you out here.
Strike up a product placement deal.
Perfect.
Can we sponsor the show in retrospect?
And it should be sort of the size of the airline napkin, like those really, really low-quality ones, because then you run out of room real fast. You have to grab another one.
See, now you're using two.
Exactly. You need constraints.
I bet you also the airlines, you know, they're desperate to make money right now. They'll sell you some of their napkins.
I think we should just do the math and see if it's going to be a business.
True. I like it. Awesome, Simon, thanks for sharing this cool stuff, this wisdom, and this exploration.
This desire for curiosity, I think, is pretty cool. What's interesting to me is that you encourage this exploration to see if there's actually something there worth doing more of or not, whether the original assumption was correct.
But links will be in the show notes.
Listeners, you know that.
So the repo and the newsletter, all that stuff, check your notes.
You will see it there.
Simon, thank you so much.
Thanks for having me on the show.
That's it for this episode of The Changelog.
Thank you for tuning in.
If you haven't heard yet, we have launched Changelog++.
It is our membership program that lets you get closer to the metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show
and our other podcasts here on changelog.com.
And if you've never been to changelog.com,
you should go there now.
Again, join Changelog++ to directly support our work
and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners
who get it, Fastly, Linode, and Rollbar. Also, thanks to Breakmaster Cylinder for making all
of our beats. And thank you to you for listening. We appreciate you. That's it for this week.
We'll see you next week. Thank you.