The Changelog: Software Development, Open Source - Estimating systems with napkin math (Interview)
Episode Date: September 11, 2020. We're joined by Simon Eskildsen, Principal Engineer at Shopify, talking about how he uses a concept called napkin math, where you use first-principles thinking to estimate systems without writing any code. By the end of the show we were estimating pretty much everything using napkin math.
Transcript
You know, if you can't do the napkin math, it's probably also too early to go and actually implement the system.
Like, I call this programming through the wall, when you just keep writing code and it's like, oh, I'm almost there, and then you just write code a little bit harder, right? When in a lot of cases, you just want to step back and think about the system and learn a little bit more about it. But I don't mean to say here that all tech problems can be solved by just sitting with, you know, a piece of paper and a pen and doing all of this, right? In a lot of cases, you just need more information from actually writing some code. And you can often get stuck in a rut of just analysis paralysis. But I think that napkin math plays a bigger role than, and could play a bigger role than it does now, for a lot of projects.
Bandwidth for Changelog is provided by Fastly.
Learn more at fastly.com.
We move fast and fix things here at Changelog
because of Rollbar.
Check them out at rollbar.com
and we're hosted on Linode cloud servers.
Head to linode.com slash changelog.
What up friends?
You might not be aware,
but we've been partnering with Linode since 2016.
That's a long time ago.
Way back when we first launched our open source platform that you now see at changelog.com,
Linode was there to help us, and we are so grateful.
Fast forward several years now, and Linode is still in our corner,
behind the scenes helping us to ensure we're running on the very best cloud
infrastructure out there. We trust Linode. They keep it fast and they keep it simple. Check them out at linode.com slash changelog.
Welcome back, everyone. This is the Changelog, a podcast featuring the hackers, the leaders, and the innovators in the world of software. I'm Adam Stachowiak, Editor-in-Chief here at Changelog. On today's show, we're talking with Simon Eskildsen, Principal Engineer at Shopify, about how he uses a concept called napkin math, where you use first-principles thinking to estimate systems without writing any code.

So we have Simon here, Principal Engineer at Shopify. Simon, welcome to the Changelog.
Thank you so much.

We're happy to have you. You're doing some interesting stuff in the world of learning and advancing as a developer, and you have some really cool napkin math stuff to tell us about.
But a lot of this has come out of your experience
working at Shopify through crazy amounts of growth.
Why don't you tell us your Shopify journey?
Yeah, sure.
So I joined Shopify in about 2013.
And back then we were still at around 100 or maybe 1,000 requests per second.
And now we're flying somewhere around 100,000 or more requests per second.
And I've been really, really fortunate to be part of that journey on the infrastructure side
and just seeing us through every level of that, migrating from our own on-prem to the cloud,
moving all of our shops between shards, sharding in the first place, running out of multiple regions, architecting for multiple continents, and a lot of the kind
of foundational architecture that underpins the Shopify that we run today.
And yeah, out of that, I have spent a lot of time having to learn about all of these
different things.
I don't come from an academic background.
I had some catching up to do, which meant, you know, trying to read the TCP book the
first time you encountered these kinds of problems.
I'm catching up as much as possible.
And I think that's a pretty healthy mindset to maintain for as long as possible.
Yeah.
Well, you may have overcompensated because now you're out there teaching other people these things, which is a cool shift.
Shopify is such a success story.
It's pretty amazing. I think, Adam, you and I were talking about it the other day.
I said maybe the most valuable company in Canada at this point
or one of the top ones by market cap.
And maybe the Ruby on Rails monolith's biggest success story
in pure capitalistic terms.
Just an amazing, amazing growth, amazing company.
One that we've been watching for many years
and it's probably been cool to be there on the inside
and putting out the fires as it grows.
One thing that you were a part of,
which we aren't going to talk about too much today,
but we're going to talk about a lot in an upcoming episode,
was a recent rewrite or re-implementation
of the storefront, no longer a monolith.
Do you want to just give us like a 30 seconds on that
so we can tease it up for a bigger
show later?

Absolutely. So we built a team last year to completely redesign how we do the storefront, that is, the serving of the stores that you see when you browse Shopify. We've learned a lot scaling that over the past 15 years or so. And fundamentally, it just has some different requirements for how it needs to be run: running across multiple data centers,
how it does caching, and all of these different things.
And Maxim will talk much more about all of these details
and why we didn't dare.
It's still Ruby, but it's now extracted out of the main monolith
that still powers the APIs and all the business logic.
That had to be it.
You seem like a learn-by-doing kind of person.
That's very much learn-by-doing.
It's like you've got to do something for many years, and technologies progress over time, and developers change the way they come into the market, being less experienced or more experienced, and now you're at a position where you do what you're doing by learning by doing. You seem very much like that, where you've been at Shopify a very, very long time, and it's part of how you think, it seems. From what I understand, you didn't go to college; you went to Shopify
and you've very much progressed there. So as Jared said, you're leading in many ways.
And so rather than having that academic background, you kind of have this background of
being in the trenches, so to speak, you know, having to read the book to get through it
rather than having taken a course to get your CS degree,
for example. Yeah. And I think what was really helpful for me was that when I was in high school,
I got exposed to this world of competitive programming, where every nation essentially
has a national team. And because Denmark, where I grew up is a very small country,
it's not super difficult to make it to the national team compared to somewhere like the
United States. But that really got me exposed to another level of engineering because before that,
I had mostly been exposed to JavaScript and Ruby on Rails and PHP. But seeing suddenly through
these algorithms and how something like Google might work by understanding a bit more of the
computer science was a fascinating journey. And then joining Shopify and getting to work on the systems that I'd seen and read about through high school was just a
dream come true.

Competitive programming, Jared. Have you heard of that before? Is this a first?
I've heard of it. I've never participated. I'd be afraid to do so. Tell us about it.
Yeah, sure. So essentially, what it is, is that it's a bit like an exam room, right? So you sit
down for five hours, you have a computer, and you have an editor,
and you have a C++ compiler. And that's pretty much it. On your table, you will find usually
about three problems. And the problems might be, they range a lot in what they might be. So a
problem, for example, that I encountered at one point was that you were told that on your machine, there are these five
directories. One directory has images that are of impressionist paintings. This one has images that
are cubism and so on. So you had all these directories. And then you had to write your
own program that with this little training data set would be able to take an image that it had
never seen before and classify it into one of these five categories.
This is an impressionist painting, this is a cubist painting, and so on.
So that might be a task, which is a very abstract one.
And the way that you might solve that is that you might look at the average color difference between adjacent pixels, because you can see that that is something that changes a lot based on the different kinds of paintings, right? If you have something like cubism, where you have these big areas that are yellow and green and so on, the average color difference is a lot smaller than in something more impressionistic, right?
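As a rough illustration of that contest heuristic, here's a minimal Python sketch; the toy arrays below are stand-ins I made up for real paintings, and a real solution would tune its threshold against the training directories.

```python
import numpy as np

def avg_neighbor_diff(img):
    # Mean absolute color difference between horizontally adjacent pixels.
    return np.abs(np.diff(img.astype(int), axis=1)).mean()

# Toy stand-ins: cubism-like flat color blocks vs. impressionist-like texture.
blocks = np.random.randint(0, 256, (8, 8, 3))
cubist = blocks.repeat(16, axis=0).repeat(16, axis=1)      # flat 128x128 areas
impressionist = np.random.randint(0, 256, (128, 128, 3))   # busy "brushwork"

print(avg_neighbor_diff(cubist))         # small: neighbors mostly identical
print(avg_neighbor_diff(impressionist))  # large: neighbors differ a lot
```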
So that's more of a free-form, ad hoc problem. But they might also be a lot more algorithmic in nature. So one, for example,
that I remember is that if you imagine kind of a grid, right, of intersections, imagine kind of a
Manhattan sort of situation where you have almost a perfect grid. And you then know that at each
intersection, there's a coffee shop. And each coffee shop has a Wi-Fi signal that has a certain radius, and it also has a certain bandwidth. And then it says: assuming that you can connect to multiple Wi-Fis at the same time and download on all of them, what is the best intersection for you to be in to get maximum bandwidth? So that's a more
algorithmic problem. So then what you will do is you write a solution to these, you upload your
C++ program to what's called a judge, which is the only thing you have access to, something running on the internal network. And then you get back a score somewhere between one and 100 points. You usually get a hundred points if you've solved the problem to perfection, but also very fast. So often you can solve these problems with a really simple but very slow algorithm and get some points, let's say 30 points, but then if you come up with a much faster algorithm, you might be able to get up to a hundred points. And so you just sit there for five grueling hours and try to work through these problems, trying to balance, you know,
should I spend more time on this problem? Do I have it? Do I not have it? And it's definitely
some of the tougher hours in my life sitting in these exam rooms, but also great fun.
This reminds me of, like, Survivor or Naked and Afraid for coders.

It's like, hope you weren't naked. You got the essentials. You're in the wilderness, quote-unquote the wilderness.

Get out.

Like an escape room.
So compare the pressure, the stress,
the sweat dripping down your brow of competitive programming versus
a typical Black Thursday at Shopify
where you're at peak
requests per second and maybe a server
goes down or something inevitably goes wrong?
Is that equal amounts of stress
or is that way more stress because of other people's money?
How do those compare out?
Well, it's very different, right?
Because on Black Friday...
Did I say Black Thursday?
You said Black Thursday.
Yeah, it's all good.
Because the Thursday is when you make the final changes, right?
And we have kind of a very elaborate plan on what risk you can take several weeks in.
You know, we're not going to upgrade the MySQL version the week of Black Friday.
In fact, we're probably not even going to do that in November because of the things that can surface.
Batten down the hatches and hold on for, yeah.
Yeah, pretty much.
Keep things the same.
Don't change things much.
But it's also a really good internal deadline, right? In October, to make sure that things get in, because suddenly everything multiplies, right? Everything multiplies. And it's really the final exam. I think it's more the kind of exam where you've built a lot and now it's a bit out of your hands. On Black Friday you can respond to things, but there's no more building allowed, right? So it's a very different kind of pressure, where on Black Friday you have to sit there and monitor and make sure everything is fine. We sit in this long room, well, we used to before we all went remote, but we would all sit in this room, and there's monitors all the way around showing our dashboards with how all the systems are performing. And we just sit there all day and explicitly do not try to do anything hard, because if something does come up, we all need to have maximum energy. We have an LTE router there in case something goes wrong, and we sit there and monitor, and we have teams doing that around the clock. And sometimes, you know, small things will happen and we'll see a little pimple in one of the graphs, and we'll all look in and make sure everything is okay. But it's a different kind of pressure, because in the competitive contest, right, you sit there and all of your training has come up to that point, and you cannot learn any more algorithms, no more data structures. You have to perform in that very moment. And Black Friday is a little bit the same, but you have to sit back a little bit and just trust that you've
done the work required.

How will this change then with being remote? What are your anticipations for November coming up, and the way the world is now, in terms of being distributed and not in the same room?

...a lot. So I think we've been prepared because of that. And what Black Friday is going to look like this year? Who knows, right? Maybe it's going to look a lot smaller than previous years, because the steady state has gone up so much. Maybe it's going to be a lot bigger, for opposite reasons. It's just very, very hard to predict. But we're, of course, preparing for the worst, or the best, depending on who you ask.

Right. So undoubtedly, you've come up with a lot of
different engineered approaches, tips and tricks,
and weird solutions.
Share maybe one or two of your exploits,
not exploits like zero days,
but the things you've done at Shopify over the years.
You've mentioned pre-show, you're doing some stuff with MySQL.
When you have systems at scale,
you have to do things that other engineers
don't have to think about because you hit up against
the edges of certain technologies, and surely you've done that over the years.

Yeah, I think maybe two come to mind through history here. One would be probably the biggest project that I did the earliest, which was our podding project. So essentially, we've done sharding, taking our database and splitting it into multiple, and that was done around the same time I joined, and not by me, but by other teams. But we wanted to extend that even further so that we could have
essentially these groups of shops that live together and are isolated together. And those
shops should be able to run in multiple data centers. Because before that, we would have one
data center, all of the shops would be active in that data center. If something went horribly wrong,
we would fail all of them over to the other data center. But we wanted to isolate these shops in a way where we could run them out of multiple data centers at the same time.
That was a lot of engineering effort to make sure that there's nothing relying on the fact that everything is in one data center at the time.
So that was one of the biggest projects that I did a few years ago with a really, really good team.
This year, one of the things that I'm most excited about that we've worked on was that a lot of my focus has been around capacity planning and resiliency. So essentially
finding out that when a system becomes slow, how do you make sure that it doesn't jam up the entire
system? It's a lot worse when a database is slow than if it's down, because it can clog up the
systems and cause queues in all these different places and cause much more cascading failures.
And one of the things that we've had great success with here is this technique called
load shedding. The idea behind load shedding is essentially that when a system is overloaded or
close to being overloaded, you want to start prioritizing what type of traffic you okay and send through to the system. So if we have a store that is having a lot of
malicious traffic or some kind of sophisticated
DDoS, we want to make sure that we start to drop that traffic before we start to drop
the traffic at other merchants to protect the platform.
So we've done a lot of work in that.
And we've done a lot of work at that at the edge so that the load balancer can prioritize
traffic to make sure that our merchants have as much uptime as possible.
But we wanted to go even further and start providing that at the database level. One of the things that to me is very disappointing about the database realm today
is that a lot of companies are SaaS companies, right? They're multi-tenant companies, and they run all these tenants on one or a few databases. But one of these tenants might have
a disproportionate impact on that database. They might have an API client
that is doing a lot more requests than anyone else. Or they might have a customer that has a
million orders because of some peculiarity in the way that they work. So you have all of this
cardinality and all of this uniqueness to the merchants or your customer. And that's not just
a Shopify problem. That's a SaaS problem. Because what you get with SaaS is these cost efficiencies of running on the same infrastructure, but I don't think the infrastructure has really caught up to that. So in a database today, you know, you don't create a schema for every tenant. MySQL would scream at you if you tried to do that when you have enough tenants, because it's just not made for that. So it doesn't really give you any primitives to be able to do that.
And by default, the way you design your database is really not set up for multi-tenancy at all. So to go back
maybe to this example of a single tenant overwhelming the system, MySQL or Postgres or
any, there's no database that has a good mechanism for prioritizing traffic between these merchants.
So what we have been looking at was that we found out that in the MySQL protocol,
you can send an arbitrary string back with the query. So we thought, what if along with the
query results, you know, this could be a bunch of customers, a bunch of orders, whatever it might be,
what if we also sent back to the application, how many resources that query took? How many
nanoseconds on the CPU, how much I/O time, how much memory
was loaded into the page cache in MySQL, how expensive was this query, the kinds of things
that you would see in a slow query log.
And so you might think, why is that useful?
Like you're going to look at that information.
Well, imagine an API throttle that is not some arbitrary number taken out of a hat of
you can do 100 requests per second, but rather the API throttle was actually
based on how the database is doing and how heavy the queries that your API calls are causing
actually are, right? Doing API throttling with something like GraphQL on an external API
is incredibly difficult to do correctly. And you're almost always going to either underestimate
the query complexity or overestimate it. But if we build systems that have multi-tenancy and databases that have multi-tenancy built in to that caliber where they can feed it back to the API throttling, that helps a lot.
But you can then also feed it to your load shedding mechanism.
So you can see, oh, this tenant is being really bad at the database, even though they're only doing very, very few queries.
So I think that's a really, really important thing for more databases to adopt,
and we've been working on a patch to MySQL to expose this.
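To make that idea concrete, here's a hypothetical sketch of what a cost-based throttle could look like on the application side. All names and numbers are invented for illustration; this is not Shopify's implementation, just the shape of the idea: each tenant gets a budget of database time, charged with whatever cost the database reports back per query.

```python
import time
from collections import defaultdict

BUDGET_MS_PER_SEC = 50.0  # assumed: each tenant may use 50 ms of DB time/sec

spent = defaultdict(float)      # tenant -> DB milliseconds used this window
window_start = time.monotonic()

def record_query_cost(tenant: str, reported_ms: float) -> None:
    """Charge a tenant the cost the database reported for their query."""
    global window_start
    if time.monotonic() - window_start >= 1.0:  # roll the 1-second window
        spent.clear()
        window_start = time.monotonic()
    spent[tenant] += reported_ms

def allow(tenant: str) -> bool:
    """Shed this tenant's next request once they're over budget."""
    return spent[tenant] < BUDGET_MS_PER_SEC

# After each query, feed back the cost the database returned, e.g.:
# record_query_cost("shop-123", cost_reported_by_mysql)
```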
That's interesting.
Do you have any observability problems, or is it Heisenberg's principle, where by the actual observing of the slow queries, if the response is like, here are the metrics around this query, and it comes back with the query, are you not adding load to the already slow query as well? Is it just meaningless?
Not really. In the benchmarks we looked at, the overhead is maybe one or two percent. It's really not bad at all. And that's a very constant factor, right? You're doing a little bit of bookkeeping to see how much time that thread
in MySQL is spending on CPU, but you're not adding anything significant. It's usually just one context
switch. So that's the kind of thing that has to happen upstream though. So are you running like
a fork of MySQL or are you trying to,
is this still experimental phase
or how's that shake out?
Yeah, so we're maintaining an internal fork.
This is not in production on all the shards yet.
There's a lot that you have to do in due diligence
before you roll out your own patch
with a bunch of C.
But this is something that we're starting
to roll out more heavily.
And then we want not just to expose this information
to upstream places
so that we can do data analysis on it in the warehouse and we can do the API throttling based on it.
But now we can also build a shedder, like a load shedder, inside of MySQL to prioritize traffic, so it chooses the queries that are most valuable rather than just the ones that are the first to overwhelm the system.

What's up friends? When was the last time you considered how much time your team is spending building and maintaining internal tooling? And I bet if you looked at the way your team spends its time, you're probably building and maintaining those tools way more often than you thought, and you probably shouldn't have to do that.
I mean, there is such a thing as Retool. Have you heard about Retool yet? Well, companies like DoorDash, Brex, Plaid, and even Amazon use Retool to build internal tools super fast. And the idea is that almost all internal tools look the same. They're made up of tables, dropdowns, buttons, text inputs, search, and all this is very similar. And Retool gives you a point, click, drag-and-drop interface that makes it super simple to build internal UIs like that in hours, not days. So stop wasting your time and use Retool.
Check them out at retool.com slash changelog.
Again, retool.com slash changelog.

So Simon, we talked about how you came into Shopify: no college degree, definitely education, but needing to learn a lot on the go. And you were so intentional and disciplined around this,
you came up with different methodologies for learning. And you built that into a system.
And in fact, the first time that we came across you, it was on the Superorganizers Substack. There's a really nice article out there called How to Make Yourself Into a Learning Machine,
which is all about you and the system that you came up with. And out of that comes a lot of stuff,
but most notably and most recently,
you have this idea of back-of-the-napkin math
or quick math for understanding systems from first principles,
which you're out there talking about.
It's really interesting and allows people to really quickly
and simply, perhaps simply, we'll talk about that,
figure out a thing about a system, like how it should be performing or how much it should
cost or how much throughput this should have without having to say, I'll get back to you,
right?
Or spend six hours crunching numbers.
So maybe start by telling us about your desire to learn in this intentional way that you
are learning and all the stuff that you're learning.
I mean, you're reading books and you're basically making sure you remember what you read, to simplify it.
But there's a lot of interesting things in the details
and then how napkin math came into the equation.
Yeah, sure.
So it's funny that we actually opened a bit
with competitive programming unintentionally
because that's where this practice comes from.
When you're doing competitive programming,
a lot of your time is spent implementing a solution. Doing a competition, you know, it's going to take you probably about 30 minutes to an hour, depending on the complexity of it.
There's a lot of off by ones.
There is not a lot of help from an editor or a linter or anything like that.
So you really have to know how well your program is going to perform beforehand.
How fast is it going to be and how many points is that going to grant you?
And fortunately, doing these competitions, it's a very controlled environment. So you know that if you only have to see n once, like an O(n) algorithm, then you're probably going to perform pretty well. If the input is, you know, 10,000, and you have an n-squared algorithm, you're starting to get in trouble doing something in less than a second. So there, the napkin math was really, really easy and it was very encouraged; anything you read about competitive programming is going to talk a lot about the strategizing of how many points that's going to get you. And I kind of left that behind a little bit when I went into Shopify. There wasn't really a lot of work where we would need that.
There's not a lot of algorithms in day-to-day programming for most programmers. But over time,
as more and more of my time has been spent reviewing how systems are
going to perform and doing tech reviews and designing systems more so than implementing
parts of them, I basically took up this practice again of you might find yourself in a meeting
and you have a conversation with the other people in the meeting and someone says,
well, maybe we could do this. And someone else says, well, that's probably too slow.
And then someone else says, well, why don't we try it, and then we'll meet in a week or two and see how it's doing. And then, you know, you go off from the meeting, and the person works on that for a week or two. You come back into the meeting room, and the person says, ah, it was too slow. And the person advocating for it in the first place says, what, you implemented it wrong, I'll come help you, give me a week or two and then we'll get back to it, right? And you can see how this story kind of unfolds, and then you spend a month or two going back and forth on this.
But I think with a little bit of practice, you can estimate the performance of systems ahead of time.
And you can start to develop some expectations about how the system should perform, right?
Is it reasonable to continue to have this written in Ruby or Python instead of C++?
Is it reasonable to use
this database for this kind of operation? Can we build this on top of MySQL or do
we fundamentally need a different data structure? I very very firmly believe
that you should be developing your understanding from the bottom up. For
example, right now I'm working on search, and I don't know anything about search. But the first thing I do is go start to learn how an inverted index works. How would I implement that? How does Lucene implement it? How do you do a top-k, like get the top k best documents for a query? What does that look like? How does it do that efficiently? How is it laid out on disk? What heuristics does it use? And then build up from there. Because then my question is not, oh, does Elasticsearch provide an API for this?
I think about, hey, fundamentally,
can an inverted index perform this operation?
What would it look like?
How long would it take?
How would it do in MySQL compared to here?
Oh, an inverted index is not just good
at doing full-text search.
It's also good at just merging arbitrary sets,
which then leads you to find other applications.
So that's something that I found really valuable is that you can now go into the meeting that I
described before and be like, hey, hang on, let's draw these scenarios and then do some
back of the envelope calculations. So an example might be that someone might say
scanning a gig of memory on every single request. That's way too slow. There's no way we can do that,
right? But then what you see is that if you sit down
and you write a program in C,
you allocate a bunch of memory,
and then you go through it and maybe add up the numbers
so the optimizer doesn't optimize it out,
you see that, whoa, you can actually read a gig of memory
in about 100 milliseconds.
So maybe that's not so crazy
if you can also do a little bit of caching, right?
So suddenly these things that weren't even solutions before become solutions, become
plausible, right?
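A minimal way to check that claim yourself, assuming NumPy is a fair stand-in for the C program Simon describes (a hand-tuned C loop will be somewhat faster, but it's the same order of magnitude):

```python
import time
import numpy as np

buf = np.ones(2**30, dtype=np.uint8)   # allocate ~1 GiB of memory

start = time.perf_counter()
total = int(buf.sum())                 # force a full scan of the gigabyte
elapsed = time.perf_counter() - start

print(f"sum={total}, scanned 1 GiB in {elapsed * 1000:.0f} ms "
      f"({1 / elapsed:.1f} GiB/s)")
```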
My favorite thing about this is that I run this newsletter called the Napkin Math Newsletter.
If you Google Napkin Math Newsletter and Simon, you should find it.
And essentially what this is, is that this is my kind of monthly exercise in napkin math.
So I pose myself these problems. A problem that I might pose to myself is: how many transactions can MySQL fundamentally do every single second, right?
Is it 1,000? Is it 10,000? And so I sat down and tried to construct kind of a simple model of how MySQL applies a transaction, right? So I start to, kind of from the bottom up, think about this. So then it's like, okay, you have to parse the SQL query. That's probably pretty fast. Then you have to sort of figure out what's in this insert; there's a bunch of data, also pretty fast. So how do we commit this so it's durable, right? There's this whole ACID guarantee that we have to hold up, that if the server shuts down, it either needs to be committed or not. So what does it need to do to do that? Well, it needs to take that transaction, that insert, and put it at the end of a file.
And then it needs to tell the file system, hey, commit this, send it to the hard drive,
and don't tell me that it's committed before you're sure that it's committed to the hard drive.
Right. Do what you said you're going to do.

Yeah, exactly. And that operation is called fsync.
So then the hypothesis forms, right?
The napkin hypothesis forms that, well, the number of transactions you should be able to do in MySQL every single second must be equal to how many fsyncs you can do per second, right?
Unless there's a bottleneck somewhere else.
Because that is the biggest number in a single transaction, right?
So you whittle it down to like, what does one look like?
And then you add up whatever the time is for one; in this case fsync outweighs every other thing, which pretty much rounds to zero. And so that's why you say the number of fsyncs is the limit, because it's just massively larger than any other time that there is. You know, you take that, apply it to like an HTTP request, say, well, the network time is massively bigger than any other thing. Just throwing that out there. So you're figuring out what it is for one, and in this case, it happens to be that fsync is pretty much what matters.
Exactly.
So yeah, you look at the numbers, you're like, how long does it take to send a query to the
database?
Oh, probably less than a couple hundred microseconds. How long does it take to parse the query? Well, that's like, you know, a couple hundred bytes. That's like less than five microseconds. I'm just throwing out some numbers here, but all of these numbers can be found on github.com/sirupsen/napkin-math.
And then you see that,
oh, there's an fsync operation here. And fsync is benchmarked at about one millisecond.
In the whole scheme of things,
that's actually a fairly long time.
And that seems to be the bottleneck for MySQL because the network and so on is typically not the bottleneck.
So, yeah, you take one millisecond and you divide it into a second and you see, OK, that's got to be a thousand transactions a second.
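Here's a rough sketch of how you could measure that fsync number on your own machine; the path and record size are arbitrary choices of mine, and the result depends entirely on the drive.

```python
import os
import time

# Append a small "transaction" and fsync after every write, like a
# naive durable database would. Note: don't run this on tmpfs.
fd = os.open("/tmp/fsync_test.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
record = b"x" * 128

n = 1000
start = time.perf_counter()
for _ in range(n):
    os.write(fd, record)
    os.fsync(fd)  # don't return until the data is on durable storage
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{n / elapsed:.0f} fsyncs/second (~{elapsed / n * 1000:.2f} ms each)")
```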
Right. And so what I did for this edition of the Napkin Math newsletter where I investigated this was that I went and I actually tried to do as many transactions per second as I could.
And I found that I could do about five to seven thousand transactions per second. That's way higher than the thousand that I'd estimated from first principles. So now we have what I call the first-principle gap, right? We've constructed a simple bottom-up understanding of how the system works, and we have a real result, and there's a gap between them; they don't line up. Sometimes, you know, if it's like twelve hundred and a thousand, it's probably fine, it's rounding errors. But this is a significant enough gap that there's something there. MySQL is probably cheating somewhere.

Yeah.

My understanding of the system must be wrong. And as it turns out, it was. So I started looking into it, I wrote some BPF probes to try to figure out what MySQL
was doing and reading some of the source code and some blog posts. And what it turns out that
MySQL does is that it does batching, right? If you have five transactions that come in in the
same millisecond, it's going to apply them as part of the same fsync. So instead of doing an fsync for every single transaction, it's better that it tries to group those commits together. And that's literally what it's called: a group commit.
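Putting napkin numbers on the batching (my arithmetic, not from the episode): if the drive sustains about 1,000 fsyncs per second and each group commit carries a handful of transactions, the gap closes.

```python
fsyncs_per_sec = 1_000    # ~1 ms per fsync, from the earlier estimate
tx_per_group_commit = 6   # assume ~6 transactions land in each batch

print(fsyncs_per_sec * tx_per_group_commit, "transactions/sec")
# 6000 -- right in the five-to-seven-thousand range that was measured
```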
There's lots of examples of these kinds of discrepancies. An example from another context that I really like was when Elon Musk wanted to build SpaceX. He went, I think, to the Russians, and he's like, how much does a rocket cost? They're like, $120 million. And he's like, that's ridiculous. And then he multiplies, you know, aluminum cost and titanium cost and so on, probably 10 tons of this and a hundred tons of that, with the spot prices on the London Metal Exchange, and says, okay, well, all the materials for a rocket cost $7 million. So that difference, that $113 million.
Yeah, what is that?
Between the raw costs.
What's in there?
That's inefficiencies, right?
We should be able to do better.
And in fact, he was able to do better.
Was he right about that?
Yeah, he was right.
He was right, yeah.
Gosh, that's smart.
Well, that's what happens when we assume, though, right? Like you mentioned, this sort of root-cause understanding of a system, you never do that. You assumed that, you know, one MySQL write was equal to one fsync, at least that's my understanding of what you're saying here. And so you went into this problem with an assumption that was incorrect, and once you learned more, which is good for a developer, to learn more about a system, you can then have more understanding of it and now go beyond just simply this limitation and start to understand that gap, as you mentioned, the first-principle gap.
Yeah.
And this was an example where my understanding didn't line up, but oftentimes the napkin
result is much better than the real world result, right?
So something I was doing in a recent newsletter was that I was trying to figure out how fast
we could serve a simple free text query. And Lucene, which is kind of the standard for doing free text searches
and inverted indexes. Lucene is about I think 20 times slower than my napkin math for this. So I
reach out to one of the maintainers and I'm like, can you explain this? Right? Is there an
opportunity here to optimize Lucene? Or is my understanding off, right? And I've found both
scenarios, right? Where, well, there's something we can optimize in the system or there's something
wrong with my understanding of it. But typically it's my understanding that's wrong, but sometimes
there's a very real inefficiency. Like someone has just written the code incorrectly or it's
not written in an optimum way. But going back to your original example here, you mentioned the
meeting, right? It sounded like you were kind of battling against this inefficiency of time.
You'd mentioned roughly a month being wasted or at least exploratory to discover this.
Whereas if you took, I don't know, those two engineers' time, you know, writing, investigating, arguing, taking lunches, whatever it took to come to this understanding of the system; how much is that time, you know, versus the time it takes to investigate and do some sort of napkin math example, to have an estimation, I suppose, of how it might perform?
Well, the napkin math can often be done in a few minutes, right?
Usually the bottleneck is not doing the napkin math.
The bottleneck is understanding the system, right?
Enough that you can model it out in napkin math.
So if we're trying to come up with a more concrete example, right?
It might be something like, let's say, for example, that we have a Redis in production. And the production team that runs this Redis instance says, okay, we've hit the max
throughput, this Redis is doing 10,000 requests per second, and we need to shard Redis, right?
And sharding Redis, that's a big undertaking, right? Now you have to change all the application
code to be on multiple Redises. If you're doing anything that operates on multiple keys, you have to make sure they're on the same Redis. It's a big undertaking, right? Now these developers have to spend months and months sharding this Redis. Well, the napkin math person, you know, the annoying person in the meeting with the napkin math hat on, might say, hang on, 10,000 requests per second? That's nothing. Machines are fast. And they might say, okay, well,
reading about 64 bytes of memory
takes about five nanoseconds, right?
If you divide these things together,
like theoretically, you should be able to do
hundreds of thousands of requests per second
per Redis instance.
So what's going on here?
Why are they reporting 10,000 requests per second
when the theoretical upper bound is hundreds of thousands of requests per second? Again, here's the discrepancy.
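The napkin version of that upper bound, with round numbers in the spirit of the repo; the per-request overhead is an assumption I'm adding to keep the estimate honest.

```python
read_64_bytes_s = 5e-9     # ~5 ns to read a 64-byte value from memory
per_request_s = 1e-5       # assume ~10 us of parsing/syscall work per GET

requests_per_sec = 1 / (read_64_bytes_s + per_request_s)
print(f"~{requests_per_sec:,.0f} requests/sec on one core")  # ~100,000
```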
Is it my understanding of the system that's wrong? Is Redis doing a lot more than just
reading memory and serving it over the network? Or is there an opportunity here? Is there
something that's wrong with the system? So in this case of Redis, something that I've
seen before is that Redis will get a lot slower if you have a lot of connections to it. So if you have a language that's not particularly fast,
and it's spending a lot of time talking to Redis, you might have thousands of servers that are connecting to that Redis, causing tens of thousands of connections. Now Redis is not spending all of its time serving all of these queries. It's spending a lot of time just in epoll calls,
basically figuring out which connection is active now. So then you might find, oh, maybe instead of doing all the sharding effort,
we can put a proxy in front of Redis, like Envoy or something else,
to reduce the number of connections on Redis.
And suddenly, we don't have to shard it.
We just have to put a proxy there.
And these developers might have just saved like three or four months of sharding work, right?
And all of the risk that taking something like that on entails, right?
Like now you have these keys on different servers
and you're almost certainly going to mess that up.
So that might be an example where napkin math
really helps guide your decisions
because it just questions like,
is this really the maximum throughput?
Well, you've done a lot of people a service
by putting the numbers out there on that repo
that you referenced.
You have things like the latency
and the throughput of system calls, hashing, context switching, TCP echo servers, all the things. Because that's where I'd probably get stuck: I would understand the system to a certain degree from first principles (I do want to ask where you start; we'll get back to that), but once I have that understanding, I'm like, I don't know how long this thing generally takes. And probably these are a Google or two away, but you can always find the one result that misleads you completely and ruins your napkin math. So it's pretty cool; people want to do this. We'll link out to the repo where a lot of these numbers are out there. There's also a lot of question marks, like how long does a mutex take, please contribute. So there's some places to contribute there.
But let's go back to that very first example.
Maybe the search example.
So he's like, I'm going to go read about these indexes and how they work.
Well, how did you know that search works that way?
How did you know that that's a place to start?
Because you've got to find the bottom to build up from there.
And sometimes that can be a big effort as well,
just knowing where do I look?
It's a really good question.

It's also similar to yak shaving.

Sort of. It can be.
You can really yak shave on this kind of investigation.
Right. Instead of using Lucene, you built your own Lucene in the process.
Yeah, exactly. But yeah, where do you know where to start?
Yeah, I think one thing that I definitely just want to point out before we go further
is that this napkin math is not my idea.
This is not an original idea at all, right? People have been doing this since the beginning of time. This is how we find out if a business is a good idea, right? We're sitting in a diner, writing on a napkin: if I sell this many widgets for this price, am I going to make money or not?

For sure.

And in kind of the computer science realm, Jeff Dean, who's, you know, the legend engineer at Google who stands behind a lot of the engineering that a lot of us build on top of, I think he had a slideshow where he sort of mentioned it as, like, oh, by the way, this is something that you might want to do, and then posted some numbers that have been going around. I decided to create my own because it's fun, like sitting there and disassembling things to make sure that it's as fast as it can get, and writing the Rust program to do that, but also because I was missing more than just the latency. I wanted the throughput, like how much can you process in a millisecond? How long does
it take to process a gigabyte? So to go back to your question of how do you develop this first
principle understanding? I think in a lot of cases, you can ask an expert, right? A lot of places
might have a, for example, if you're modeling something like MySQL, you're going to have in a
lot of cases, at least at larger companies,
someone at some DBA who can tell you how that B-tree is laid out on disk. And that's going to
be a really, really enlightening conversation for you because you can't do the napkin math
unless you understand the system. So for something like an inverted index, well, you can read about
how an inverted index works. There's a book on Lucene called Lucene in Action. And I essentially
just started reading that book. And then you sort of like develop just a stronger and stronger model of how this works.
You read kind of the, there's some documentation for Lucene and how it's implemented.
And then you start seeing, okay, well, like it's sort of implemented like this.
And so if you have, you know, you need to find a term that is mentioned in a million documents, and you also want to check that against another term that's also in a million documents, well, then you probably have to read one million document IDs plus another one million document IDs.
Each one of those identifiers is like, you know, let's say 64 bits.
So now you have like 2 million 64-bit integers.
And then you can roughly figure out how long is that going to take to read and join those
two together and do a search across both of them.
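Putting rough numbers on that read (mine, using a ballpark sequential memory bandwidth):

```python
ids = 2 * 1_000_000       # two posting lists of one million document IDs
total_bytes = ids * 8     # 64-bit IDs -> 16 MB to stream through
bandwidth = 10e9          # assume ~10 GB/s sequential memory reads

print(f"{total_bytes / 1e6:.0f} MB -> ~{total_bytes / bandwidth * 1e3:.1f} ms")
# 16 MB -> ~1.6 ms: intersecting two million-document terms is napkin-cheap
```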
And then you also get into things that's like,
oh, it's actually possible to read faster than the 64-bit integers,
because you learn randomly that Lucene actually does really good compression
and in a lot of cases can get these down to about 8 bits per document ID stored.
And this is counterintuitive to a lot of people that reading compressed data
in a lot of cases is actually faster than reading uncompressed
data because your machine is bottlenecked by the memory bandwidth that you can get. So between you
getting pages from memory, you have a lot of CPU cycles where you're not doing anything. So if you
can get more in each memory page and then decompress it, uncompress it in the spare cycles until you
get the next piece of memory, you can often read faster than otherwise. But of course, this gets into the nuances of like, now we're beyond the napkin math. Now we're no
longer concerned about just getting this right within an order of magnitude. Now we're really
trying to squeeze this out. But yeah, essentially, you just have to start reading the literature,
which is usually a good practice, I think. But yeah, you can end up in a yak shave, right? This whole yak shave of, like, reading a paper comparing different compression techniques for storing integers in something like an inverted index. Well, that's probably a yak shave that I didn't need to take, but it turned out to be really interesting.

Yeah, there's a nice side effect of knowledge, right? You're getting the knowledge as a side effect. Because there's two ways of going about it that I see. You're tasked with this thing: well, let's evaluate what search solution we're going to do, whether we write our own, use this, whatever it is.
Well, the first thing is, which I do
oftentimes, well, how long would it take
for me just to try it? Like a feasibility
kind of spike.
Well, I know that in your case with the meeting
you have a month lag time because you got
lunches and stuff. Apparently they're going to lunch
a lot, Adam. But, you know, maybe I can
do that in two weeks.
Well, he mentioned two weeks each and I figured they would do lunches
and talks.
What are these
developers doing?
Walks to like
vacations,
you know,
long weekends,
time at the cottage.
Yeah, exactly.
So I don't know
what they're doing
all that time.
But, you know,
sometimes you can
spike out a thing
in a couple hours
and get your answer.
But you don't have
the nice,
you have the answer
of is this feasible
or is this a good idea,
but you don't necessarily
have the side effect of, I still don't know how it works, I just know that it worked out in the math, right? But the napkin math way of going about it, if you don't understand the system from first principles, you can't really just grab a napkin and get your numbers. You've got to go get the knowledge, and maybe that takes two hours and then some, but you ended up with: now I understand how search works.

You know, if you can't do the napkin math, it's probably also too early to go and actually implement the system. I call this programming through the wall, when you just keep writing code and it's like, oh, I'm almost there, and then you just write code a little bit harder, right? When in a lot of cases you just want to step back and think about the system and learn a little bit more about it. But I don't mean to say here that all tech problems can be solved by just sitting with, like, you know, a piece of paper and a pen and doing all of this, right? In a lot of cases, you just need more information from actually writing some code. And you can often get stuck in a rut of just analysis paralysis. But I think that napkin math plays a bigger role than, and could play a bigger role than it does now, for a lot of projects.

Right. It's a tool for your toolbox.
What it seems like is you're encouraging this exploration though, so that you don't go and waste
the two weeks to go and implement an example and then two more on the argument or, you know,
in that scenario in particular, like you're encouraging one other option to take rather than
a Redis rewrite that
might take months and months and months on an assumption, when you could have just put Envoy in front of it, you know, proxy it, and solve the problem. To encourage that exploration, I think, is what's kind of key here. Knowing more about the system is always going to be a good thing. It may be a yak shave in some cases, or it may not be, but it's going to deepen your understanding, and you're encouraging exploration.

Totally. And I mean, that's also how I use the
newsletter, right? Is that there are these problems that are ruminating in my mind that
I'm very interested in. Like recently, I was interested in how do you synchronize data really
efficiently between a mobile client and a server? How do you do that like really, really well?
And so I just decided that I was going to make a napkin math problem about it, right?
And then just started thinking about how could this work and then diving out and adding complexity
as I found out that the simplest solutions wouldn't work.
And that exercise is, I think, is really, really valuable.
But it certainly takes time.
So is the way the newsletter works is you send out the problem and then you follow up
with the math solution or do you send it all out at once?
Like, is it interactive?
Do I get a chance to do my own math?
And then we reconvene with your answers?
Or how does that work?
Yeah, that's what I did for the first.
I've been writing a newsletter for about a year now.
And for the first maybe nine or so, I did exactly that.
I sent out like, hey, here's the problem.
Here's the scenario.
You know, your co-worker is saying the Redis is slow.
Is it really slow?
What's the theoretical max throughput?
Blah, blah, blah, blah.
But what I found was that a lot of people just didn't do it. There's a couple of
signals that just said that people just didn't do it. So people read it like a blog post,
but where the context was a month delayed. So I've switched format now to just make it more
of an article. But I really hope that people are doing this behind the scenes. And something that
I also didn't mention, but something where napkin math is incredibly useful is financial estimates,
right? Like, how much money is it going to take to store all of this data? How much money is it going to take to run this streaming processing job all the time? How much extra money is it going to take to run, you know, another 50 Redises? And I have all those numbers in the napkin math GitHub repository as well, kind of rounded to numbers that are easy to do some math with.
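For instance, a storage-cost estimate might look like this; the price is a rounded figure I'm assuming for illustration, so substitute your own provider's.

```python
data_tb = 200                # how much data we want to store
price_per_gb_month = 0.02    # assume ~2 cents per GB-month of object storage

monthly = data_tb * 1_000 * price_per_gb_month
print(f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")  # ~$4,000/month
```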
What's up friends? Are you looking for a way to instantly troubleshoot your applications and services running in production on Kubernetes? Well, Pixie gives you a magical API to get instant debug data, and the best part is this doesn't involve code changes and there are no manual UIs, and this all lives inside Kubernetes. Pixie is an API which lives inside your platform. It harvests all the data you need, and it exposes a bunch of interfaces that you can paint to get
the data that you need. And it's essentially like a decentralized Splunk. It's a programmable edge intelligence
platform, and it captures metrics, traces, logs, and events, all this without any code changes.
And the team behind Pixie is working hard to bring it to market for broad use by the end of 2020.
But guess what? Because you listened to this show, I'm here to tell you how you can get your
hands on the beta today by going to pixielabs.ai.
Links are in the show notes, so check them out to click through to the beta and their Slack community.
Once again, pixielabs.ai.
And look forward to a Pixie day coming soon.

All right, Simon.
So say that somebody is sold on the idea of adding this tool to their belt of tools they
can reach for when it's time to solve a problem or do a feasibility research.
And they're like, let's just do some quick napkin math.
But I've never done this before in the context of systems.
Maybe you've done it with your budget or some expenses or a business idea, but haven't done
it well.
And I don't really trust my ability
to come up with an answer
that I'm going to have much confidence in.
You have some techniques that you apply
and some tips for getting started.
Do you want to walk us through those?
Yeah, absolutely.
So these are in the GitHub repo as well.
So the first one is to not overcomplicate it.
We had this example before of a Redis instance, right?
And what are the things that actually matter, right?
Let's actually, let's take a database query instead, right? When you're committing a database
query to disk, the latency that's going to matter is committing the query to disk. Parsing the SQL
statement is not really going to matter. Like maybe you add on a couple hundred nanoseconds,
but in the grand scheme of things, it's just not going to matter. So just don't put those things
in there and just focus on
the biggest, slowest bottlenecks in the system that you're trying to model. So that will be the
first thing. And my kind of my rule of thumb is that if you have more than six assumptions, like
more than six additions in your napkin math, you probably need to simplify things a little bit.
That's usually a bad sign. The other thing too, is when I do napkin math, I usually try to keep
the units.
So this is things like, for example, the kilobytes or terabytes and things like that, just keeping those there, or, you know, terabytes per second or requests per second. Keeping the units is really, really handy, because then you make sure that you don't get a wrong number. So it's just kind of a checksumming. And often I don't actually do this on a
napkin. I just do it in Wolfram Alpha because Wolfram Alpha is very, very good at handling
units. It's very good at handling conversions between different units. So kilobytes to terabytes
the other way around. And so usually I just type in things with the units into Wolfram Alpha,
and then it gives me the right result. And if the units look weird, then I know that I did
something wrong. Plus it helps you conceptualize it better.
Like if you're thinking in megabytes and you type in megabytes, it just conceptually is right there versus having to do the conversion yourself and then having to convert it back
when you think about it.
Exactly.
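A toy example of carrying the units through (my numbers): if the units don't cancel to what you expect, the calculation is wrong somewhere.

```python
# How long to stream 500 GB at 100 MB/s? Keep the units in the names:
dataset_gb = 500
throughput_mb_per_s = 100

seconds = (dataset_gb * 1_000) / throughput_mb_per_s  # GB -> MB, MB/(MB/s) = s
print(f"{seconds:.0f} s (~{seconds / 60:.0f} minutes)")  # 5000 s, ~83 minutes
```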
And then the third one is to calculate with the exponents.
So often if you end up having something like, you know, 3.924 times 10 to the like eighth power or something,
like just lose everything after the decimal. It just doesn't matter in the grand scheme of things. With napkin math, you're just trying to get within an order of magnitude of the actual performance of the system. And as long as you're within that order of magnitude, you've probably done it roughly right. That's one that I always make sure of.
And this also means that it's just much easier to do
if you are doing it in a meeting room on a whiteboard
that you just have to multiply the coefficients and add the exponents together
instead of trying to do like multiplication of fractions
and things like that.
That's just not fun.
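A quick illustration of the exponent trick, with numbers I'm making up: multiply the coefficients, add the exponents, and round aggressively.

```python
# 4e8 items at 5e-9 seconds each: coefficients 4 * 5 = 20,
# exponents 10**(8 + -9) = 10**-1, so roughly 20 * 0.1 = 2 seconds.
print(4e8 * 5e-9)  # 2.0
```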
You're going to embarrass yourself in front of your coworkers.
You'll be umming a lot.
Yeah.
Got that one wrong.
You'll get your phone and pull up Wolfram Alpha, Alfred.
Nevermind.
You'll be, yeah, you'll be stuttering.
Because that's the beauty, like the reason why napkin math,
like when you're by yourself, you have a calculator available, right?
And you can write that out and put that in the GitHub repo.
It doesn't mean that you shouldn't try to still keep the units and not overcomplicate things,
because otherwise your co-workers are just going to approve the PR because it looks complicated, right?
And also just keep yourself to high school math, right?
You don't need anything fancy.
And then I think the last one that I have on the list here is arguably the most important,
which is to do what's called Fermi decomposition of the problem. And this sounds really fancy, but it's really not. It's just
decomposition with a fancy name. And the reason why it's called Fermi decomposition is because
there's this physics professor called Fermi, Enrico Fermi, I think is his full name. He was
Italian. He worked on the Manhattan Project. And he was a very revered physicist among
his coworkers because he had this knack for estimating things. So for example, when they did
the famous first detonation of an atom bomb, in New Mexico, he dropped a couple of shreds of paper
from the air. And based on how far they moved after the blast, he estimated to pretty good precision how strong that atom bomb was, which was very remarkable at the time.
Because actually doing the calculations for that is probably beyond any of our math skills and took them weeks to do.
But he had an estimate immediately. And you have to remember that this was at a time where people were literally afraid
that they were going to blow open the ozone layer because they just did not know how powerful this
was going to be. So the fact that they had that right there and then, and he did an estimate that
was so close, was remarkable at the time. He's very famous for this. And probably the most famous example of a Fermi decomposition is answering these kinds of questions. I'm imagining that he sort of went around the Manhattan Project and, over lunch, asked his co-workers these ridiculous questions. How many piano tuners are there in Chicago is the really famous one. And it's like, how are you going to answer that, right? Who cares? And how are you going to answer that?
You break it down. That's how you do it.
Yeah, exactly. So you break it down.
So you go like, okay, probably we should know roughly how many people there are in Chicago,
right?
And again, this is napkin math.
We just have to be within an order of magnitude, then it will all work out.
So there's maybe like, I don't know, 9, 10 million kind of in the metropolitan area of
Chicago.
So that's like an estimate that I think that most people could probably get there somewhere
between 5 and 10 million.
And then you think, well, okay, how many people are there per household? So maybe like two people
per household on average in that area. And then you start to think how many households might have
a piano. Do you guys have a piano?
No.
Do you have a piano?
No piano here. Well, I have a keyboard. Well, I guess it's a piano. It's not like a grand piano. It's more of like a keyboard piano.
Does it need a tuner?
Nope. Digital.
It's not a piano then.
So we might say maybe like one in 20 households have a piano, right?
I was going to go one in a hundred.
You were going to go one in a hundred.
Okay.
That's high.
Or low.
So we could go one in 20, one in a hundred, one in 50.
Yeah.
You know, I think there's a lot of homes with pianos where they just can't get rid of them
because getting rid of them is the worst.
And then we might estimate how often these things are tuned, right? You know, the estimate that I used when I was doing a presentation on this was about once a year. That seemed really high, that one in 20 households would tune it once a year; maybe it's more like once every few years. And then you might think, okay, so then we have to think about how much a piano tuner can do. So tuning a piano, probably including driving within the Chicago metropolitan area, would maybe take about two hours. And then we assume that a piano tuner works eight hours a day, maybe about 50 weeks a year, or however many weeks a year Americans work. This is in America, so 50 weeks a year.
Fair.
And then we can start to compose these numbers. And we say, okay, we look at how many pianos there are and so on.
And we say, there's probably about 200,000 piano tunings per year in Chicago.
How many can each piano tuner do?
So maybe about 1,000 if you use those numbers from before.
And then we get somewhere around 200 piano tuners in Chicago, right?
So that's the rough estimate.
And this technique is called Fermi decomposition.
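For anyone who wants to audit the arithmetic, here is the same decomposition as a minimal Python sketch; every input is one of the rough guesses from the conversation, deliberately only order-of-magnitude accurate:

```python
# Fermi decomposition: piano tuners in Chicago.
# Every number is a deliberate order-of-magnitude guess.

population = 9.5e6               # metro Chicago, somewhere in 5-10M
people_per_household = 2
piano_share = 1 / 20             # 1 in 20 households has a piano
tunings_per_piano_per_year = 1

households = population / people_per_household            # ~4.75M
pianos = households * piano_share                         # ~240K
tunings_per_year = pianos * tunings_per_piano_per_year    # ~240K

# Supply side: tunings one tuner can do in a year.
hours_per_tuning = 2             # including driving
hours_per_day = 8
days_per_week = 5
weeks_per_year = 50
tunings_per_tuner = (hours_per_day / hours_per_tuning) * days_per_week * weeks_per_year  # 1,000

print(round(tunings_per_year / tunings_per_tuner))  # ~238 tuners, same order as the ~200 above
```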
How many actual tuners are there?
And again, it's not meant to be on the money.
It's meant to be within an order of magnitude
because one thing I'd throw in there might be a curveball
is you assume that the supply is equal to demand.
Yes.
Right? Because there may be more people capable of tuning a piano, even though they may not do it professionally, and therefore the supply may actually be disproportionate to the demand.
Yeah.
Now you're going from napkin to somewhere else, right?
Right.
More granular.
Theoretical, yeah.
We might have done this napkin math because we wanted to figure out, is there any opportunity here, right?
This could be someone trying to do product market fit or whatever, right?
And they're looking in the phone book, and they're seeing, like, one piano tuner. They call him up and try to book with him to figure out how busy he is. And it's like, oh, this piano tuner can only be booked three months out. And they call someone else, and she says, oh yeah, I'm booked all year. And then this person is like, oh, there's a big opportunity here, because there's a mismatch between what's in the phone book and what I estimated. But yeah, and then you might do a little bit more analysis after that.
that. Which we're getting to a good point, which is, what's the point?
Right.
What question are you asking?
You know, what's the point of the napkin math?
What's the whole point?
Right.
It's not to get to an accurate number.
In particular, it's to determine a good next step, right?
Exactly.
Right.
It's to answer a different question, right?
The question you're asking is not the one you care about.
Because if you cared about that question, you would ask it in a much more granular way.
Like you would say, well, what about churches
and community centers?
They're likely to have pianos.
We should add those in, right?
But we don't care about the actual piano tuners.
We care about some other question we're trying to answer,
which is like, is there an opportunity in Chicago
to open up a piano tuning business?
The question that you're trying to answer
with napkin math is, is there something there, right? Exactly. And then, you know, I think about decisions in kind
of a decision tree, right? And you have these branches. And your job as a decision maker is
to figure out how far down these different branches you need to go. And to chop the ones
off that don't look fruitful as fast as you can, right? So an example of using this right might be something like you
receive your bill from your cloud provider, and the bill is $100,000. Right. And you're like,
it seems pretty high, right? Right. You're like deep into red, and you're trying to figure out
is this reasonable or not? Right. And so you might say, okay, you look at it, and you say,
I'm doing about 10,000 requests per second.
You're doing this on a whiteboard, right?
You're like in crisis mode because you're deep into red and you're doing this with your co-workers.
Okay, friends, we're doing 10,000 RPS.
Each one is 100 milliseconds, right?
So if this is single threaded, we divide those two numbers and we see that we need about a thousand CPUs to serve all of this traffic, if all of that is CPU time. Okay, so if we know that a CPU, one CPU costs $10 a month,
then we multiply 10 by a thousand,
and we know that to serve all this traffic
should be about $10,000 a month, right?
So then, now we have an estimate here, right? Our bill was 100,000 a month, and our main application should cost roughly 10,000. What's going on here? You know, we have this gap again, right? And so you might add in, like, database costs and so on, but they're just not adding up. And then you start going into it, and you find out that one of your co-workers left, you know, 200 machines running that they were training some machine learning model on, and they forgot to turn it off. And that happened. And then you have an
RCA and you figure out that you need to have something that monitors how many machines are
running, or whatever, right. But these are the kinds of things where again, the question you're
trying to answer is, is there something here, right? Or if these numbers added up to 70 or 80,000, it's like, okay, this must just be what it costs; we need to optimize it.
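That whiteboard calculation fits in a few lines. A sketch with the round numbers from the conversation, where the $10-per-CPU-month price is the assumed figure used above:

```python
# Sanity-checking a cloud bill from first principles.
requests_per_second = 10_000
cpu_seconds_per_request = 0.100   # 100 ms of CPU per request
cost_per_cpu_month = 10           # dollars; assumed round number

cpus_needed = requests_per_second * cpu_seconds_per_request   # 1,000 CPUs
expected_bill = cpus_needed * cost_per_cpu_month              # $10,000/month

actual_bill = 100_000
print(actual_bill / expected_bill)  # 10.0 -- a 10x gap, so something is off
```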
Yeah, that's really powerful stuff. I also think when you're doing feasibility, again,
if we go back to the opportunity to start a business or have a business that actually,
you know, the dog hunts, you know, you're comparing your potential revenue versus your
potential costs. And so the cost calculations, if you're going to be cloud-based, a lot of times
exactly what you're doing, you're estimating how much this is going to require us.
What are we going to be paying out a month to Amazon or to Microsoft or to Google?
And is that going to actually scale alongside the revenue that we come in?
So you just have your napkin math on the revenue side and your napkin math on the expense side, and start to make some decisions on: is this completely upside down?
Is it tight?
Is it obviously an opportunity?
And then once you start having those answers, then you can say, well, it's obviously an opportunity. Let's get more specific, right? Let's fill in those gaps, take out the napkin and put in the calculator, you know, the more specific spreadsheets, and drill in. But if it's completely backwards, let's not waste our time with the details; it's not going to work.
How many times have you done this, Simon?
I don't know. I've lost count.
Hundreds? Thousands? Give us an order. Just back of the napkin. Give me a napkin math underestimation.
Okay. I've been alive for this many years. You know, I don't write that much code anymore,
but then I would be like, how many PRs do I do a week? How many of them require napkin math,
right? But I really find that it's just useful when I'm reviewing code; I also think about this, right? It's not necessarily that I'm sitting down drawing something, right? But I'm like, okay, this person is getting this much
throughput on this, or they're doing these kinds of calls on this critical path. Like, is this
going to work or not, right? We talked about that MySQL extension earlier, right? Okay, it's doing this many syscalls. And we know, based on the napkin math, that a syscall might take about, let's see here, 500 to 1,000 nanoseconds, depending. We're doing this many, and we know how much overhead we can roughly introduce per query. And we say, okay, we need to reduce the number of system calls we have here, because we have a very, very tight budget, right? So we might look at things like that.
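A sketch of that budget check; the per-syscall cost is the napkin figure from the conversation, while the syscall count and the overhead budget are made-up numbers for illustration:

```python
# Syscall overhead budget for a query path.
syscall_cost_ns = 750         # napkin figure: roughly 500-1,000 ns each
syscalls_per_query = 40       # hypothetical count observed in the extension
budget_ns = 10_000            # hypothetical: 10 us of allowed overhead per query

overhead_ns = syscalls_per_query * syscall_cost_ns   # 30,000 ns = 30 us
print(overhead_ns > budget_ns)  # True -- over budget, so cut syscalls
```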
And over time, you also start to build an intuition, right? I'm sure that both of you have encountered people who have mastery over some domain, and they just look at what you're doing and say, like, yeah, this is not going to work. And you're like, what do you mean, it's not going to work? Yeah, they just know immediately, like, nope, doesn't make sense.
There's that famous story, right, of a firefighter who took
his team into a building. They were trying to, you know, get people out and so on, and I think they emptied the building. And then they were standing in the lobby of this building, and suddenly the guy who was in charge of this operation said, everyone out. And people were like, what do you mean, everyone out? But you follow orders in these kinds of circumstances, and they all ran out. And about a minute or two after they were out of the building, the floor collapsed.
So how did he know that, right?
Well, he built some kind of mastery, right?
And mastery is built by just deliberate practice over time.
And at some point, you might not even need to really reference these numbers anymore
because you start to have a pretty good feeling for what's fast and what's not fast.
The point of me asking that question was
to really get to how many times has this saved your bacon, so to speak? The reason why our audience
might care deeply about this or pick up this practice is because, you know, one, you're
introducing this idea to us, even though you're not the inventor of the idea. But the reason why
you do it has been because it's paid dividends in your career.
Yeah, I mean, it's hard to say, because I don't have the timeline in front of me or the parallel reality where I didn't have this,
right? It's hard to give a number for the biggest place where I end up applying it again and again. I would say I do this at least once a week, applied to something at work, right? Where it has some impact; what that impact is, it's not always clear. But the impact almost always is, hey, the simple solution is going to work, like, it's fast enough. Because engineers, if they have an idea for how to make something fast, they usually will build it, even if it takes longer, and they will justify why that's the best thing to do. But then you realize that reading, you know, a megabyte of memory, even on every single request, is probably not your bottleneck, right? That only takes 100 microseconds, which is not really that long if your requests are taking 100 milliseconds, right?
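The arithmetic behind that claim, with an assumed round figure of about 10 GB/s for sequential memory reads:

```python
# Is reading 1 MB of memory per request a bottleneck?
memory_read_bytes_per_s = 10e9   # assumed ~10 GB/s sequential read
bytes_read = 1e6                 # 1 MB per request

read_time_s = bytes_read / memory_read_bytes_per_s   # 1e-4 s = 100 us
request_time_s = 0.100                               # a 100 ms request

print(read_time_s / request_time_s)  # 0.001 -- 0.1% of the request, negligible
```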
So yeah, it's hard to answer the question directly.
Yeah.
You mentioned this newsletter you have.
Where can people subscribe to that
so they can follow along
as you do more of these napkin problems
and you share them?
Yeah, so you can go to,
it's linked from the GitHub repository,
github.com slash sirupsen slash napkin dash math.
You can go to my website, which is sirupsen.com slash napkin, and you can subscribe there.
If you Google just, you know, Simon napkin math, I think it will probably come up as well.
It's kind of a niche market.
Yeah.
And yeah, then every month you should receive some kind of deep dive.
And my coworkers joke with me because they know exactly what project I'm working on
based on the napkin math newsletter. So I'm very much doing this on things that I'm actually
working on, on real problems. So, you know, by next month I'll have done this at least 12 times, because these are real problems. And a lot of these I send to co-workers because they ask a question, and then I go deep on it in one of these newsletters. So it's very real.
Yeah.
Well, you got one new subscriber.
I'm subscribing right when we hang up this call.
One last question for you.
Do you have a specific brand of napkin that you suggest?
I actually do not own any napkins at all.
So I've never done this on a napkin.
It's terrible.
I do all of this on an iPad.
I thought maybe you were just working for Big Napkin. You're just out there shilling napkins.
Yeah.
I should have some napkins in the
background here. You should. Set this up a little bit
better. You should come up with a little
Simon branded napkin. You could sell
those on your website. That's true.
Come on, merchandise. That's true. You know, if I'd done that,
maybe I'd make some money after this aired.
Yeah. Because unfortunately, all of this is
free. I'm not earning a dime on this.
Huge missed opportunity.
Do you know anyone in Big Napkin?
We do now.
They're going to reach out to this and be like, we can sell you some napkins.
We can help you out here.
Strike up a product placement deal.
Perfect.
Can we sponsor the show in retrospect?
And it should be sort of the size of the airline napkin, like those really, really low-quality ones, because then you run out of room real fast. You have to grab another one.
See, now you're using two.
Exactly. You need constraints.
I bet you also the airlines, you know, they're desperate to make money right now. They'll sell you some of their napkins.
I think we should just do the math and see if it's going to be a business.
True. I like it. Awesome, Simon, thanks for sharing this cool stuff, this wisdom, and this exploration.
This desire for curiosity, I think, is pretty cool. What's interesting to me is that you encourage this exploration to see if there's actually something there worth doing more of or not, whether the original assumption was correct.
But links will be in the show notes.
Listeners, you know that.
So the repo and the newsletter, all that stuff, check your notes.
You will see it there.
Simon, thank you so much.
Thanks for having me on the show.
That's it for this episode of The Changelog.
Thank you for tuning in.
If you haven't heard yet, we have launched Changelog++.
It is our membership program that lets you get closer to the metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show
and our other podcasts here on changelog.com.
And if you've never been to changelog.com,
you should go there now.
Again, join Changelog++ to directly support our work
and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners
who get it, Fastly, Linode, and Rollbar. Also, thanks to Breakmaster Cylinder for making all
of our beats. And thank you to you for listening. We appreciate you. That's it for this week.
We'll see you next week. Thank you.