Software at Scale 16 - Nipunn Koorapati: ex-Software Engineer, Dropbox
Episode Date: April 13, 2021. Nipunn Koorapati was a Software Engineer at Dropbox, where he worked on two distinct areas - Developer Productivity and Client Sync. He drove many initiatives, like consolidating various repositories into a server-side monorepo, and was part of a high-leverage project to rewrite the sync engine, a core part of Dropbox's business. I worked with Nipunn in 2020, and we discovered interesting but unsurprising similarities between the software challenges facing Git and Dropbox. We explore some of the reasons why a tech organization might want to consolidate repositories, some of the improvements being developed in Git like partial clones and sparse checkouts, the similarities between Git and Dropbox, how to think about and roll out a massive, business-critical rewrite, and more.
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Welcome to another episode of the Software at Scale podcast.
I have with me Nipun, who is an ex-engineer of Dropbox, just I think a few months out or like it's a few weeks out.
A few days. A few days.
A few days, yeah.
For almost eight years on a bunch of different stuff where he learned,
I think this was your first job out of university.
Correct me if I'm wrong.
Yeah.
And some of the major areas you worked on were, like, the developer productivity area and the sync engine.
So, like, working on rewriting our sync engine from Sync Engine Classic, which you worked
on for a year, to Nucleus, which is, like, the brand new sync engine rewritten in Rust,
which you've written like a bunch of blog posts on.
Anyways, welcome to the show.
And what did I miss from your intro?
Thanks for having me on.
No, I think you nailed it.
Yeah, I've been at Dropbox almost eight years and my last day was
just last week. And I hope I can convince you to join me at my new company, but I'm not going to
bank on it. But the first thing that I want to talk to you about is repository merges. So you've
driven a lot of repository merges. And just for context, Dropbox used to have a bunch of different
repositories for server side and client side code
so I think Go used to be in its own separate repository and there was like a Python repository
and there was like a separate one for developer tools and build specific stuff I think and I'm
sure there's more that I'm missing. And you worked on merging a lot of them into a single repo, and maybe you can tell listeners about why was that important and where did you work on it?
Yeah, so I guess I can tell you the reasons
that it was important at the time that we did it.
So I'm going to talk about the ones that we did five, six years ago, back in 2015.
And maybe you can just walk listeners through what the development environment was and why did we have so many?
Yeah. I mean, so to go back to 2015 at Dropbox, it was a company that probably had 600, 700 people, maybe something like that, and probably a few hundred engineers on that order.
And to be clear, it had grown probably 2x in the previous year and a half, so it was, like, rapidly growing. So it would not be a stretch to say a lot of the developer environment just kind of grew organically, and, like, everything felt a little bit like there were too many people to handle the thing that we had at the time. That was sort of the feel of it, but that was kind of the feel of everything.
So you learned to roll with it.
One of the key initiatives that we really wanted to do
was add some kind of like commit blocking system.
We called it the commit queue,
but essentially we wanted to block commits
from landing on repositories until they pass tests.
So prior to this, like the experience
is, you're supposed to run the test yourself and land the code. But it was getting increasingly
hard to run all of the tests yourself and to not have conflicts with other people landing around
the same time. And so we wanted to automate a lot of this process by adding something we called the
commit queue.
We quickly noticed there was going to be a lot of challenges to this.
We wouldn't want to deploy such a system unless it was good enough that people would be using it happily.
It should remove a problem, not add one.
One of the big things we really had to do was just identify why things would fail if we were to land them, if we were to use something like the commit queue. And we had to go through and identify, like, what all the problems with the tests were, and there were quite a few. Basically, every single day it was someone's full-time job, roughly the daily push rotation (but it was before that), but it was roughly that kind of job, where your job included making sure everything in the build was green. And so the previous day, ideally, it was green at some point, all the tests were passing, and at some point in the last one day the several hundred engineers working probably made it red, because, you know, the validation happened after landing. And so somebody's job was to go figure out what happened and revert things if they went wrong, et cetera. So we had to make this big list of reasons why things would fail.
And common ones like people committed code that was bad.
So that was one.
People, two individual people committed code that individually would have been good, but when they landed on top of each other was bad.
This one would happen.
Sometimes the infrastructure for the testing would fail,
like you would try to run the test, but it couldn't set up the virtual machine to run the
tests on something like that. Sometimes the test itself was flaky. So one really super good concrete
example is a lot of our payments tests, the cash tests at the time would like fail on the last day
of the month. Just because of something bizarre about some API
they're using. But like, it's like this sort of thing existed where there would be tests that
were written, they would pass consistently. And then on the last day of the month, or maybe on
February 29th, or maybe, I don't know, on a month with 31 days on the 31st day, specifically,
it would fail. There would be cases like this. And so we had to also look at things
like that. I have a personal example, which I find really funny. But there was one time where I
added an Easter egg to our internal admin page that would post a little message on the bottom
of the page on certain days of the year, say things like happy birthday to Drew or like,
happy Valentine's Day. But one of the ones I added was happy Mother's Day. But
in the week leading up to Mother's Day, it would give reminders that it's not too late to get
something, you know, to buy something. And in writing the code to do that, I had a bug in my
Python code. And I wrote this code in December. And so on May 1, all of our internal pages crashed
because I had some bug in this code. And so that's like an example.
I had a unit test.
The unit test was passing every day until May 1st.
It broke.
I then wrote a test that iterated over the next 10 years and made sure that it wouldn't break for a long time.
But yeah, these kinds of things creep in.
So you have to find all of these ways that the test would go bad and kind of try to squash them one by one.
You want to get things up to a reasonable bar.
Notably, you don't have to solve the ones where people are making mistakes.
If the individual landing the code made a mistake, it's fine if that fails. It's not a big deal.
You have to find all the ones that will piss people off.
The ones where someone will be mad that their code didn't land because it was not their own fault. They'd be less mad if it was like
bad luck, like another engineer landed something at the same time. They'll be more mad if it's
something unrelated to them, like the VM failed to load or like, you know, just stuff that seemed
really unreasonable. And so we kind of prioritize all these things and tried to solve them one by one.
Yeah, the first thing that comes to mind is: was there something like, you know, GitHub Actions, like a pre-check type thing you could have done, instead of, you know, running tests only after submitting?
Yeah, I mean, this is roughly what we were trying to do. We did have things that would run the tests.
I guess part of what we built was
when we would upload the code to the code review tool,
it would run the test at that point.
But even that's not good enough
because if you run the test there
and then you see that they pass and then you land it,
it can still fail afterwards.
It can also, people can ignore the results there.
And I think at the time,
it wasn't required that those results were passing when you go
to land your code.
Like making that into the requirement is really the meaning of the commit queue.
Okay, so there would be like a status that says like the tests are passing, but they
would still fail because of like, you know, as you said, like conflicts or something in
the test infrastructure.
Or infrastructure.
Yeah.
Okay.
Or test flakiness. We tried to separate test flakiness from test infrastructure.
Okay.
Um, yeah, actually, one concrete
thing is, if there's an infrastructure failure, that's the fault of the team that runs the infrastructure. Yeah, so that's, like, the people making the commit queue, that was, like, me at the time. If there's a flaky test, that's the fault of the test owner. And so being able to distinguish
these two is really critical from an ownership perspective,
because if you quarantine a test for being flaky, but it was flaking because of an infrastructure
problem, then the test owners become kind of, you know, what is it?
They don't believe your quarantines anymore.
They'll just start quarantine without looking.
You just learn not to believe it.
And so being able to distinguish those is really critical.
Yeah.
And quarantine just means like disabling a test from running in CI.
Yeah.
Okay.
Quarantining is where the test infrastructure team turns the test off and then tells the
team that owns the test, hey, something's wrong with your test.
So if you do that and nothing was wrong with the test, just looks bad.
It's a bad look.
You don't want to have the bad look.
Yeah.
So again, going back to the first question then,
so why did you need to merge repositories
to build something like a commit blocking system?
Yeah, let's tie it back together.
So one of the things that was really challenging
was that a lot of the integration tests in particular,
I'll look at Magic Pocket integration tests
as a really concrete example. It had a bunch of services talking to each other. There was, like, one written in Rust, one written in Go. There's, like, metaserver, which is, like, the piece that tied into the rest of Dropbox. And there's, like, a bunch of components that need to work together, and then the integration test would need all of those to be running, and they were all in different repositories. Some of the
repositories were based on the language of the source, which I feel it's kind of arbitrary,
but that was the case. Some of them were based on the project they were coming from.
There was just a lot of like, the reasons for splitting the repositories were really distinct.
And so it felt kind of random where things were. When you ran the test, it would pull all of these repositories to master essentially,
and then try to run all of the tests there.
And so the motivation here was that we wanted to be able to distinctly know and reproduce
exactly which tests we were running, like which combination of commits and all of these
services we were running.
And we kind of looked at this problem from, like, you know, how can we solve this? Like, first principles, what are different ways we can solve this? One of the approaches is to have some kind of global ordering, where when you commit to any of the repositories, there's some way to identify, like, you could, like, have a linear order for all commits regardless of what repo they're in. You could do some Lamport-clock-esque thing where you have, like, a pin for all of the repos and you have some point
in time that you're looking at. There's like, you know, there's ways you can solve this.
All of those require doing additional work on top of the repositories. I don't think it's an
unreasonable solution. It would have been another solution we took. I think we picked the easiest
one because at the time it was just to merge the repositories together. We picked that solution because it gives you a single commit that
you can pin things on. It kind of removes the question of how do you decide what repo to put your
code in? If you're not sure, do you make a new one? Do you shove it into an existing one? It just
kind of removes that problem off the table. And it makes it easier to land code that affects multiple repos
and maybe do code reviews across multiple services at the same time. So we saw a few different
benefits here. I wouldn't say it was like a unanimously happy thing. But I think it was
unanimously felt that we should do something.
So what were some concerns that people had with this?
Yeah, the biggest one, the biggest one by far, in my opinion,
is that the repository would be huge.
And at the time, it was pretty tolerable.
We can do it.
And kind of our mindset was like, if this makes it five years,
we're pretty happy, honestly.
Like company was like six years old at the time,
or seven or something like that.
Like if this thing makes it five years we have succeeded
and that was sort of the goal. I think there was some long-term, like, hope that, by doing this, if there's a problem that's technical, five years is a lot of time to solve technical problems. There is, like, an immediate problem, and we're going to be able to solve it easily in this manner. Some of the concerns: the repository would be huge.
We actually mitigated a lot of that by just cleaning it.
While we were merging them, we just cleaned the garbage out of the repositories.
If there were giant files that didn't need to be there in the history or in the master,
we would kill them.
If there was an absurd amount of history with automated commits and things like that,
we could clean a lot of that out.
We used this tool called BFG, which is this Git tool that you can find to clean out a lot of stuff. Nowadays, there's actually a tool called git filter-repo. It was written in
the last few years. It's pretty solid. So if you're doing something like this, you can clean
history of a repo quite easily. And that stuff is nice. But, you know, if you're going to do a major operation
that requires all of your engineers to reclone the repo, it's a good time to clean stuff out.
So we were able to mitigate a lot of that at the time. But yeah, I think there were long-term concerns about the size and management of the repo.
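For reference, a rough sketch of that kind of cleanup with git filter-repo; the 10 MB threshold and the generated/ path below are hypothetical examples, not Dropbox's actual rules:

```
# git filter-repo wants to operate on a fresh clone.
git clone --no-local /path/to/old-repo cleaned-repo
cd cleaned-repo

# Rewrite history, dropping every blob larger than 10 MB.
git filter-repo --strip-blobs-bigger-than 10M

# A second rewrite to drop a directory of generated junk from history
# (--force because the repo is no longer a fresh clone after the first pass).
git filter-repo --invert-paths --path generated/ --force
```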
And you were at Dropbox for, like, the five years after you
merged most of the repositories,
right? So like, how did that concern play out in practice?
Yeah, I think it did. Like, the repo slowly grew, you know, we tried pretty hard to avoid having
massive numbers of automated commits. That, you know, was kind of lore, knowledge that slowly vanished, and so we started having massive amounts of automated commits.
So that sort of stuff happened.
On the flip side, the tech for Git improved a lot.
Slowly, painfully, you know, with speed bumps and stuff like that.
But the tech for Git itself improved a lot.
And I think the option of being able to switch to another, you know,
repo structure is kind of always there as well. But I do think, yeah, like having everyone have
to deal with all of the files did slowly become an issue. I would like to call out that maybe
there was a benefit that also is hard to notice looking back, because it's very hard to look
at the counterfactual world. It's hard to look at the world of like how would it have evolved if we went the other direction and split into lots of repos that
would have had other challenges and i would note that they're like engineers moving between
different teams working on the server code base was pretty smooth that could have been more
challenging i think it was easier to make code changes that affected, say, a service and all of its call sites, things like that.
If you were running a migration, often that could be done by an individual.
It might take multiple commits and multiple pushes, but it could be done by one person.
And I think we could have culturally gone in a direction where that sort of thing would have required more coordination. Like, for extremely large refactors, you always have to go contact all of the people and coordinate, and for small ones you can do it yourself. But the threshold where you have to switch modes into an organizational mode is currently on the more complex side, and I think it could have been on the more simple side. I'm sorry, is that making sense?
It kind of makes sense.
It's like what you're saying is that it was just easier to refactor when you have like one repository versus when you have separate
and you have to coordinate changes and all of that.
Yeah.
Like if you had to do a refactor, you could change the surface
and then you could find all the call sites
and you could put up code reviews for all the call sites and send them to all their owners. You could probably pull
a lot more off yourself. I'm not saying it's not possible to do that when there's multiple repos.
It totally is, but it becomes a little more of a barrier and you become more likely to,
like, if you have the tool in your toolbox to go contact 10 people and coordinate with a spreadsheet
and you have the tool in your toolbox to send out 10 diffs, when there's a lot of repos
and a lot of complexity and stuff you haven't used before, you're more likely to go with
organizational tool as opposed to the technical tool.
So you would say it's like certainly a net benefit of like merging the repos even five
years later, even though the repository got bigger.
I wouldn't say that was true for everyone,
but I would say it was true for most people.
Okay. So who is it not true for then?
I think if you are someone who only works on a small corner of the repo
and never looks at anything else,
it might not seem like a benefit in the same way.
But it's hard to tell ahead of time, which components are going to
look like that and which ones aren't.
Maybe, you don't have to say, like, by the end of it, but, like, roughly at what size of the repository was the pain, like, real, or, like, did people start complaining about the size of the repository? Like how many files, or anything that you can share, even in, like, aggregate or approximate numbers?
Let's see. It's hard to answer because it's always a spectrum. Like
there's always going to be people who want it to be like this. And it was immediately not like that.
But then I think things, you know, it's like they slowly get a little bit slower.
There's also differences because I know at a few points we made optimizations to the setup. I also
know that people have different machines that they're using and so there's different environments
for example everyone who used Linux just ran into problems way way way way later like it was just a
non-issue for much longer if you're on Linux. Linux's file system is just faster compared to, like, macOS.
Mac or Windows?
Yeah, Mac or Windows.
Okay.
I can kind of try to answer the question.
I don't want to dodge it completely.
I think if you have something like 10,000 files,
you don't notice it all and things are fine.
If you have 100,000, it's like sometimes okay, sometimes not.
If you're at like 500,000, you start to notice it pretty often.
Yeah, I was going to say, I worked on sync for a long time, and so there's actually a huge overlap between the knowledge I got from how to make a good syncing system, which is dealing with lots of files on the file system, and for repositories.
Yeah, I'm going to dive into that, like, a lot soon enough. Like, I'm curious to understand, like, why do you think, you know, Mac was, like, slower
than Linux?
Like you said, the file system was like much worse.
Do you know exactly what went wrong at like 100,000, like 500,000 files?
Like it seems like Git is mostly doing read only operations in most cases.
Yeah.
I think syscalls are a little bit worse on Mac. So that was one. I think maybe people using Linux computers might have just had faster computers, because a lot of them were, like, remote VMs or, like, newer things or desktops. Like, a lot of people had Mac laptops and Linux desktops, so there might be a confounding variable there. I do know that Dropbox put a lot of, like, security software on computers. So there
was a lot of like syscalls being trapped and like, you know, logging information to Splunk
and things like that, that was going on on the Mac machines that was less there on the Linux ones.
So I think there was that factor. We did our best to mitigate that, but it's kind of hard to mitigate that wholly. And then I do think that Mac syscalls are just not quite as fast.
Okay. So should like server development just happen in like Linux, like remote VMs or like
Linux boxes after a while? Maybe this is too broad, but like, what's your perspective on that?
I don't think so. I guess it should happen on the same kind of machine that you're building for. Like, for something like Dropbox, it's often not a choice, right? Like, Dropbox needs to work on Mac, Windows, and Linux, so you kind of have to have things working there. If you're a server developer, I would say, yeah, probably Linux. Like, if you have the luxury to get to pick, I think getting it on some kind of Linux file system is good. That being said, like, there's a reason Mac and Windows are popular.
They have nice UIs.
They have nice IDEs.
Like, you know, I don't think, yeah, I don't think we should, like, sell that short either.
Pros and cons there.
So it's kind of a struggle, like, to pick the right way.
Like, honestly, in my opinion, the best way, like technically speaking,
we should make it fast on all of them.
I don't think, I think if there are hundreds
of thousands of files, like the real ideal solution
is that you don't need all of them, right?
And like, you shouldn't need to pick
which ones you need really explicitly.
Like you should be able to just get the ones you need
with relatively small amount of work.
Yeah.
Yeah, both monorepo and multirepo have this struggle
where you have to know what you're looking for
with a lot of granularity.
I think it's a little bit worse than monorepo.
Okay.
So yeah, what you're saying is, for example,
Git status runs, like, a syscall on every file in your repository.
And ideally, why should it do that?
Because in most cases, things aren't going to change.
And for many parts of the code base, you're never going to touch those files.
Yeah, I believe that Git can be architected.
If we were to rewrite Git, I think it can be architected in such a way that most commands already have the answer pre-computed.
Okay. That's total inspiration from working on sync, like Dropbox itself. If you think about it, it's like when you double click on a file, most of the time it's already there. Like that's what
it does. It sits in the background and does all of the things you're about to do with things.
Yeah. Yeah. So let's talk about both those things then. Cause I know that's the new stuff that
you're working on.
So what has Git done in the last five years?
You said there's been a bunch of improvements in Git
that has made things for monorepos better.
So what are those improvements exactly?
And what's your latest project that you're working on?
Yeah, so a lot of this stuff is super recent,
like the last couple of years.
And I'll quickly like hit on what a few of the topics are and like where they came from.
But yeah, one of them is FS Monitor, which is a hook integration that Git has that was inspired by Facebook's Watchman.
So if you've heard of Watchman, it's the utility that listens to, it basically subscribes to file system updates with the operating system.
So something like inotify on Linux or FS events on Mac, for examples.
But basically, it listens for file system events, and then you can query the Watchman to see what has changed since the last time you queried.
So every time you query, you get a token.
And then the next time you query, you give the token.
It tells you everything has changed since that token. And so Git made a hook
to integrate, it was inspired specifically to integrate with Watchman. And so Git has this hook
that integrates with Watchman, you have to write a little hook to connect the two. And then what Git
will do is when you type Git status in, Git will query Watchman via a hook
and get a list of what has changed.
And it will only stat the things that have changed in between.
So that was one of the big developments.
That was specifically built into Git to work on Git status.
And then over time, we added support for Git diff and Git add and a few other things.
In my opinion, it would be pretty hard to say
that it works for everything it should.
That's a really hard sell to make,
but it definitely works for git status
and for most git diff and git add commands.
Git add, you might think, might not need to touch everything,
but if you do something like git add dot,
that needs to stat everything in order to see what's there.
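As a rough sketch of how this gets wired up today (exact support varies by Git version and platform):

```
# Git 2.35+ ships a built-in file system monitor daemon (macOS and Windows
# first), so Watchman isn't strictly required there:
git config core.fsmonitor true
git config core.untrackedcache true

# The older, hook-based integration described here connects Git to Watchman
# via the sample hook that ships with Git:
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman
git config core.fsmonitor .git/hooks/query-watchman
```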
Yeah, so something like git add dash A or git commit dash AM still needs to stat everything?
Yeah.
Git commit dash am has this nice advantage of only checking things that are in the index
because it doesn't check untracked files.
But as soon as you have to touch untracked files, you got to stat everything.
Okay.
So like, why is like untracked files different if you know?
Yeah.
Yeah. Git has this data structure in the .git directory called the index. And the index is, roughly speaking, a, like, database-ish thing. It's a file, you can kind of think of it as a
database, but it just tracks every file that Git knows about. And so if you add a new file that
Git doesn't know about
and you use a command that's supposed to identify these files,
it's going to have to stat everything to find them.
And by definition, an untracked file wouldn't be in that index.
Yeah, exactly.
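For what it's worth, the index he's describing is easy to peek at directly:

```
# Every path in the index, i.e. every file Git already knows about:
git ls-files | wc -l

# Untracked files are exactly the paths NOT in that list; finding them is what
# forces Git to stat the whole working tree:
git ls-files --others --exclude-standard
```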
Okay, so when you run git add dash A,
it still has to stat everything,
or it definitely has to stat untracked files,
if not everything else.
Yeah, there's a subtle difference, actually.
There's a subtle difference between git commit dash a
and git add dot,
which I thought were the same thing for a long time,
but there's a subtle difference, and if you read the docs carefully,
one of them includes untracked files and the other one doesn't.
So it's important.
Git add dot will add all of your untracked files.
So the git commit will commit them.
Git commit dash a will not include untracked files.
So they do subtly different operations.
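To make the difference concrete, a toy example with made-up file names (the two commits are alternatives, not meant to be run back to back):

```
echo hello > brand-new.txt    # a file Git has never seen (untracked)
echo tweak >> tracked.txt     # an edit to a file Git already tracks

# Alternative 1: "git add ." stages both, so the commit includes brand-new.txt.
git add .
git commit -m "gets the new file too"

# Alternative 2: "git commit -a" only auto-stages files already in the index,
# so brand-new.txt would be left out of the commit.
git commit -a -m "only the tracked edit"
```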
I did not know that.
I'm just glad I don't use git add dot for anything.
Some people learn it that way.
So the thing is, like, if you're, like, a new engineer, trying to explain what the index is, is probably just way too much. Or, like, you probably just memorize git commit dash A, or you memorize git add dot followed by git commit. You just memorize it. And it probably takes you months to years of engineering before you start to even care about what an index is. I mean, illustrated by this conversation, where we're still learning things about Git. But, you know, it's not unreasonable, like, to not know this.
Yeah, yeah, I mean, I didn't know it until, like, a few weeks ago. But that being said, I never used git add dot. I, like, had memorized one of them when I was a young engineer who had just learned what Git was.
Yeah, I just have bash aliases for stuff,
and I don't even know the exact commands sometimes.
Fascinating.
So how much improvement does that make?
And does it help to turn on FSMonitor for all repositories?
Or does it only make sense after a certain scale?
It definitely helps if you have a lot of scale.
I would say that it's hard to even say it has like a 100x improvement
or a 1,000x improvement because the improvement is actually proportional
to how many files you have.
Because what it's doing is it's changing Git status
from being a command that stats everything.
So in Dropbox's server repo, that might be 500,000 files.
And Git status checks untracked files too.
So if you have a build directory with a bunch of tokens in it,
it'll stat all of those too.
So it can be variable depending on how much random build artifacts
you have floating around.
But yeah, Git status moves from that sort of operation
into an operation that only checks what has changed
since the last time you called git status,
which can be a big improvement for a lot of people.
That could go from 20 seconds to 0.1 seconds.
So 20 seconds to 0.1 seconds.
That is amazing.
And most people don't experience it
because you're working in smaller repositories.
I think for comparison,
Chromium or something is like 300,000 files.
Like, I remember benchmarking that, like, a year or two years ago it might have been. But at 500,000, like, statting each file definitely adds up to, like, unbearable times.
It adds up.
Yeah.
You can look into strace, or dtruss on Mac. Dtruss on Mac or strace on Linux.
That was always how I checked.
I would just be like, strace -c git status. And then it just gives you a listing of how often every syscall is called, and you look for lstat, and it'll be lstat: 500,000. That's how you know that you got the slow one.
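Roughly, the check being described, counting syscalls rather than tracing them all (on newer Linux kernels the stat calls may show up as newfstatat or statx rather than lstat):

```
# Linux: print a per-syscall count for everything git status does.
strace -c -f git status

# macOS: dtruss is the rough equivalent (needs root).
sudo dtruss -c git status
```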
So who added these, like, performance or, like, scaling features to Git? Because I remember
like, this thing a long time ago where, like, Facebook sent an email. Like, this is, like, a story, I don't even know how true it is, like, Facebook considered adding stuff to Git and then they decided to just go with Mercurial instead, and that's why they use Mercurial.
Yeah, yeah. Well, let me go to the other ones after FS Monitor, because I think those are the ones I know better. So FS Monitor was the first, like, foray into improving Git itself. Microsoft worked on this thing called VFS for Git, which you might be able to look up. It only works on Windows, but specifically it tries to make it into a virtual file system, where it only fetches the files you need. I don't actually know a ton
of details about it but essentially the idea is that the file system that your files are on is a virtual file system.
And when you go to access a file, it's running some code that will fetch them as you need them.
And so Microsoft actually changed the Git server and the Git clients so that everything would work virtually,
so that the Windows source and the Microsoft Word source would be able to run more effectively.
So they wanted to move some of that technology so that it would work across different platforms. And Apple recently put out a note saying they're just kind of banning kernel extensions. They thought a little bit about going that route on Mac, but they actually went a different route, to augment Git itself to take advantage of a lot
of these features. So one of the key ones is something called partial clone, where the naming, in my
opinion, is quite confusing. So there's a few features that sound like the same thing, but
they're actually subtly different things. But I'll try to call them out. You got partial clone,
which is when you run git clone, you don't fetch the entire history of the repository,
you just fetch kind of, like, metadata, and not all of the blobs of the history. Like, you don't fetch all of the historical files, only the metadata and commit messages and things like that.
So that's partial clone.
There's this thing called sparse checkout, where, unlike clone, checkout is where you take files from the .git directory and shove them actually onto your workspace, into your file system.
So partial checkout is only doing some,
or it's like, I already confused them.
Sparse checkout is only getting some of the directories.
And so sparse checkout limits the directories,
partial clone limits which historical files you fetch.
Like you only fetch metadata
and not the actual files in the history.
And there's this third feature called,
it's when you pass depth equals one to git clone, I believe it's called shallow clone. And that one actually just fetches less
metadata from your history. And so there's these three different like related features. And I think
two of them were built by folks at GitHub and/or Microsoft working together here.
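Spelled out on the command line, the three features look roughly like this with a recent Git; the repository URL and directory names are placeholders:

```
# Partial clone: full history metadata, but file contents (blobs) are fetched
# lazily, only when a checkout actually needs them.
git clone --filter=blob:none https://example.com/big-repo.git

# Sparse checkout: only materialize chosen directories in the working tree.
git sparse-checkout init --cone
git sparse-checkout set services/metaserver tools/build

# Shallow clone: truncate history itself to the most recent commit.
git clone --depth=1 https://example.com/big-repo.git
```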
Okay, so I didn't know about partial clone
and that seems super helpful.
What happens when you like,
I'm guessing you can do something like a partial clone for a repo up to, like, just a year's worth of metadata.
And then what happens if you check out
to like two years back or something?
Yeah, so partial clone landed like a year and a half ago.
It's pretty new.
Like you're going to have to get a pretty new version
of Git to use it,
but it works.
Yeah, it's git clone dash dash filter equals,
and then you have to read the man page.
Filter equals blob none,
but it's got some,
you can find tutorials.
There's a pretty good tutorial on GitLab.
But yeah, the idea is that
you can clone any repository
where the server supports partial clone
and GitHub itself added support
for partial clone in all of its backend servers pretty recently. And that's not a coincidence.
There's folks at GitHub who work on Git. So that means that you can partial clone stuff
from GitHub.
Yeah, but what happens if you check out something that's like older than what you
have?
Yeah, so if you check out something older than what you've partially cloned, and it contains blobs, so blob is the word for a file, a historical file object. If you check out a blob that you don't have, there's this notion of a promisor, which promises to get you the blob, and you have a list of them. And so it'll go check all of your promisors to go fetch the blob dynamically. And so it moves a lot of the logic from the time of git clone
to the time of git checkout.
And it only does it on the checkouts where you actually need it.
Interesting.
So you don't have to worry about, you know,
it erroring out in case you just need to check one file out that's, like, super old.
It kind of does that on the fly.
Yeah, it'll fetch them.
It does have this key downside.
When you clone, you get all of the
files for all of history all at once at the beginning, which means you get really good
compression and kind of like packing. And so when you have partial clone, the packing is naturally
like a little bit worse. And so in order to have good space usage, you need to periodically repack
your repository if you're using something like
partial clone. So this is something that VFS for Git built in. And they actually built it into Git
itself. I believe it landed in March 2021. So like the very, very newest version of Git has this
command called Git maintenance, where you can run background repacks, like kind of like a cron job.
So Git is slowly moving in this direction of having a lot of the pre-computation
and work being done offline,
packing, garbage collection, stuff like that.
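As a sketch of what that looks like (the task names come from the git-maintenance docs and need a new enough Git):

```
# Register this repo and schedule background maintenance jobs.
git maintenance start

# Or run individual tasks by hand:
git maintenance run --task=incremental-repack
git maintenance run --task=commit-graph
git maintenance run --task=prefetch
```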
Interesting. Yeah, I think I saw that
and I just ran Git maintenance just for fun,
even though I have a super tiny repository now
because I'm still interested in these things.
That's fascinating.
I was going to say,
Microsoft has this tool called Scalar,
which their idea is that they implement
a lot of these asynchronous tools into Scalar,
but they're slowly trying to upstream them all into Git.
And they're hoping to completely obsolete Scalar
in favor of just upstreaming everything into Git.
And it seems like it's working.
Just run Git maintenance,
and that will do everything that Scalar does.
But Scalar will be like, you just have to install it on your own and probably run it
and ask your users to run it.
Yeah.
The key challenge that I ran into when deploying improvements to Git was Git often requires
people to run commands in order to, as you said, run Git maintenance.
How do you work around that?
I'm sure for the FS monitor stuff, you have to think about that stuff as well.
Users have to manually configure FS monitor and everything.
Honestly, I think there's a key insight that I learned from working on sync that was helpful
here.
So Dropbox, the product, auto-updates software out to tons of clients, and it's software where, you know, your users aren't going to be running commands. They just need it to work. It is really, really hard to write correct software that has persistence, like deep persistence, on the client side. Like, if it is not easy for you to wipe all of your local state and be okay, it is really hard to write correct
software. And the more state, more historical state that is stored locally, the harder and
harder it gets. For something like Dropbox, Dropbox needs to, in order to know whether or not a,
so it'll know that there's a difference between your local files and the remote files. In order
to be able to tell whether that's something you should upload or download, like to tell whether the difference is something new on
your side or new on their side, you need to store metadata about what you think the server has and
metadata about what you think you have locally. So you can tell the difference. And that's a lot
of state that you're storing locally. And so you have to be very careful about that state. There's
a super good quote that one of my Dropbox colleagues said: a system is not correct if the code is correct. The system is correct if the code is correct and the data is correct. So not only does your code need to be correct, your databases need to be consistent and
look the way that you want. Maybe a concrete way to think about this is that if you introduce a
bug into your code, and it writes some bad data to your database, and then you fix the bug, your system is not yet correct, right? You have to fix the bad database as well. For these, you know, long-term state systems,
it's really hard to do this well.
And I think Git absolutely qualifies.
The index is a long-term local stateful thing.
And I think the new features of Git really exacerbate this
because they're like, let's look at FS Monitor.
FS Monitor, in order for Git status to give you the right answer,
it needs to persist the correct thing
about when the last time
it queried FS monitor was. And then when it gets information from FS monitor, you know, if you hit
control C, like you need to make sure that you either persist the new token and incorporate all
of the new files into your index or not. And if you have a bug there, it means git status is going to show the wrong answer. And it's not like it'll fix itself. It'll just keep showing the wrong answer forever. Like, if you miss a file, you'll never see it. Yeah, and so it is really hard to write this kind of stuff to be super highly correct. So this is why Git is always super careful about having you type in commands. And, like, you know, if you're using a beta feature, you have to run this thing that, like, rewrites your index, because they want you to be involved,
because until they're extremely confident that everything is working perfectly,
you need to be involved. And you need to know that, like, if you hit a bug, like,
it's not the kind of thing where you run git status again and things are okay. You might
have to like clean, you know, clean and restart over. Yeah.
You need like a support channel or something like where people can ask like, oh, it looks like my index got into like a terrible state.
And fortunately, it's okay to like throw away the index and start from scratch.
But of course, you don't want to throw away files or anything like that.
Yeah.
If you know that your local state is clean, like if you know that you haven't been working on it,
then you can throw away your index pretty safely.
But they don't want to automatically throw away the index
because if you are working on something,
then when you automatically throw away the index,
Git can no longer tell.
It couldn't answer the Git status command.
It wouldn't be able to give the correct answer.
Okay.
And it sounds like you have a bunch of stories
from when Dropbox might have messed up people's state on accident.
I don't know how much you can share that.
And this is like just going on your experience working on the Dropbox sync engine, which, as you said, it's like remarkably similar to like improving Git in a way.
Yeah.
Honestly, both of them inspired me to work on the other. Maybe as a bit of an aside, I think that one of the best developer tools engineers is someone who formerly was a developer tools customer. Just, like, by switching between the two, it's like you understand the pain points perfectly. So that's one of the reasons I flip back and forth. But yeah, working on sync,
it's really hard to write a perfectly stateful, good system.
A huge part of it, you want to cover your ass in as many ways as possible. You want to have good tests so that you cover situations that you anticipate. You want to have sorts of tests that
help you uncover situations that you don't anticipate. You want to have recovery mechanisms so that if
you hit situations that you didn't anticipate, you still have a way to handle it well. And all
three of those are important. Like, I think it is super naive to think that for a very stateful system, you can have good test coverage. You can't cover all the cases, it's just there's too many,
because one of your inputs is the state you started in.
And users can mess with your state
in all sorts of terrible ways as well. So you need to make sure you're doing things like checksums or, like, validity checks, I guess, on your existing state, to make sure that it's not malformed by someone, even on accident, right?
Yeah, you want to do validity checks, because, for example, people could have faulty hard
drives, or maybe they were messing with it. But honestly, that's not even the scary part. Like,
if I were to like, look at how many weird states we ended up in, and how many of them were due to
the user being malicious, it's tiny, like, negligibly tiny. And how many are due to the hardware being bad, it's a little bit higher, but also pretty tiny.
Most of them were due to bugs that were in our code for a transient amount of time.
Really obscure bugs usually.
Like you turned your computer on after not being on for several days.
And while it was off, you renamed a folder that was called A and you put it into B.
And at the same time, you took a folder called B and you put it into A. And then, like, it's trying to resolve this situation, and the people writing the software never thought of this situation originally. Or at least, maybe it was thought of now, but it wasn't thought of a year ago. Like, it doesn't matter. Some weird situation like this happens and you write some bad state out. And that bad state is now persisted. And so there's no, like, fix, right? One of the things is, like, software like Dropbox,
or git status, for example, or git status with FS Monitor, is that it's trusting this persisted
state as one of its inputs. And so if the input persisted state is bad, it can be really hard to
figure out what the issue is. So a lot of the
protections you want to put in place. One is if there are cheap protections, like if on the fly,
while you're reading stuff in, you can just check it, you should do that. And if something is wrong
with that input state, you should think carefully about what you want to do. Maybe you want to fix
it on the fly. Maybe it's illustrative of a problem and you just want to crash. In some cases,
fixing it on the fly is reasonable.
In some cases, like you might just be causing another problem because you don't actually truly
understand the problem that you're fixing. And so sometimes it's like more dangerous to fix it.
One of the tools that's really helpful is to build in some kind of more expensive
consistency checker. For example, Dropbox has one that ensures that file IDs are unique,
where, you know, if you have a lot of files, you want to make sure that the identifier for the...
Like if you rename a file, it should have the same identifier in the new location.
And if there's some bug, some complicated, weird bug, and you end up with the same file ID in more than one location, that's super bad.
And fixing it on the fly might be hard to do.
So you have things that run full consistency checks over all of your files to
make sure that every file that's in a directory, that directory also exists. You might have one
with permissions, like every file that's in your folder you have access to, like that should be the
case, but you really want to make sure, you might run ones that validate that. I'm having trouble thinking of more off the top of my head. Yeah, but you want to have things like this. Oh, here's one: every folder should be in a directory, but you want to make sure that that graph, in, like, the computer science sense, doesn't have any cycles. Like, you don't want A to be inside of B and then B to accidentally also be inside of A.
A tree, not a graph.
Yeah, it should be a directed acyclic graph of your files, and, like, there shouldn't be circles. But, you know, that's an invariant that could be broken by a bug pretty easily. So, yes, stuff like that. Those are expensive invariant checkers, but you want to have those built in. And I know Dropbox has them running in the background, I think once a year
for most people, but for
internal employees who get the everyday code release, it runs every day because if there's
a bug that went in today, we want to find out immediately. And so I think stuff like that is
really critical. In my opinion, I don't think Git has enough of this. Like, it doesn't do background internal consistency checks of the index unless you type a command in explicitly. And so when bad state gets in, it's hard to even notice. So I think this could be an area where something like git maintenance really has a lot of benefit. Like, you could opt into automatic background consistency checks. Yeah, I think that kind of stuff is really useful.
Yeah, I really want to know, like, what prompted all of this, like, correctness focus. I mean, it makes total sense to me, like, just some incidents or some bugs that were, like, super hard to, like...
We don't know what the heck is going on. Drop the
database, re-download everything from the server, like restat every single file in the file system.
Like we're going to do this super slow operation to just try to piece things together from scratch.
And it's a slightly lossy operation in that if you're working on things while the operation
is happening, we might lose that. But it is usually, like, that's our hammer. That's, like, we don't know what's going on, we're going to apply this hammer. And so we try not to use that hammer, but basically working on sync is a lot of using this hammer. It's a lot of, like, figuring out what cases that we, you know, come across where we might have to use something like this.
Yeah, what are some good examples?
Like, I'm going to try to think of one
that has a lot to do with persistence locally,
because I know that we had some issues
that are often server-side,
but I think there's a dime a dozen of those.
You hear them from lots of companies,
but I'm going to try to really think of one
that highlights the client-side version.
Okay, this isn't quite client-side,
but I think it'll illustrate the point.
Dropbox has a journal that stores every version of every file, and you append rows to this journal. Every time there's a new version of a file, you'll append a row to the journal, and there's this, like, index that you can use to identify the latest version of a given file. And so at some point
much later into the history of the company, like
10 years into the company, we're like writing a consistency checker. Every file should have
a latest version. Like there's a tombstone when you delete a file. So there's a row that's like,
we call it a tombstone. It's like a row that says like this entry no longer exists.
So we started writing these consistency checkers that just go over all of history and try to figure out like, every file should have a row that is considered the last row for that file, like this index should have something.
And you know, you run this consistency checker, the current code, as far as we can tell, never creates something that has this problem. So we're pretty good about it.
You run the consistency checker, the data is not consistent. Like there are files that have no latest version.
You're like, it only has history.
This is weird.
What's going on?
Yeah, turns out long ago, there was probably a period of time where there was a bug in
the source or maybe an intended behavior, hard to tell.
But at some point in the past, it didn't work this way.
And so you have to go sleuthing to figure out what that is. We actually, I think we used to not write tombstones. And we started writing tombstones
later, but only specifically for, I think it was, like, a shared folder unmount. So if you stopped
being part of a shared folder, we would, it's called unmounting, unmount the shared folder and
not write a tombstone. And then at some point later, we were like, for consistency, we should write a tombstone in this case. So much later, after that
decision was made years after that, we were writing this consistency checker, we saw that
some files don't have a latest row. And so we thought, oh, let's just fix the problem. And we
would go in. So we wrote a fixer that would run the consistency checker and then just make the last row the latest one.
And it turns out that this started just resurrecting.
It would just bring back shared folders that people had left.
And the way we found out about this, I think this is hilarious.
But the way we found out about this is that one engineer's ex-partner, someone they broke up with, they had a folder of pictures with them.
And it reappeared that day.
And they were like, what the heck?
I deleted these for a reason.
Like what's going on?
It was so funny.
Like that's how we found out about it.
And it actually turned out to be a pretty severe thing.
Dropbox is a company that, like, very seriously cares about data loss. Like, we very carefully don't want to delete files unintentionally.
This is a case
of the reverse. It's actually also bad if you thought you deleted it, but you didn't. Yeah.
This was not intentional. So we ended up fixing it and, you know, communicating to the people that
this was due to an issue. Honestly, I'm very appreciative of like, if a company makes a
mistake and then just goes to great lengths to fix the mistake i'm very appreciative of that because having worked on software the mistakes are just an inevitability
they happen so i'm more looking at the response to the mistake rather than the fact that it happened
that's fascinating and i'm guessing there's some part of the code that just like
since there is a latest version of this thing i need to i need to sink it down it that just like, since there is a latest version of this thing, I need to I need to sink it down. It was just like, an expectation that was good if I probably like years before this
consistency check came in and started fixing things. Yeah, so it's not just your code that
has to be correct, your data has to be correct, too. And if you find an inconsistency in your
data, it's probably good to figure out how it got there.
Because just fixing it, you might not actually, you need to like really understand.
Maybe you don't need to perfectly understand how it got there, but you need to understand the consequences of your fix.
Because it might not be as obvious as you think.
And also, yeah, like stakes are important, right?
Like if the stakes are you're bringing back files
people thought they deleted, it's pretty high.
If your stakes are you're bringing back high scores
on Candy Crush Saga that you thought you deleted,
it's probably okay.
Yeah, stuff like that.
People can be pretty serious about their high scores.
An interesting project that I know about is rewriting sync, like, the sync engine, from Python to Rust.
So first of all, how do you pull off a rewrite like that safely? I'm not going to go into like making the decision and all that, but like more like how do you make sure your new code base
actually works for all of those edge cases that have existed for like years and years?
Yeah. Well, you got to know the technical shortcomings
of your old system. I think one really key thing is, like, you've got to respect your old system. Like, it's very easy to talk shit about something that's causing you pain. You also need to understand and respect and fully, like, appreciate what got you there about that system, because you see the pain points, you don't like them. It's the second-system effect. It's like when you build a second version of something, you forget
what made the first system so good. So you want to make sure that you hold that first system with a
lot of respect and really understand all the good things it did well. And for its shortcomings, you
want to make sure that when you solve them, you're not also damaging the stuff that made the original one good.
You want to understand what changed, like when the first one was built to now, maybe some constraints have changed that make different options more optimal.
One of the most major constraints is, like, when the first system is being built, there's, like, usually fewer engineers and usually the stakes are lower. Like, how do I put it, it's like they're okay with the thing making more mistakes, because it's the first version of the thing and they're just kind of trying to get something that sort of works. And so when you look at it later, you have to understand that that's the perspective they came from. Looking at Dropbox, for example, Dropbox was
like, you know, if it worked for most people most of the time, like, that's amazing, and you just want to get it working for more people more of the time. And, like, that's the perspective you have when you're a 30-person company just trying to get customers. When you have a lot of customers that trust you a lot, you have a very different perspective, where you're like, don't poke it, because, like, it's dangerous. Like, if we delete something and it affects a tiny fraction
of people,
that person will get very angry and probably post on Twitter about it or whatever it is.
You want to be a little more careful at that point.
So for rewriting the sync engine, yeah,
lots of respect for the original system.
Understand its shortcomings.
Those are the things you know,
but also understand what's good about it.
So maybe I can get concrete.
Sync Engine Classic, which is what we called it, the shortcomings, which were pretty clear to us, were that it was really hard to build things on top of it.
Particularly, it does a lot of things, right?
It syncs files up, it syncs files down.
It had selective sync on the local side.
It had smart sync where it was dynamically fetching files from the server.
And then it had a bunch of features bolted onto it as well.
Things like the Google integration for Google Docs.
And there are a few other things,
but it had a lot of stuff like attached to it.
And then it was closely tied to the data model of the server.
So this idea of having like a row for each change to a file was pretty key.
And then I think another really fundamental thing is that it tied a file to a path. And so renaming a file was a pretty challenging operation. Representing a rename as a delete plus an add is pretty easy, but Dropbox had grown a lot of features that really wanted to make sure you kept the identity of a file across a rename. For example, if you attach comments to a file, or if the file was put in by someone else, you want the trail of who originally put it in to follow along. And so renames in the data model were pretty tricky to do. Particularly if you rename a directory:
just imagine you have a directory with 10,000 things inside of it. If you have a path-based file model, then renaming a directory with 10,000 objects is something like 20,000 operations. And so there are these fundamental constraints there as well, where I think prior to Nucleus, you just couldn't rename things that had more than a certain number of files in them. If you went to the web and renamed a directory that had too many things in it, it would just say, we can't. And if you did it on the desktop, it would slowly, in a random order, issue 10,000 deletes and 10,000 adds, which is kind of scary, because maybe it deletes 10,000 things and then issues 10,000 adds. So there was a lot of fragility there, and there were a lot of heuristics in Sync Engine Classic to protect against this, because we didn't want every rename to just delete stuff by accident. And that's where the respect comes in: the fundamental data model made it impossible to do this well, but there was a lot of logic in there that made it possible to do it pretty well for most cases. And so you really wanted to respect a lot of that.
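To make the data-model point concrete, here is a minimal Rust sketch, invented for illustration rather than taken from Sync Engine Classic or Nucleus, of why a path-keyed model turns one directory rename into thousands of delete-plus-add operations, while a file-ID model with parent pointers makes it a single metadata update:

    use std::collections::HashMap;

    // Path-keyed model: a file is identified by its full path.
    struct PathModel {
        entries: HashMap<String, Vec<u8>>, // path -> content hash
    }

    impl PathModel {
        // Renaming a directory touches every descendant: a delete plus an add
        // per entry, so a directory of 10,000 files costs ~20,000 operations.
        fn rename_dir(&mut self, old_prefix: &str, new_prefix: &str) -> usize {
            let moved: Vec<String> = self
                .entries
                .keys()
                .filter(|p| p.starts_with(old_prefix))
                .cloned()
                .collect();
            let mut ops = 0;
            for path in moved {
                let content = self.entries.remove(&path).unwrap(); // the delete
                let rest = &path[old_prefix.len()..];
                self.entries.insert(format!("{new_prefix}{rest}"), content); // the add
                ops += 2;
            }
            ops
        }
    }

    // File-ID model: a file has a stable ID and a parent pointer, so renaming
    // a directory is one metadata update, and the identity of everything
    // underneath it (comments, sharing history) survives.
    struct Node {
        parent: Option<u64>,
        name: String,
    }

    struct IdModel {
        nodes: HashMap<u64, Node>,
    }

    impl IdModel {
        fn rename(&mut self, id: u64, new_name: String) {
            if let Some(node) = self.nodes.get_mut(&id) {
                node.name = new_name; // a single operation, regardless of subtree size
            }
        }
    }

The exact structures don't matter; the point is that in the first model the work scales with the size of the subtree, and in the second it doesn't.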
Let me see if I can give an example. If Dropbox had to do a lot of operations on a bunch of files, it would sort all the files by the length of the path. That would make sure that all the directory operations happened before the file operations, so that if you were adding a bunch of directories with files in them, the directories got added before the files. But for deletes, you wanted to do it in the opposite order. So if you have a jumble of adds and a jumble of deletes, and they all have paths, and you want to carefully figure out what order to send things up, there were a lot of heuristics in there to make sure you got it right. And the workload changes, right? You could have added something and then later changed your mind and deleted it, and those show up as two entries, so there were heuristics that squash your queue: if it sees something that's already there, it would squash it in to make sure you do the right thing. It's very easy to mess this stuff up. So that system: pretty fragile, pretty hard to change, but it worked pretty well.
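A rough sketch of that kind of ordering and squashing heuristic, again invented for illustration and far simpler than the real Sync Engine Classic logic, might look like this:

    enum Op {
        Add(String),    // path to add
        Delete(String), // path to delete
    }

    // Order a mixed batch so parents are created before their children,
    // and children are deleted before their parents.
    fn order_ops(ops: Vec<Op>) -> Vec<Op> {
        let mut adds = Vec::new();
        let mut deletes = Vec::new();
        for op in ops {
            match op {
                Op::Add(p) => adds.push(p),
                Op::Delete(p) => deletes.push(p),
            }
        }
        // Adds: shortest paths first, so directories exist before their contents.
        adds.sort_by_key(|p| p.len());
        // Deletes: longest paths first, so contents go before their directories.
        deletes.sort_by_key(|p| std::cmp::Reverse(p.len()));

        adds.into_iter()
            .map(Op::Add)
            .chain(deletes.into_iter().map(Op::Delete))
            .collect()
    }

    // A very simplified version of queue squashing: keep only the most recent
    // pending operation per path. The real heuristics were much more careful,
    // e.g. an add followed by a delete of a brand-new file should cancel out.
    fn squash(ops: Vec<Op>) -> Vec<Op> {
        use std::collections::HashMap;
        let mut latest: HashMap<String, Op> = HashMap::new();
        let mut order: Vec<String> = Vec::new();
        for op in ops {
            let path = match &op {
                Op::Add(p) | Op::Delete(p) => p.clone(),
            };
            if !latest.contains_key(&path) {
                order.push(path.clone());
            }
            latest.insert(path, op);
        }
        order.into_iter()
            .filter_map(|p| latest.remove(&p))
            .collect()
    }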
But, you know, it wasn't going to help us build new things on top of it.
Those are the pain points.
I think I also hit on some of the things we really respected about it and really tried to understand. What it took was people who really understood why the system was built the way it was.
Like, you know, if there was a bug and you lost data, you had to understand exactly what happened.
And that was like a really important thing.
And that was more than just reading the current code.
It was reading historical code.
It was understanding which versions of the code went out and which releases.
It was reading the log files from different releases.
You really need to understand a lot of things in order to be able to do well.
Do you think having a completely new set of engineers trying to rebuild this thing would have been basically a total failure? It was people who were steeped in the existing system for a long time, who understood how it worked, and then them rewriting it over a long period of time, that made sure it could actually succeed. How big a deal was that?
Let me make sure I answer this the right way. My knee-jerk answer is yes: it was really important that you had people who understood the original system. But I think there's a key constraint I'm missing here, because I don't want to discourage just building another version without knowing anything; that's how the original system got built.
I think the key insight here is that if you're building another version of the system, it will solve the most immediate problems, and you'll probably end up with another thing that's kind of like the original system. I think if you tried to build Dropbox from scratch, a lot of people could build something that worked pretty well for most of the cases, and a lot of people could build that without a lot of prior knowledge. But I think you would end up in a similar situation over time, where you make similar mistakes, and maybe some key insights, like the file ID data model versus the path data model, might help you. But it's really hard to foresee a lot of these issues, and it's really hard to foresee a lot of the complexity that comes in. I think deeply understanding your original system, particularly if you're trying to replace it in place, is really important. And for an established company that has a lot of existing customers with existing expectations, I think it's basically table stakes. It's pretty critical.
So what kind of testing or verification or metrics did you monitor when rolling this out incrementally? And how long did the incremental rollout take?
Yeah, these are great questions.
Maybe I want to draw attention to like making the decision
of like whether you should roll it out
or you should add more testing.
And I think it's a very situational decision.
If you're making a brand new product, especially if you don't have product market fit,
absolutely roll it out.
I think this is just universally good advice.
For replacing a system that already has product market fit, I think it
gets harder. Like, I think it gets harder to pick. I think it'd be too easy to say, definitely roll
it out. It'd be too easy to say, definitely make sure you got it perfect before rolling it out.
I think it's pretty situational. A pretty good heuristic is: if you understand what the biggest failure modes are and how to protect against them, it's worth delaying the rollout to do that.
And the biggest failure modes are not every failure mode. They're the ones that you know will cause the SEVs, the ones that lose you customers, the ones that lose you money, the ones that really have a big impact. If you know what those are and you can protect against them ahead of time, those are the ones it's worth delaying your rollout to iron out. I also think: what is your remediation?
Know your remediation story. If your remediation is naturally pretty expensive, and very stateful systems have this problem... Like, for Git, the remediation is wipe it and re-clone. If you have a small repo, no big deal; if you have a big repo, it's painful. And that's a good analogy for Dropbox: a lot of the customers have big situations, so if we roll out something with a bug, we're going to have to send an email that's like, I'm sorry, you're going to have to reinstall, and people will be like, no. Or we'd have to roll out something that did that automatically. But either way, you don't want to put people in that situation. So know your remediation. If your remediation is easy, then roll it out, because you'll find the bug faster that way. And even in a system like this one, we had to pick and choose which features we wanted to do early and which ones we wanted to do later, after rolling out. And it was actually pretty hard to carefully pick and choose these, because a lot of it is carefully weighing what's your remediation, what's the cost if it goes wrong, and making decisions case by case. It was very hard to have a singular mindset. So that was the art: picking and choosing what we wanted to build into it.
so there was like i'm guessing a rollout framework like not like something in code but something in
a document where you said these are the set of expectations we need to have in order to roll out
to a certain percentage yeah i'll give you a pretty concrete example dropbox has a feature
called p2p where if you fetch a block from someone who's on the same network as you, it's faster because you'll fetch it from them directly instead of
from the server. That was a feature that we didn't roll out in the first version that we rolled out
to people. That was an example where we're like, the failure mode is that it's a bit slower.
And the remediation is that it goes to the server and fetches the block. So it's okay.
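A sketch of that failure mode and remediation, with made-up types rather than Dropbox's actual block-fetching API, might look like this: try same-network peers first, and fall back to the server if that fails for any reason.

    // Invented types for illustration; not Dropbox's actual API.
    type BlockHash = [u8; 32];

    trait BlockSource {
        fn fetch(&self, hash: &BlockHash) -> Option<Vec<u8>>;
    }

    // Try same-network peers first (the P2P fast path); if that fails, the
    // remediation is simply to fetch the block from the server.
    fn fetch_block(
        hash: &BlockHash,
        peers: &[Box<dyn BlockSource>],
        server: &dyn BlockSource,
    ) -> Option<Vec<u8>> {
        for peer in peers {
            if let Some(block) = peer.fetch(hash) {
                return Some(block); // fast path
            }
        }
        // Worst case without P2P: a bit slower, but still correct.
        server.fetch(hash)
    }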
But for our biggest customers, that's a huge pain point, especially ones in Australia, since our servers are in California and they're in Australia. So it can be painful.
For those customers, we're like, actually, this is a killer feature for them.
So those are the ones we shouldn't roll it out to.
So we had to make a lot of careful decisions like that. One example: large customers, business customers, came later, and business customers had a larger set of features. And within the business customers, the ones in, say, Australia would have to wait until we finished something like P2P, whereas for some of the other ones we would just roll it out without finishing P2P and add it later. And those are all just judgment calls we had to make case by case.
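Purely as an illustration of that kind of per-cohort judgment call, and not Dropbox's real rollout system, the gating might be sketched like this:

    // Hypothetical rollout gating; the cohort attributes are invented.
    struct Cohort {
        is_business: bool,
        far_from_servers: bool, // e.g. Australia vs. California datacenters
    }

    // Features the new engine has implemented so far.
    struct EngineFeatures {
        p2p: bool,
    }

    // A business customer far from the servers leans hard on P2P, so they
    // wait until that feature lands; other cohorts can get the rewrite earlier.
    fn eligible_for_new_engine(cohort: &Cohort, features: &EngineFeatures) -> bool {
        if cohort.is_business && cohort.far_from_servers {
            features.p2p
        } else {
            true
        }
    }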
Okay. And you mentioned that, you know, both your code and data need to be correct.
And for your data, you had things like verifiers or like consistency checkers that would make
sure that it's not extremely terrible.
But I'm still a little confused about how you make sure you're confident, code-wise, to roll out a new version of the sync engine when there are so many edge cases you could be dealing with. What made you confident to turn it on for, you know, 0.1 percent of people?
Yep, this is also a great question. So we had lots of layers of testing, and we didn't think of testing as, you should write tests because it's good to know if your thing is working. We thought: how do we validate that this thing will work when we replace it? How do we answer that question? And whatever tools are at our disposal to answer that, we're happy to use them. So we went through all the tools: you can write unit tests,
you can write integration tests, you can write randomized tests, you can roll out to a fraction
of the population, you can ask for beta testers, you can roll it out to
internal employees, you can run it in the background and compare results. We went through a whole gamut
of things. And we tried to pick and choose the ones that would help us the most. One of the things I'm
most proud of is that we built a randomized testing framework. It's kind of like a fuzzer, where you're throwing random inputs in, but I think one of the key things here is that, because it's a stateful system, one of the inputs is the state. And so we had this component called Canopy.
Canopy was the trees within Dropbox.
So that we had a local tree and a remote tree and something we called the
synced tree, which was the base between the two.
And the three trees together was called the canopy. This is a fun name. But we had this randomized checker called
canopy check. And it would randomly generate state of the three trees, and not totally random,
like it would really try to be random, but hit weird cases a lot. So random with some bias towards weird cases. For example, like,
if you rename a file locally, then that file is going to exist on the local side and on the remote
side. And we want to make sure we hit that. If the remote side puts B inside of A and the local
side puts A inside of B, we want a randomized checker to find that and figure something out.
Like, essentially, what it would do is it would create three random trees for like the initial state, and then it would just run the
sync engine and make sure that something reasonable happened. In a lot of cases, it's pretty obvious
what the reasonable thing is. But in a lot of cases, like the A inside B, B inside A, or maybe
even a three cycle, A goes in B, B goes in C, C goes in A. Like all three of those operations just got put in at the same time. What
the heck do you do? Concretely, maybe three remotes put those operations at the same time,
something like that. So if something like that is going on, just do something reasonable,
like any reasonable operation. And so the art to randomized testing, in my opinion, is making your validation step at the end of the randomized test as simple as possible. Otherwise the ratio of issues in the validator versus issues in the code gets too high, and you just start to lose faith in it. So the key is to make it as simple as possible and watch the ratio when canopy check fails: what's the probability that it's a real bug versus a validator bug? You want that to be really high.
So the earliest validator we had was just make sure it's synced.
Make sure all three trees are the same. That's not good enough on its own, because you could delete all the files from everywhere and that would be synced. Congrats, you've synced all the files, but pissed off all your customers. So we would add a few more checks on top of that, but you can get arbitrarily complicated, all the way to the level of writing a sync engine again in your validator. So we tried to be pretty simple. Like, if a file is there at the beginning,
make sure it's there at the end, unless it was deleted. So we started with something like that.
And then the "unless it was deleted" part got harder and harder to implement, so we tried to be careful about it. For example, when we built the selective sync feature: if you selectively sync something out, it would not be on the local side, but it would be on the remote side, and it wasn't deleted. And so we had to extend the validator: if it's not deleted or selectively synced out. And so
that's the art, right? I think that check
was cheap enough that it was worth it,
but some checks are too expensive.
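Putting those pieces together, here is a very rough sketch of what a canopy-check-style harness and its simple validators could look like. The types and helper functions are invented stand-ins, not the real canopy check; a real generator, engine hookup, and validator are far more involved.

    use std::collections::{HashMap, HashSet};

    // A tree is just a map from path to content, which is enough for a sketch.
    type Tree = HashMap<String, Vec<u8>>;

    // The three trees the engine reconciles: local, remote, and the synced base.
    struct Canopy {
        local: Tree,
        remote: Tree,
        synced: Tree,
    }

    // Stand-ins for the real pieces. A real generator is biased toward weird
    // cases: same names on both sides, A moved into B while B moves into A,
    // rename cycles, and so on.
    fn random_canopy(_seed: u64) -> Canopy { unimplemented!() }
    fn run_sync(_canopy: &mut Canopy) { unimplemented!() } // engine under test
    fn was_deleted(_path: &str, _after: &Canopy) -> bool { unimplemented!() }
    fn selectively_synced_out(_path: &str, _after: &Canopy) -> bool { unimplemented!() }

    fn canopy_check(seed: u64) {
        let mut canopy = random_canopy(seed);
        let original_paths: HashSet<String> = canopy
            .local
            .keys()
            .chain(canopy.remote.keys())
            .cloned()
            .collect();

        run_sync(&mut canopy);

        // Validator 1: everything converged; the three trees agree.
        assert_eq!(canopy.local, canopy.remote, "did not converge (seed {seed})");
        assert_eq!(canopy.local, canopy.synced, "base out of date (seed {seed})");

        // Validator 2: "converged" alone isn't enough (deleting everything
        // would pass), so also check that files present at the start are still
        // present, unless they were deleted or selectively synced out.
        for path in &original_paths {
            assert!(
                canopy.remote.contains_key(path)
                    || was_deleted(path, &canopy)
                    || selectively_synced_out(path, &canopy),
                "file {path} went missing (seed {seed})"
            );
        }
    }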
One thing that randomized testing
really does, which I love about it,
is that it makes asserts in your code get superpowers.
Asserts suddenly become amazing
because for every assert you write,
that could have been a unit test.
But when you have randomized testing,
suddenly these asserts cover things
that you didn't anticipate.
You can even have expensive asserts.
You can just have them like turned on
in your randomized testing framework. You could have an assert that like consistency checks an
entire data structure or does something that would be unreasonable to do at runtime. So that kind of
stuff is very cool. Asserts get superpowers. That's so interesting. And like, you can just
check for something like does the sync engine crash when trying to sync stuff?
And the simple answer is like,
the sync engine should never crash.
Yeah.
Right?
Yeah.
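To illustrate what an "expensive assert" could look like, here is a hypothetical example; the feature flag and the data structure are made up, but the idea is a whole-structure consistency check that would be unreasonable in production yet cheap enough to run after every step under a randomized tester.

    use std::collections::HashMap;

    // A toy index that keeps two maps which must always mirror each other.
    #[derive(Default)]
    struct FileIndex {
        by_id: HashMap<u64, String>,   // file id -> path
        by_path: HashMap<String, u64>, // path -> file id
    }

    impl FileIndex {
        // Walks both maps to confirm they mirror each other exactly.
        fn check_consistency(&self) {
            assert_eq!(self.by_id.len(), self.by_path.len());
            for (id, path) in &self.by_id {
                assert_eq!(self.by_path.get(path), Some(id));
            }
        }

        fn insert(&mut self, id: u64, path: String) {
            self.by_id.insert(id, path.clone());
            self.by_path.insert(path, id);
            // Under the randomized tester (a made-up cfg flag here), verify the
            // whole structure after every mutation; in production this would be
            // unreasonably slow, so it compiles away.
            #[cfg(feature = "randomized-testing")]
            self.check_consistency();
        }
    }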
And that gives a lot of power to engineers. But do you feel like engineers got lazy when they had this in place? Or do you think that was just one layer in the list of things you needed to do in order to test things? And how often did the randomized checker actually catch stuff?
I would say that it got disproportionate value upfront. Like when we first wrote it,
it caught tons and tons of stuff. And then over time, less and less, and fewer and further between. And sometimes it would find something really obscure, like it found one where the file name was longer than a certain length. Or it might
find one where like the same file has two of the same block in it. And we tried to make these
obscure situations more common, just like by tweaking the way that the random trees were
generated. But that stuff is, it's hard to do because you can only cover what you anticipate
there. But I think inherently randomized testing covers stuff that you didn't anticipate. And I think
that's a wonderful thing about it. It's never going to cover everything. And it's always
important to have validation afterwards as well. So we also ran stuff in the background and did other things like that. Let me answer your question about what it did for engineers.
I think it made the barrier to entry to the code base,
or sorry, it made the barrier to landing your first diff
a little bit higher,
but I think it made the barrier to entry a little bit lower.
And this is kind of a weird concept
because it meant that it took longer to land your first diff.
So it felt like the barrier was higher,
but I compare it to my experience on Sync Engine Classic. On Sync Engine Classic it was easy to land your first diff, but it was also really likely that your first diff caused a giant bug. You couldn't assign a new engineer a thing that was in a risky area of the code; it was just too dangerous. If something went wrong, the effort it took to fix the thing that went wrong was 100x higher than actually doing the change. So it was just super dangerous to let new engineers near anything important.
And I think one of the cool things is that randomized testing just flipped that,
like new engineers, you could let them near something important. I actually thought, and I've had this corroborated by a few people, that canopy check taught people how the code worked. You would go to make a change where maybe you added something to CanopyCheck,
some kind of validation.
Let me think of a concrete example.
We added randomized xattrs (extended attributes) on files to the randomized tester.
And there was a new engineer who I was mentoring, and that was one of their intro projects.
When they added it, CanopyCheck started failing.
And it was like, hmm. And they started
investigating and part of it was because they added it in the wrong place in the code. But in
order to figure that out, they had to understand the code and understand the validator. And so
they just started learning about the system, the system would tell them what was wrong.
And then finally, it turned out that they actually caught a real bug, where I think if an xattr had two of the same blocks in it, a super obscure condition, it would crash the engine or something. We found a real bug and we fixed it. So I thought it was really illustrative in
that way. And then also, not for the new engineer but for the medium engineer taking their first major project: maybe someone hasn't added a feature before.
They've only done bug fixes, and they're tasked with adding a feature.
That's much harder.
Something like CanopyCheck, you go to add the feature, you do your first diff.
It's kind of a big diff that touches a bunch of areas.
You think you got it, but maybe you missed something.
You run CanopyCheck.
It fails.
The failure tells you a test case that you failed on. You read the logs of that test case really carefully, you start to understand exactly what's going on and why this is here, and then you can fix it. So I thought in that way it was really enabling. But I would not pretend to say that it makes it easy to land your first diff; it's complicated. That's why I think you want to use it in systems
where it brings a lot of value.
You wouldn't want to use it in systems
where the cost of moving slower isn't worth the benefit of being more correct.
Yeah, that makes a lot of sense.
And it reminds me of the value
of a really good integration test
in a world where it makes sense to write one of those,
whereas with the sync engine, it totally makes sense that you can't integration test for
everything; that just doesn't fly.
People can have all sorts of weird file systems and just weird file system setups.
Yeah.
And it's not even that. I mean, there were also a lot of weird file systems.
I'm not going to lie about that.
But even just like weird things the user can do.
Like I just talked about A going in B, B going in C, C going in A.
For every feature Dropbox has, there's some weird thing like that,
that you're like, who the hell would do that?
But then you think about it for a moment.
You're like, oh, there's actually like 10 people simultaneously working on this folder.
Like this could totally happen.
And so that kind of stuff you end up thinking about, particularly in company context,
if you have 10 people on a team, or like 100 people who are all in a shared folder together,
even if it's an obscure action that happens very rarely, if one person does it, it affects everyone.
And so sometimes it can be a big deal. Like if your obscure bug accidentally clears the permissions on a folder, suddenly everyone doesn't have permissions on the folder. It's a big deal. So often things like, I don't know, not that many
customers use Linux. So maybe this obscure bug is okay to leave on Linux. That was often not a good
enough answer because if you're at a company with one Linux user and they hit the
obscure bug, it affects everyone. And so the calculus of that changes a bit.
That makes sense. I think this is like a pretty good stopping point. Do you have something for
listeners, you know, just venturing out to their first code base, or engineers just starting off? Do you have any advice for listeners like that?
Yeah, I think don't be afraid of making mistakes. And honestly, this is a really hard one to internalize.
Like, I think everyone's going to tell you this,
but it's really hard to internalize
because you're afraid of making mistakes.
Like, it's just in you.
Like, it's going to be there.
Like, you're going to go try to put your code in, but you're like afraid that the code is going to mess
something up. It's okay. A good mentor will reassure you that this is the case, but often
as a mentor, it can be pretty hard to remember what that fear is like.
And so I think it's just really important, as a new engineer: don't be afraid to make mistakes. If you make a mistake, this is a good thing. Just remember, this is how you learn, and the mistake is not your fault, even if there's something you could have done to prevent it. Software is not about blaming the person who wrote the code that made the mistake. That's just a bad way to live life, honestly.
But yeah, try to make the mistake if you can, right? How do I put it? You don't want to do it on purpose, but what you want to do is put yourself in situations where you think, I'm not good enough to understand what's going on here. And then you make the mistake, and then you become good enough to understand what's going on there. And the mistake is not your fault. It's probably honestly fine. Probably everyone around you is having a good time debugging it together. It's okay.
Yeah, that's a good tip.
Yeah, you should basically stretch yourself and not be scared of making large changes, because that's the best way to learn. You learn through code review and debugging things and going on call and putting out a fire from time to time.
Yeah. And maybe the converse advice: if you're mentoring a new engineer, it can be really hard to remember what it was like being in that situation. Really, really encourage mistake-making, encourage taking risks, encourage putting in that diff. Like, you see a bug, you're not sure why this line of code is here, you think it might not need to be here, but you're afraid of deleting it. For someone like a new engineer, I feel like taking that risk, that's how you learn. Maybe something breaks, maybe you feel bad for a moment, but just reassure them.
It's okay.
That's why we do this stuff.
Cool.
Well, thank you, Nipun, for being a guest.
I think this was a lot of fun
and people are going to learn a bunch
from all of your experiences.
Thanks for having me.