Software at Scale 16 - Nipunn Koorapati: ex-Software Engineer, Dropbox
Episode Date: April 13, 2021. Nipunn Koorapati was a Software Engineer at Dropbox, where he worked on two distinct areas - Developer Productivity and Client Sync. He drove many initiatives, like consolidating various repositories into a server-side monorepo, and was part of a high-leverage project to rewrite the sync engine, a core part of Dropbox's business. I worked with Nipunn in 2020, and we discovered interesting but unsurprising similarities between the software challenges facing Git and Dropbox. We explore some of the reasons why a tech organization might want to consolidate repositories, some of the improvements being developed in Git like partial clones and sparse checkouts, the similarities between Git and Dropbox, how to think about and roll out a massive, business-critical rewrite, and more.
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Welcome to another episode of the Software at Scale podcast.
I have with me Nipun, who is an ex-engineer of Dropbox, just I think a few months out or like it's a few weeks out.
A few days. A few days.
A few days, yeah.
For almost eight years on a bunch of different stuff where he learned,
I think this was your first job out of university.
Correct me if I'm wrong.
Yeah.
And some of the major areas you worked on were, like, the developer productivity area and the sync engine.
So, like, working on rewriting our sync engine from Sync Engine Classic, which you worked
on for a year, to Nucleus, which is, like, the brand new sync engine rewritten in Rust,
which you've written like a bunch of blog posts on.
Anyways, welcome to the show.
And what did I miss from your intro?
Thanks for having me on.
No, I think you nailed it.
Yeah, I've been at Dropbox almost eight years and my last day was
just last week. And I hope I can convince you to join me at my new company, but I'm not going to
bank on it. But the first thing that I want to talk to you about is repository merges. So you've
driven a lot of repository merges. And just for context, Dropbox used to have a bunch of different
repositories for server side and client side code
so I think Go used to be in its own separate repository and there was like a Python repository
and there was like a separate one for developer tools and build specific stuff I think and I'm
sure there's more that I'm missing. And you worked on merging a lot of them into a single repo, and maybe you can tell listeners about why was that important and where did you work on it?
Yeah, so I guess I can tell you the reasons
that it was important at the time that we did it.
So I'm going to talk about the ones that we did five, six years ago, back in 2015.
And maybe you can just walk listeners through what the development environment was and why did we have so many?
Yeah. I mean, so to go back to 2015 at Dropbox, it was a company that probably had 600, 700 people, maybe something like that, and probably a few hundred engineers on that order.
And to be clear, it had grown probably 2x in the previous year and a half, so it was, like, rapidly growing. So it would not be a stretch to say a lot of the developer environment just kind of grew organically, and, like, everything felt a little bit like there were too many people to handle the thing that we had at the time. That was sort of the feel of it, but that was kind of the feel of everything.
So you learned to roll with it.
One of the key initiatives that we really wanted to do
was add some kind of like commit blocking system.
We called it the commit queue,
but essentially we wanted to block commits
from landing on repositories until they pass tests.
So prior to this, like the experience
is, you're supposed to run the test yourself and land the code. But it was getting increasingly
hard to run all of the tests yourself and to not have conflicts with other people landing around
the same time. And so we wanted to automate a lot of this process by adding something we called the
commit queue.
We quickly noticed there was going to be a lot of challenges to this.
We wouldn't want to deploy such a system unless it was good enough that people would be using it happily.
It should remove a problem, not add one.
One of the big things we really had to do was just identify why things would fail if we were to land them, if we were to use something like the commit queue. And we had to go through and identify, like, what all the problems with the tests were, and there were quite a few. Basically, every single day it was someone's full-time job, roughly the daily push rotation (but it was before that), but it was roughly that kind of job, where your job included making sure everything in the build was green. And so the previous day, ideally, it was green at some point, all the tests were passing, and at some point in the last one day the several hundred engineers working probably made it red, because, you know, the validation happened after landing. And so somebody's job was to go figure out what happened and revert things if they went wrong, et cetera. So we had to make this big list of reasons why things would fail.
And common ones like people committed code that was bad.
So that was one.
People, two individual people committed code that individually would have been good, but when they landed on top of each other was bad.
This one would happen.
Sometimes the infrastructure for the testing would fail,
like you would try to run the test, but it couldn't set up the virtual machine to run the
tests on something like that. Sometimes the test itself was flaky. So one really super good concrete
example is a lot of our payments tests, the cash tests at the time would like fail on the last day
of the month. Just because of something bizarre about some API
they're using. But like, it's like this sort of thing existed where there would be tests that
were written, they would pass consistently. And then on the last day of the month, or maybe on
February 29th, or maybe, I don't know, on a month with 31 days on the 31st day, specifically,
it would fail. There would be cases like this. And so we had to also look at things
like that. I have a personal example, which I find really funny. But there was one time where I
added an Easter egg to our internal admin page that would post a little message on the bottom
of the page on certain days of the year, say things like happy birthday to Drew or like,
happy Valentine's Day. But one of the ones I added was happy Mother's Day. But
in the week leading up to Mother's Day, it would give reminders that it's not too late to get
something, you know, to buy something. And in writing the code to do that, I had a bug in my
Python code. And I wrote this code in December. And so on May 1, all of our internal pages crashed
because I had some bug in this code. And so that's like an example.
I had a unit test.
The unit test was passing every day until May 1st.
It broke.
I then wrote a test that iterated over the next 10 years and made sure that it wouldn't break for a long time.
But yeah, these kinds of things creep in.
So you have to find all of these ways that the test would go bad and kind of try to squash them one by one.
You want to get things up to a reasonable bar.
Notably, you don't have to solve the ones where people are making mistakes.
If the individual landing the code made a mistake, it's fine if that fails. It's not a big deal.
You have to find all the ones that will piss people off.
The ones where someone will be mad that their code didn't land because it was not their own fault. They'd be less mad if it was like
bad luck, like another engineer landed something at the same time. They'll be more mad if it's
something unrelated to them, like the VM failed to load or like, you know, just stuff that seemed
really unreasonable. And so we kind of prioritize all these things and tried to solve them one by one.
Yeah, the first thing that comes to mind is: was there something like, you know, GitHub Actions, like a pre-check type thing you could have done, instead of, you know, running tests only after submitting?
Yeah, I mean, this is roughly what we were trying to do. We did have things that would run the tests.
I guess part of what we built was
when we would upload the code to the code review tool,
it would run the test at that point.
But even that's not good enough
because if you run the test there
and then you see that they pass and then you land it,
it can still fail afterwards.
It can also, people can ignore the results there.
And I think at the time,
it wasn't required that those results were passing when you go
to land your code.
Like making that into the requirement is really the meaning of the commit queue.
Okay, so there would be like a status that says like the tests are passing, but they
would still fail because of like, you know, as you said, like conflicts or something in
the test infrastructure.
Or infrastructure.
Yeah.
Okay.
Or test flakiness. We tried to separate test flakiness from test infrastructure.
Okay.
Um, yeah, actually, one concrete
thing is, if there's an infrastructure failure, that's the fault of the team that runs the infrastructure. Yeah, so that's, like, the people making the commit queue, that was, like, me at the time. If there's a flaky test, that's the fault of the test owner. And so being able to distinguish
these two is really critical from an ownership perspective,
because if you quarantine a test for being flaky, but it was flaking because of an infrastructure
problem, then the test owners become kind of, you know, what is it?
They don't believe your quarantines anymore.
They'll just start quarantine without looking.
You just learn not to believe it.
And so being able to distinguish those is really critical.
Yeah.
And quarantine just means like disabling a test from running in CI.
Yeah.
Okay.
Quarantining is where the test infrastructure team turns the test off and then tells the
team that owns the test, hey, something's wrong with your test.
So if you do that and nothing was wrong with the test, just looks bad.
It's a bad look.
You don't want to have the bad look.
Yeah.
So again, going back to the first question then,
so why did you need to merge repositories
to build something like a commit blocking system?
Yeah, let's tie it back together.
So one of the things that was really challenging
was that a lot of the integration tests in particular,
I'll look at Magic Pocket integration tests
as a really concrete example. It had a bunch of services talking to each other. There was, like, one written in Rust, one written in Go. There's, like, metaserver, which is, like, the piece that tied into the rest of Dropbox. And there's, like, a bunch of components that need to work together, and then the integration test would need all of those to be running, and they were all in different repositories. Some of the
repositories were based on the language of the source, which I feel it's kind of arbitrary,
but that was the case. Some of them were based on the project they were coming from.
There was just a lot of like, the reasons for splitting the repositories were really distinct.
And so it felt kind of random where things were. When you ran the test, it would pull all of these repositories to master essentially,
and then try to run all of the tests there.
And so the motivation here was that we wanted to be able to distinctly know and reproduce
exactly which tests we were running, like which combination of commits and all of these
services we were running.
And we kind of looked at this problem from, like, you know, how can we solve this? Like, first principles, what are different ways we can solve this? One of the approaches is to have some kind of global ordering, where when you commit to any of the repositories, there's some way to identify, like, you could, like, have a linear order for all commits regardless of what repo they're in. You could do some Lamport-clock-esque thing where you have, like, a pin for all of the repos and you have some point
in time that you're looking at. There's like, you know, there's ways you can solve this.
All of those require doing additional work on top of the repositories. I don't think it's an
unreasonable solution. It would have been another solution we took. I think we picked the easiest
one because at the time it was just to merge the repositories together. We picked that solution because it gives you a single commit that
you can pin things on. It kind of removes the question of how do you decide what repo to put your
code in? If you're not sure, do you make a new one? Do you shove it into an existing one? It just
kind of removes that problem off the table. And it makes it easier to land code that affects multiple repos
and maybe do code reviews across multiple services at the same time. So we saw a few different
benefits here. I wouldn't say it was like a unanimously happy thing. But I think it was
unanimously felt that we should do something.
So what were some concerns that people had with this?
Yeah, the biggest one, the biggest one by far, in my opinion,
is that the repository would be huge.
And at the time, it was pretty tolerable.
We can do it.
And kind of our mindset was like, if this makes it five years,
we're pretty happy, honestly.
Like company was like six years old at the time,
or seven or something like that.
Like if this thing makes it five years we have succeeded
and that was sort of the goal. I think there was some long-term, like, hope that, by doing this, if there's a problem that's technical, five years is a lot of time to solve technical problems. There is, like, an immediate problem, and we're going to be able to solve it easily in this manner. Some of the concerns: the repository would be huge.
We actually mitigated a lot of that by just cleaning it.
While we were merging them, we just cleaned the garbage out of the repositories.
If there were giant files that didn't need to be there in the history or in the master,
we would kill them.
If there was an absurd amount of history with automated commits and things like that,
we could clean a lot of that out.
We used this tool called BFG, which is this Git tool that you can find to clean out a lot of stuff. Nowadays, there's actually a tool called git filter-repo. It was written in
the last few years. It's pretty solid. So if you're doing something like this, you can clean
history of a repo quite easily. And that stuff is nice. But, you know, if you're going to do a major operation
that requires all of your engineers to reclone the repo, it's a good time to clean stuff out.
So we were able to mitigate a lot of that at the time. But yeah, I think there were long-term concerns about the size and management of the repo.
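For reference, a rough sketch of that kind of cleanup with git filter-repo; the 10 MB threshold and the generated/ path below are hypothetical examples, not Dropbox's actual rules:

```
# git filter-repo wants to operate on a fresh clone.
git clone --no-local /path/to/old-repo cleaned-repo
cd cleaned-repo

# Rewrite history, dropping every blob larger than 10 MB.
git filter-repo --strip-blobs-bigger-than 10M

# A second rewrite to drop a directory of generated junk from history
# (--force because the repo is no longer a fresh clone after the first pass).
git filter-repo --invert-paths --path generated/ --force
```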
And you were at Dropbox for, like, the five years after you
merged most of the repositories,
right? So like, how did that concern play out in practice?
Yeah, I think it did. Like, the repo slowly grew, you know, we tried pretty hard to avoid having
massive numbers of automated commits. That, you know, was kind of lore, knowledge that slowly vanished, and so we started having massive amounts of automated commits.
So that sort of stuff happened.
On the flip side, the tech for Git improved a lot.
Slowly, painfully, you know, with speed bumps and stuff like that.
But the tech for Git itself improved a lot.
And I think the option of being able to switch to another, you know,
repo structure is kind of always there as well. But I do think, yeah, like having everyone have
to deal with all of the files did slowly become an issue. I would like to call out that maybe
there was a benefit that also is hard to notice looking back, because it's very hard to look
at the counterfactual world. It's hard to look at the world of like how would it have evolved if we went the other direction and split into lots of repos that
would have had other challenges and i would note that they're like engineers moving between
different teams working on the server code base was pretty smooth that could have been more
challenging i think it was easier to make code changes that affected, say, a service and all of its call sites, things like that.
If you were running a migration, often that could be done by an individual.
It might take multiple commits and multiple pushes, but it could be done by one person.
And I think we could have culturally gone in a direction where that sort of thing would have required more coordination. Like, for extremely large refactors, you always have to go contact all of the people and coordinate, and for small ones you can do it yourself. But the threshold where you have to switch modes into an organizational mode is currently on the more complex side, and I think it could have been on the more simple side. I'm sorry, is that making sense?
It kind of makes sense.
It's like what you're saying is that it was just easier to refactor when you have like one repository versus when you have separate
and you have to coordinate changes and all of that.
Yeah.
Like if you had to do a refactor, you could change the surface
and then you could find all the call sites
and you could put up code reviews for all the call sites and send them to all their owners. You could probably pull
a lot more off yourself. I'm not saying it's not possible to do that when there's multiple repos.
It totally is, but it becomes a little more of a barrier and you become more likely to,
like, if you have the tool in your toolbox to go contact 10 people and coordinate with a spreadsheet
and you have the tool in your toolbox to send out 10 diffs, when there's a lot of repos
and a lot of complexity and stuff you haven't used before, you're more likely to go with
organizational tool as opposed to the technical tool.
So you would say it's like certainly a net benefit of like merging the repos even five
years later, even though the repository got bigger.
I wouldn't say that was true for everyone,
but I would say it was true for most people.
Okay. So who is it not true for then?
I think if you are someone who only works on a small corner of the repo
and never looks at anything else,
it might not seem like a benefit in the same way.
But it's hard to tell ahead of time, which components are going to
look like that and which ones aren't.
Maybe, you don't have to say, like, by the end of it, but, like, roughly at what size of the repository was the pain, like, real, or, like, did people start complaining about the size of the repository? Like how many files, or anything that you can share, even in, like, aggregate or approximate numbers?
Let's see. It's hard to answer because it's always a spectrum. Like
there's always going to be people who want it to be like this. And it was immediately not like that.
But then I think things, you know, it's like they slowly get a little bit slower.
There's also differences because I know at a few points we made optimizations to the setup. I also
know that people have different machines that they're using and so there's different environments
for example everyone who used Linux just ran into problems way way way way later like it was just a
non-issue for much longer if you're on Linux. Linux's file system is just faster compared to, like, macOS.
Mac or Windows?
Yeah, Mac or Windows.
Okay.
I can kind of try to answer the question.
I don't want to dodge it completely.
I think if you have something like 10,000 files,
you don't notice it all and things are fine.
If you have 100,000, it's like sometimes okay, sometimes not.
If you're at like 500,000, you start to notice it pretty often.
Yeah, I was going to say, I worked on sync for a long time, and so there's actually a huge overlap between the knowledge I got from how to make a good syncing system, which is dealing with lots of files on the file system, and for repositories.
Yeah, I'm going to dive into that, like, a lot soon enough. Like, I'm curious to understand, like, why do you think, you know, Mac was, like, slower
than Linux?
Like you said, the file system was like much worse.
Do you know exactly what went wrong at like 100,000, like 500,000 files?
Like it seems like Git is mostly doing read only operations in most cases.
Yeah.
I think syscalls are a little bit worse on Mac. So that was one. I think maybe people using Linux computers might have just had faster computers, because a lot of them were, like, remote VMs or, like, newer things or desktops. Like, a lot of people had Mac laptops and Linux desktops, so there might be a confounding variable there. I do know that Dropbox put a lot of, like, security software on computers. So there
was a lot of like syscalls being trapped and like, you know, logging information to Splunk
and things like that, that was going on on the Mac machines that was less there on the Linux ones.
So I think there was that factor. We did our best to mitigate that, but it's kind of hard to mitigate that wholly. And then I do think that Mac syscalls are just not quite as fast.
Okay. So should like server development just happen in like Linux, like remote VMs or like
Linux boxes after a while? Maybe this is too broad, but like, what's your perspective on that?
I don't think so. I guess it should happen on the same kind of machine that you're building for. Like, for something like Dropbox, it's often not a choice, right? Like, Dropbox needs to work on Mac, Windows, and Linux, so you kind of have to have things working there. If you're a server developer, I would say, yeah, probably Linux. Like, if you have the luxury to get to pick, I think getting it on some kind of Linux file system is good. That being said, like, there's a reason Mac and Windows are popular.
They have nice UIs.
They have nice IDEs.
Like, you know, I don't think, yeah, I don't think we should, like, sell that short either.
Pros and cons there.
So it's kind of a struggle, like, to pick the right way.
Like, honestly, in my opinion, the best way, like technically speaking,
we should make it fast on all of them.
I don't think, I think if there are hundreds
of thousands of files, like the real ideal solution
is that you don't need all of them, right?
And like, you shouldn't need to pick
which ones you need really explicitly.
Like you should be able to just get the ones you need
with relatively small amount of work.
Yeah.
Yeah, both monorepo and multirepo have this struggle
where you have to know what you're looking for
with a lot of granularity.
I think it's a little bit worse than monorepo.
Okay.
So yeah, what you're saying is, for example,
Git status runs, like, a syscall on every file in your repository.
And ideally, why should it do that?
Because in most cases, things aren't going to change.
And for many parts of the code base, you're never going to touch those files.
Yeah, I believe that Git can be architected.
If we were to rewrite Git, I think it can be architected in such a way that most commands already have the answer pre-computed.
Okay. That's total inspiration from working on sync, like Dropbox itself. If you think about it, it's like when you double click on a file, most of the time it's already there. Like that's what
it does. It sits in the background and does all of the things you're about to do with things.
Yeah. Yeah. So let's talk about both those things then. Cause I know that's the new stuff that
you're working on.
So what has Git done in the last five years?
You said there's been a bunch of improvements in Git
that has made things for monorepos better.
So what are those improvements exactly?
And what's your latest project that you're working on?
Yeah, so a lot of this stuff is super recent,
like the last couple of years.
And I'll quickly like hit on what a few of the topics are and like where they came from.
But yeah, one of them is FS Monitor, which is a hook integration that Git has that was inspired by Facebook's Watchman.
So if you've heard of Watchman, it's the utility that listens to, it basically subscribes to file system updates with the operating system.
So something like inotify on Linux or FS events on Mac, for examples.
But basically, it listens for file system events, and then you can query the Watchman to see what has changed since the last time you queried.
So every time you query, you get a token.
And then the next time you query, you give the token.
It tells you everything has changed since that token. And so Git made a hook
to integrate, it was inspired specifically to integrate with Watchman. And so Git has this hook
that integrates with Watchman, you have to write a little hook to connect the two. And then what Git
will do is when you type Git status in, Git will query Watchman via a hook
and get a list of what has changed.
And it will only stat the things that have changed in between.
So that was one of the big developments.
That was specifically built into Git to work on Git status.
And then over time, we added support for Git diff and Git add and a few other things.
In my opinion, it would be pretty hard to say
that it works for everything it should.
That's a really hard sell to make,
but it definitely works for git status
and for most git diff and git add commands.
Git add, you might think, might not need to touch everything,
but if you do something like git add dot,
that needs to stat everything in order to see what's there.
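As a rough sketch of how this gets wired up today (exact support varies by Git version and platform):

```
# Git 2.35+ ships a built-in file system monitor daemon (macOS and Windows
# first), so Watchman isn't strictly required there:
git config core.fsmonitor true
git config core.untrackedcache true

# The older, hook-based integration described here connects Git to Watchman
# via the sample hook that ships with Git:
cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman
git config core.fsmonitor .git/hooks/query-watchman
```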
Yeah, so something like git add dash A or git commit dash AM still needs to stat everything?
Yeah.
Git commit dash am has this nice advantage of only checking things that are in the index
because it doesn't check untracked files.
But as soon as you have to touch untracked files, you got to stat everything.
Okay.
So like, why is like untracked files different if you know?
Yeah.
Yeah. Git has this data structure in the .git directory called the index. And the index is, roughly speaking, a, like, database-ish thing. It's a file, you can kind of think of it as a
database, but it just tracks every file that Git knows about. And so if you add a new file that
Git doesn't know about
and you use a command that's supposed to identify these files,
it's going to have to stat everything to find them.
And by definition, an untracked file wouldn't be in that index.
Yeah, exactly.
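For what it's worth, the index he's describing is easy to peek at directly:

```
# Every path in the index, i.e. every file Git already knows about:
git ls-files | wc -l

# Untracked files are exactly the paths NOT in that list; finding them is what
# forces Git to stat the whole working tree:
git ls-files --others --exclude-standard
```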
Okay, so when you run git add dash A,
it still has to stat everything,
or it definitely has to stat untracked files,
if not everything else.
Yeah, there's a subtle difference, actually.
There's a subtle difference between git commit dash a
and git add dot,
which I thought were the same thing for a long time,
but there's a subtle difference, and if you read the docs carefully,
one of them includes untracked files and the other one doesn't.
So it's important.
Git add dot will add all of your untracked files.
So the git commit will commit them.
Git commit dash a will not include untracked files.
So they do subtly different operations.
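To make the difference concrete, a toy example with made-up file names (the two commits are alternatives, not meant to be run back to back):

```
echo hello > brand-new.txt    # a file Git has never seen (untracked)
echo tweak >> tracked.txt     # an edit to a file Git already tracks

# Alternative 1: "git add ." stages both, so the commit includes brand-new.txt.
git add .
git commit -m "gets the new file too"

# Alternative 2: "git commit -a" only auto-stages files already in the index,
# so brand-new.txt would be left out of the commit.
git commit -a -m "only the tracked edit"
```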
I did not know that.
I'm just glad I don't use git add dot for anything.
Some people learn it that way.
So the thing is, like, if you're, like, a new engineer, trying to explain what the index is, is probably just way too much. Or, like, you probably just memorize git commit dash A, or you memorize git add dot followed by git commit. You just memorize it. And it probably takes you months to years of engineering before you start to even care about what an index is. I mean, illustrated by this conversation, where we're still learning things about Git. But, you know, it's not unreasonable, like, to not know this.
Yeah, yeah, I mean, I didn't know it until, like, a few weeks ago. But that being said, I never used git add dot. I, like, had memorized one of them when I was a young engineer who had just learned what Git was.
Yeah, I just have bash aliases for stuff,
and I don't even know the exact commands sometimes.
Fascinating.
So how much improvement does that make?
And does it help to turn on FSMonitor for all repositories?
Or does it only make sense after a certain scale?
It definitely helps if you have a lot of scale.
I would say that it's hard to even say it has like a 100x improvement
or a 1,000x improvement because the improvement is actually proportional
to how many files you have.
Because what it's doing is it's changing Git status
from being a command that stats everything.
So in Dropbox's server repo, that might be 500,000 files.
And Git status checks untracked files too.
So if you have a build directory with a bunch of tokens in it,
it'll stat all of those too.
So it can be variable depending on how much random build artifacts
you have floating around.
But yeah, Git status moves from that sort of operation
into an operation that only checks what has changed
since the last time you called git status,
which can be a big improvement for a lot of people.
That could go from 20 seconds to 0.1 seconds.
So 20 seconds to 0.1 seconds.
That is amazing.
And most people don't experience it
because you're working in smaller repositories.
I think for comparison,
Chromium or something is like 300,000 files.
Like, I remember benchmarking that, like, a year or two years ago it might have been. But at 500,000, like, statting each file definitely adds up to, like, unbearable times.
It adds up.
Yeah.
You can look into strace, or dtruss on Mac. Dtruss on Mac or strace on Linux.
That was always how I checked.
I would just be like, strace -c git status. And then it just gives you a listing of how often every syscall is called, and you look for lstat, and it'll be lstat: 500,000. That's how you know that you got the slow one.
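Roughly, the check being described, counting syscalls rather than tracing them all (on newer Linux kernels the stat calls may show up as newfstatat or statx rather than lstat):

```
# Linux: print a per-syscall count for everything git status does.
strace -c -f git status

# macOS: dtruss is the rough equivalent (needs root).
sudo dtruss -c git status
```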
So who added these, like, performance or, like, scaling features to Git? Because I remember
like, this thing a long time ago where, like, Facebook sent an email. Like, this is, like, a story, I don't even know how true it is, like, Facebook considered adding stuff to Git and then they decided to just go with Mercurial instead, and that's why they use Mercurial.
Yeah, yeah. Well, let me go to the other ones after FS Monitor, because I think those are the ones I know better. So FS Monitor was the first, like, foray into improving Git itself. Microsoft worked on this thing called VFS for Git, which you might be able to look up. It only works on Windows, but specifically it tries to make it into a virtual file system, where it only fetches the files you need. I don't actually know a ton
of details about it but essentially the idea is that the file system that your files are on is a virtual file system.
And when you go to access a file, it's running some code that will fetch them as you need them.
And so Microsoft actually changed the Git server and the Git clients so that everything would work virtually,
so that the Windows source and the Microsoft Word source would be able to run more effectively.
So they wanted to move some of that technology so that it would work across different platforms. And Apple recently put out a note saying they're just kind of banning kernel extensions. They thought a little bit about going that route on Mac, but they actually went a different route, to augment Git itself to take advantage of a lot
of these features. So one of the key ones is something called partial clone, where the naming, in my
opinion, is quite confusing. So there's a few features that sound like the same thing, but
they're actually subtly different things. But I'll try to call them out. You got partial clone,
which is when you run git clone, you don't fetch the entire history of the repository,
you just fetch kind of, like, metadata, and not all of the blobs of the history. Like, you don't fetch all of the historical files, only the metadata and commit messages and things like that.
So that's partial clone.
There's this thing called sparse checkout, where, unlike clone, checkout is where you take files from the .git directory and shove them actually onto your workspace, into your file system.
So partial checkout is only doing some,
or it's like, I already confused them.
Sparse checkout is only getting some of the directories.
And so sparse checkout limits the directories,
partial clone limits which historical files you fetch.
Like you only fetch metadata
and not the actual files in the history.
And there's this third feature called,
it's when you pass depth equals one to git clone, I believe it's called shallow clone. And that one actually just fetches less
metadata from your history. And so there's these three different like related features. And I think
two of them were built by folks at GitHub and/or Microsoft working together here.
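Spelled out on the command line, the three features look roughly like this with a recent Git; the repository URL and directory names are placeholders:

```
# Partial clone: full history metadata, but file contents (blobs) are fetched
# lazily, only when a checkout actually needs them.
git clone --filter=blob:none https://example.com/big-repo.git

# Sparse checkout: only materialize chosen directories in the working tree.
git sparse-checkout init --cone
git sparse-checkout set services/metaserver tools/build

# Shallow clone: truncate history itself to the most recent commit.
git clone --depth=1 https://example.com/big-repo.git
```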
Okay, so I didn't know about partial clone
and that seems super helpful.
What happens when you like,
I'm guessing you can do something like a partial clone for a repo up to, like, just a year's worth of metadata.
And then what happens if you check out
to like two years back or something?
Yeah, so partial clone landed like a year and a half ago.
It's pretty new.
Like you're going to have to get a pretty new version
of Git to use it,
but it works.
Yeah, it's git clone dash dash filter equals,
and then you have to read the man page.
Filter equals blob none,
but it's got some,
you can find tutorials.
There's a pretty good tutorial on GitLab.
But yeah, the idea is that
you can clone any repository
where the server supports partial clone
and GitHub itself added support
for partial clone in all of its backend servers pretty recently. And that's not a coincidence.
There's folks at GitHub who work on Git. So that means that you can partial clone stuff
from GitHub.
Yeah, but what happens if you check out something that's like older than what you
have?
Yeah, so if you check out something older than what you've partially cloned, and it contains blobs, so blob is the word for a file, a historical file object. If you check out a blob that you don't have, there's this notion of a promisor, which promises to get you the blob, and you have a list of them. And so it'll go check all of your promisors to go fetch the blob dynamically. And so it moves a lot of the logic from the time of git clone
to the time of git checkout.
And it only does it on the checkouts where you actually need it.
Interesting.
So you don't have to worry about, you know,
it erroring out in case you just need to check one file out that's, like, super old.
It kind of does that on the fly.
Yeah, it'll fetch them.
It does have this key downside.
When you clone, you get all of the
files for all of history all at once at the beginning, which means you get really good
compression and kind of like packing. And so when you have partial clone, the packing is naturally
like a little bit worse. And so in order to have good space usage, you need to periodically repack
your repository if you're using something like
partial clone. So this is something that VFS for Git built in. And they actually built it into Git
itself. I believe it landed in March 2021. So like the very, very newest version of Git has this
command called Git maintenance, where you can run background repacks, like kind of like a cron job.
So Git is slowly moving in this direction of having a lot of the pre-computation
and work being done offline,
packing, garbage collection, stuff like that.
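As a sketch of what that looks like (the task names come from the git-maintenance docs and need a new enough Git):

```
# Register this repo and schedule background maintenance jobs.
git maintenance start

# Or run individual tasks by hand:
git maintenance run --task=incremental-repack
git maintenance run --task=commit-graph
git maintenance run --task=prefetch
```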
Interesting. Yeah, I think I saw that
and I just ran Git maintenance just for fun,
even though I have a super tiny repository now
because I'm still interested in these things.
That's fascinating.
I was going to say,
Microsoft has this tool called Scalar,
which their idea is that they implement
a lot of these asynchronous tools into Scalar,
but they're slowly trying to upstream them all into Git.
And they're hoping to completely obsolete Scalar
in favor of just upstreaming everything into Git.
And it seems like it's working.
Just run Git maintenance,
and that will do everything that Scalar does.
But Scalar will be like, you just have to install it on your own and probably run it
and ask your users to run it.
Yeah.
The key challenge that I ran into when deploying improvements to Git was Git often requires
people to run commands in order to, as you said, run Git maintenance.
How do you work around that?
I'm sure for the FS monitor stuff, you have to think about that stuff as well.
Users have to manually configure FS monitor and everything.
Honestly, I think there's a key insight that I learned from working on sync that was helpful
here.
So Dropbox, the product, auto-updates software out to tons of clients, and it's software where, you know, your users aren't going to be running commands. They just need it to work. It is really, really hard to write correct software that has persistence, like deep persistence, on the client side. Like, if it is not easy for you to wipe all of your local state and be okay, it is really hard to write correct
software. And the more state, more historical state that is stored locally, the harder and
harder it gets. For something like Dropbox, Dropbox needs to, in order to know whether or not a,
so it'll know that there's a difference between your local files and the remote files. In order
to be able to tell whether that's something you should upload or download, like to tell whether the difference is something new on
your side or new on their side, you need to store metadata about what you think the server has and
metadata about what you think you have locally. So you can tell the difference. And that's a lot
of state that you're storing locally. And so you have to be very careful about that state. There's
a super good quote that one of my Dropbox colleagues said: a system is not correct if the code is correct. The system is correct if the code is correct and the data is correct. So not only does your code need to be correct, your databases need to be consistent and
look the way that you want. Maybe a concrete way to think about this is that if you introduce a
bug into your code, and it writes some bad data to your database, and then you fix the bug, your system is not yet correct, right? You have to fix the bad database as well. For these, you know, long-term state systems,
it's really hard to do this well.
And I think Git absolutely qualifies.
The index is a long-term local stateful thing.
And I think the new features of Git really exacerbate this
because they're like, let's look at FS Monitor.
FS Monitor, in order for Git status to give you the right answer,
it needs to persist the correct thing
about when the last time
it queried FS monitor was. And then when it gets information from FS monitor, you know, if you hit
control C, like you need to make sure that you either persist the new token and incorporate all
of the new files into your index or not. And if you have a bug there, it means git status is going to show the wrong answer. And it's not like it'll fix itself. It'll just keep showing the wrong answer forever. Like, if you miss a file, you'll never see it. Yeah, and so it is really hard to write this kind of stuff to be super highly correct. So this is why Git is always super careful about having you type in commands. And, like, you know, if you're using a beta feature, you have to run this thing that, like, rewrites your index, because they want you to be involved,
because until they're extremely confident that everything is working perfectly,
you need to be involved. And you need to know that, like, if you hit a bug, like,
it's not the kind of thing where you run git status again and things are okay. You might
have to like clean, you know, clean and restart over. Yeah.
You need like a support channel or something like where people can ask like, oh, it looks like my index got into like a terrible state.
And fortunately, it's okay to like throw away the index and start from scratch.
But of course, you don't want to throw away files or anything like that.
Yeah.
If you know that your local state is clean, like if you know that you haven't been working on it,
then you can throw away your index pretty safely.
But they don't want to automatically throw away the index
because if you are working on something,
then when you automatically throw away the index,
Git can no longer tell.
It couldn't answer the Git status command.
It wouldn't be able to give the correct answer.
Okay.
And it sounds like you have a bunch of stories
from when Dropbox might have messed up people's state on accident.
I don't know how much you can share that.
And this is like just going on your experience working on the Dropbox sync engine, which, as you said, it's like remarkably similar to like improving Git in a way.
Yeah.
Honestly, both of them inspired me to work on the other. Maybe as a bit of an aside, I think that one of the best developer tools engineers is someone who formerly was a developer tools customer. Just, like, by switching between the two, it's like you understand the pain points perfectly. So that's one of the reasons I flip back and forth. But yeah, working on sync,
it's really hard to write a perfectly stateful, good system.
A huge part of it, you want to cover your ass in as many ways as possible. You want to have good tests so that you cover situations that you anticipate. You want to have sorts of tests that
help you uncover situations that you don't anticipate. You want to have recovery mechanisms so that if
you hit situations that you didn't anticipate, you still have a way to handle it well. And all
three of those are important. Like, I think it is super naive to think that for a very stateful system, you can have good test coverage. You can't cover all the cases, it's just there's too many,
because one of your inputs is the state you started in.
And users can mess with your state
in all sorts of terrible ways as well. So you need to make sure you're doing things like checksums or, like, validity checks, I guess, on your existing state, to make sure that it's not malformed by someone, even on accident, right?
Yeah, you want to do validity checks, because, for example, people could have faulty hard
drives, or maybe they were messing with it. But honestly, that's not even the scary part. Like,
if I were to like, look at how many weird states we ended up in, and how many of them were due to
the user being malicious, it's tiny, like, negligibly tiny. And how many are due to the hardware being bad, it's a little bit higher, but also pretty tiny.
Most of them were due to bugs that were in our code for a transient amount of time.
Really obscure bugs usually.
Like you turned your computer on after not being on for several days.
And while it was off, you renamed a folder that was called A and you put it into B.
And at the same time, you took a folder called B and you put it into A. And then, like, it's trying to resolve this situation, and the people writing the software never thought of this situation originally. Or at least, maybe it was thought of now, but it wasn't thought of a year ago. Like, it doesn't matter. Some weird situation like this happens and you write some bad state out. And that bad state is now persisted. And so there's no, like, fix, right? One of the things is, like, software like Dropbox,
or git status, for example, or git status with FS Monitor, is that it's trusting this persisted
state as one of its inputs. And so if the input persisted state is bad, it can be really hard to
figure out what the issue is. So a lot of the
protections you want to put in place. One is if there are cheap protections, like if on the fly,
while you're reading stuff in, you can just check it, you should do that. And if something is wrong
with that input state, you should think carefully about what you want to do. Maybe you want to fix
it on the fly. Maybe it's illustrative of a problem and you just want to crash. In some cases,
fixing it on the fly is reasonable.
In some cases, like you might just be causing another problem because you don't actually truly
understand the problem that you're fixing. And so sometimes it's like more dangerous to fix it.
One of the tools that's really helpful is to build in some kind of more expensive
consistency checker. For example, Dropbox has one that ensures that file IDs are unique,
where, you know, if you have a lot of files, you want to make sure that the identifier for the...
Like if you rename a file, it should have the same identifier in the new location.
And if there's some bug, some complicated, weird bug, and you end up with the same file ID in more than one location, that's super bad.
And fixing it on the fly might be hard to do.
So you have things that run full consistency checks over all of your files to
make sure that every file that's in a directory, that directory also exists. You might have one
with permissions, like every file that's in your folder you have access to, like that should be the
case, but you really want to make sure, you might run ones that validate that. I'm having trouble thinking of more off the top of my head. Yeah, but you want to have things like this. Oh, here's one: every folder should be in a directory, but you want to make sure that that graph, in, like, the computer science sense, doesn't have any cycles. Like, you don't want A to be inside of B and then B to accidentally also be inside of A.
A tree, not a graph.
Yeah, it should be a directed acyclic graph of your files, and, like, there shouldn't be circles. But, you know, that's an invariant that could be broken by a bug pretty easily. So, yes, stuff like that. Those are expensive invariant checkers, but you want to have those built in. And I know Dropbox has them running in the background, I think once a year
for most people, but for
internal employees who get the everyday code release, it runs every day because if there's
a bug that went in today, we want to find out immediately. And so I think stuff like that is
really critical. In my opinion, I don't think Git has enough of this. Like, it doesn't do background internal consistency checks of the index unless you type a command in explicitly. And so when bad state gets in, it's hard to even notice. So I think this could be an area where something like git maintenance really has a lot of benefit. Like, you could opt into automatic background consistency checks. Yeah, I think that kind of stuff is really useful.
Yeah, I really want to know, like, what prompted all of this, like, correctness focus. I mean, it makes total sense to me, like, just some incidents or some bugs that were, like, super hard to, like...
We don't know what the heck is going on. Drop the
database, re-download everything from the server, like restat every single file in the file system.
Like we're going to do this super slow operation to just try to piece things together from scratch.
And it's a slightly lossy operation in that if you're working on things while the operation
is happening, we might lose that. But it is usually, like, that's our hammer. That's, like, we don't know what's going on, we're going to apply this hammer. And so we try not to use that hammer, but basically working on sync is a lot of using this hammer. It's a lot of, like, figuring out what cases that we, you know, come across where we might have to use something like this.
Yeah, what are some good examples?
Like, I'm going to try to think of one
that has a lot to do with persistence locally,
because I know that we had some issues
that are often server-side,
but I think there's a dime a dozen of those.
You hear them from lots of companies,
but I'm going to try to really think of one
that highlights the client-side version.
Okay, this isn't quite client-side,
but I think it'll illustrate the point.
Dropbox has a journal that stores every version of every file, and you append rows to this journal. Every time there's a new version of a file, you'll append a row to the journal, and there's this, like, index that you can use to identify the latest version of a given file. And so at some point
much later into the history of the company, like
10 years into the company, we're like writing a consistency checker. Every file should have
a latest version. Like there's a tombstone when you delete a file. So there's a row that's like,
we call it a tombstone. It's like a row that says like this entry no longer exists.
So we started writing these consistency checkers that just go over all of history and try to figure out like, every file should have a row that is considered the last row for that file, like this index should have something.
And you know, you run this consistency checker, the current code, as far as we can tell, never creates something that has this problem. So we're pretty good about it.
You run the consistency checker, the data is not consistent. Like there are files that have no latest version.
You're like, it only has history.
This is weird.
What's going on?
Yeah, turns out long ago, there was probably a period of time where there was a bug in
the source or maybe an intended behavior, hard to tell.
But at some point in the past, it didn't work this way.
And so you have to go sleuthing to figure out what that is. We actually, I think we used to not write tombstones. And we started writing tombstones
later, but only specifically for, I think it was, like, a shared folder unmount. So if you stopped
being part of a shared folder, we would, it's called unmounting, unmount the shared folder and
not write a tombstone. And then at some point later, we were like, for consistency, we should write a tombstone in this case. So much later, after that
decision was made years after that, we were writing this consistency checker, we saw that
some files don't have a latest row. And so we thought, oh, let's just fix the problem. And we
would go in. So we wrote a fixer that would run the consistency checker and then just make the last row the latest one.
And it turns out that this started just resurrecting.
It would just bring back shared folders that people had left.
And the way we found out about this, I think this is hilarious.
But the way we found out about this is that one engineer's ex-partner, someone they broke up with, they had a folder of pictures with them.
And it reappeared that day.
And they were like, what the heck?
I deleted these for a reason.
Like what's going on?
It was so funny.
Like that's how we found out about it.
And it actually turned out to be a pretty severe thing.
Dropbox is a company that, like, very seriously cares about data loss. Like, we very carefully don't want to delete files unintentionally.
This is a case
of the reverse. It's actually also bad if you thought you deleted it, but you didn't. Yeah.
This was not intentional. So we ended up fixing it and, you know, communicating to the people that
this was due to an issue. Honestly, I'm very appreciative of like, if a company makes a
mistake and then just goes to great lengths to fix the mistake i'm very appreciative of that because having worked on software the mistakes are just an inevitability
they happen so i'm more looking at the response to the mistake rather than the fact that it happened
that's fascinating and i'm guessing there's some part of the code that just like
since there is a latest version of this thing i need to i need to sink it down it that just like, since there is a latest version of this thing, I need to I need to sink it down. It was just like, an expectation that was good if I probably like years before this
consistency check came in and started fixing things. Yeah, so it's not just your code that
has to be correct, your data has to be correct, too. And if you find an inconsistency in your
data, it's probably good to figure out how it got there.
Because just fixing it, you might not actually, you need to like really understand.
Maybe you don't need to perfectly understand how it got there, but you need to understand the consequences of your fix.
Because it might not be as obvious as you think.
And also, yeah, like stakes are important, right?
Like if the stakes are you're bringing back files
people thought they deleted, it's pretty high.
If your stakes are you're bringing back high scores
on Candy Crush Saga that you thought you deleted,
it's probably okay.
Yeah, stuff like that.
People can be pretty serious about their high scores.
An interesting project that I know about is rewriting sync, like, the sync engine, from Python to Rust.
So first of all, how do you pull off a rewrite like that safely? I'm not going to go into like making the decision and all that, but like more like how do you make sure your new code base
actually works for all of those edge cases that have existed for like years and years?
Yeah. Well, you got to know the technical shortcomings
of your old system. I think one really key thing is, like, you've got to respect your old system. Like, it's very easy to talk shit about something that's causing you pain. You also need to understand and respect and fully, like, appreciate what got you there about that system, because you see the pain points, you don't like them. It's the second-system effect. It's like when you build a second version of something, you forget
what made the first system so good. So you want to make sure that you hold that first system with a
lot of respect and really understand all the good things it did well. And for its shortcomings, you
want to make sure that when you solve them, you're not also damaging the stuff that made the original one good.
You want to understand what changed, like when the first one was built to now, maybe some constraints have changed that make different options more optimal.
One of the most major constraints is, like, when the first system is being built, there's, like, usually fewer engineers and usually the stakes are lower. Like, how do I put it, it's like they're okay with the thing making more mistakes, because it's the first version of the thing and they're just kind of trying to get something that sort of works. And so when you look at it later, you have to understand that that's the perspective they came from. Looking at Dropbox, for example, Dropbox was
like, you know, if it worked for most people most of the time, like, that's amazing, and you just want to get it working for more people more of the time. And, like, that's the perspective you have when you're a 30-person company just trying to get customers. When you have a lot of customers that trust you a lot, you have a very different perspective, where you're like, don't poke it, because, like, it's dangerous. Like, if we delete something and it affects a tiny fraction
of people,
that person will get very angry and probably post on Twitter about it or whatever it is.
You want to be a little more careful at that point.
So for rewriting the sync engine, yeah,
lots of respect for the original system.
Understand its shortcomings.
Those are the things you know,
but also understand what's good about it.
So maybe I can get concrete.
Sync Engine Classic, which is what we called it, the shortcomings, which were pretty clear to us, were that it was really hard to build things on top of it.
Particularly, it does a lot of things, right?
It syncs files up, it syncs files down.
It had selective sync on the local side.
It had smart sync where it was dynamically fetching files from the server.
And then it had a bunch of features bolted onto it as well.
Things like the Google integration for Google Docs.
And there are a few other things,
but it had a lot of stuff like attached to it.
And then it was closely tied to the data model of the server.
So this idea of having like a row for each change to a file was pretty key.
And then I think another really fundamental thing is that it tied a file to a path. And so renaming a file was a pretty challenging operation. Representing a rename as a delete plus an add is pretty easy, but Dropbox had grown a lot of features that really wanted to make sure you kept the identity of a file across a rename. For example, if you attach comments to a file, or if the file was put in by someone else, you want the trail of who originally put it in to follow along. And so renames in the data model were pretty tricky to do. Particularly if you rename a directory:
just imagine you have a directory with 10,000 things inside of it. If you have a path-based file model, then renaming a directory with 10,000 objects is something like 20,000 operations. And so there are these fundamental constraints there as well, where I think prior to Nucleus, you just couldn't rename things that had more than a certain number of files in them. If you went to the web and renamed a directory that had too many things in it, it would just say, we can't. And if you did it on the desktop, it would slowly, in a random order, issue 10,000 deletes and 10,000 adds, which is kind of scary, because maybe it deletes 10,000 things and then issues 10,000 adds. So there was a lot of fragility there, and there were a lot of heuristics in Sync Engine Classic to protect against this, because we didn't want every rename to just delete stuff by accident. And that's where the respect comes in: the fundamental data model made it impossible to do this well, but there was a lot of logic in there that made it possible to do it pretty well for most cases. And so you really wanted to respect a lot of that.
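To make the data-model point concrete, here is a minimal Rust sketch, invented for illustration rather than taken from Sync Engine Classic or Nucleus, of why a path-keyed model turns one directory rename into thousands of delete-plus-add operations, while a file-ID model with parent pointers makes it a single metadata update:

    use std::collections::HashMap;

    // Path-keyed model: a file is identified by its full path.
    struct PathModel {
        entries: HashMap<String, Vec<u8>>, // path -> content hash
    }

    impl PathModel {
        // Renaming a directory touches every descendant: a delete plus an add
        // per entry, so a directory of 10,000 files costs ~20,000 operations.
        fn rename_dir(&mut self, old_prefix: &str, new_prefix: &str) -> usize {
            let moved: Vec<String> = self
                .entries
                .keys()
                .filter(|p| p.starts_with(old_prefix))
                .cloned()
                .collect();
            let mut ops = 0;
            for path in moved {
                let content = self.entries.remove(&path).unwrap(); // the delete
                let rest = &path[old_prefix.len()..];
                self.entries.insert(format!("{new_prefix}{rest}"), content); // the add
                ops += 2;
            }
            ops
        }
    }

    // File-ID model: a file has a stable ID and a parent pointer, so renaming
    // a directory is one metadata update, and the identity of everything
    // underneath it (comments, sharing history) survives.
    struct Node {
        parent: Option<u64>,
        name: String,
    }

    struct IdModel {
        nodes: HashMap<u64, Node>,
    }

    impl IdModel {
        fn rename(&mut self, id: u64, new_name: String) {
            if let Some(node) = self.nodes.get_mut(&id) {
                node.name = new_name; // a single operation, regardless of subtree size
            }
        }
    }

The exact structures don't matter; the point is that in the first model the work scales with the size of the subtree, and in the second it doesn't.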
Let me see if I can give an example. If Dropbox had to do a lot of operations on a bunch of files, it would sort all the files by the length of the path. That would make sure that all the directory operations happened before the file operations, so that if you were adding a bunch of directories with files in them, the directories got added before the files. But for deletes, you wanted to do it in the opposite order. So if you have a jumble of adds and a jumble of deletes, and they all have paths, and you want to carefully figure out what order to send things up, there were a lot of heuristics in there to make sure you got it right. And the workload changes, right? You could have added something and then later changed your mind and deleted it, and those show up as two entries, so there were heuristics that squash your queue: if it sees something that's already there, it would squash it in to make sure you do the right thing. It's very easy to mess this stuff up. So that system: pretty fragile, pretty hard to change, but it worked pretty well.
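A rough sketch of that kind of ordering and squashing heuristic, again invented for illustration and far simpler than the real Sync Engine Classic logic, might look like this:

    enum Op {
        Add(String),    // path to add
        Delete(String), // path to delete
    }

    // Order a mixed batch so parents are created before their children,
    // and children are deleted before their parents.
    fn order_ops(ops: Vec<Op>) -> Vec<Op> {
        let mut adds = Vec::new();
        let mut deletes = Vec::new();
        for op in ops {
            match op {
                Op::Add(p) => adds.push(p),
                Op::Delete(p) => deletes.push(p),
            }
        }
        // Adds: shortest paths first, so directories exist before their contents.
        adds.sort_by_key(|p| p.len());
        // Deletes: longest paths first, so contents go before their directories.
        deletes.sort_by_key(|p| std::cmp::Reverse(p.len()));

        adds.into_iter()
            .map(Op::Add)
            .chain(deletes.into_iter().map(Op::Delete))
            .collect()
    }

    // A very simplified version of queue squashing: keep only the most recent
    // pending operation per path. The real heuristics were much more careful,
    // e.g. an add followed by a delete of a brand-new file should cancel out.
    fn squash(ops: Vec<Op>) -> Vec<Op> {
        use std::collections::HashMap;
        let mut latest: HashMap<String, Op> = HashMap::new();
        let mut order: Vec<String> = Vec::new();
        for op in ops {
            let path = match &op {
                Op::Add(p) | Op::Delete(p) => p.clone(),
            };
            if !latest.contains_key(&path) {
                order.push(path.clone());
            }
            latest.insert(path, op);
        }
        order.into_iter()
            .filter_map(|p| latest.remove(&p))
            .collect()
    }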
But, you know, it wasn't going to help us build new things on top of it.
Those are the pain points.
I think I also hit on some of the things we really respected about it and really tried to understand. What it took was people who really understood why the system was built the way it was.
Like, you know, if there was a bug and you lost data, you had to understand exactly what happened.
And that was like a really important thing.
And that was more than just reading the current code.
It was reading historical code.
It was understanding which versions of the code went out and which releases.
It was reading the log files from different releases.
You really need to understand a lot of things in order to be able to do well.
Do you think having a completely new set of engineers trying to rebuild this thing would have been basically a total failure? It was people who were steeped in the existing system for a long time, who understood how it worked, and then them rewriting it over a long period of time, that made sure it could actually succeed. How big a deal was that?
Let me make sure I answer this the right way. My knee-jerk answer is yes: it was really important that you had people who understood the original system. But I think there's a key constraint I'm missing here, because I don't want to discourage just building another version without knowing anything; that's how the original system got built.
I think the key insight here is that if you're building another version of the system, it will solve the most immediate problems, and you'll probably end up with another thing that's kind of like the original system. I think if you tried to build Dropbox from scratch, a lot of people could build something that worked pretty well for most of the cases, and a lot of people could build that without a lot of prior knowledge. But I think you would end up in a similar situation over time, where you make similar mistakes, and maybe some key insights, like the file ID data model versus the path data model, might help you. But it's really hard to foresee a lot of these issues, and it's really hard to foresee a lot of the complexity that comes in. I think deeply understanding your original system, particularly if you're trying to replace it in place, is really important. And for an established company that has a lot of existing customers with existing expectations, I think it's basically table stakes. It's pretty critical.
So what kind of testing or verification or metrics did you monitor when rolling this out incrementally? And how long did the incremental rollout take?
Yeah, these are great questions.
Maybe I want to draw attention to like making the decision
of like whether you should roll it out
or you should add more testing.
And I think it's a very situational decision.
If you're making a brand new product, especially if you don't have product market fit,
absolutely roll it out.
I think this is just universally good advice.
For replacing a system that already has product market fit, I think it
gets harder. Like, I think it gets harder to pick. I think it'd be too easy to say, definitely roll
it out. It'd be too easy to say, definitely make sure you got it perfect before rolling it out.
I think it's pretty situational. A pretty good heuristic is: if you understand what the biggest failure modes are and how to protect against them, it's worth delaying the rollout to do that.
And the biggest failure modes are not every failure mode. They're the ones that you know will cause the SEVs, the ones that lose you customers, the ones that lose you money, the ones that really have a big impact. If you know what those are and you can protect against them ahead of time, those are the ones it's worth delaying your rollout to iron out. I also think: what is your remediation?
Know your remediation story. If your remediation is naturally pretty expensive, and very stateful systems have this problem... Like, for Git, the remediation is wipe it and re-clone. If you have a small repo, no big deal; if you have a big repo, it's painful. And that's a good analogy for Dropbox: a lot of the customers have big situations, so if we roll out something with a bug, we're going to have to send an email that's like, I'm sorry, you're going to have to reinstall, and people will be like, no. Or we'd have to roll out something that did that automatically. But either way, you don't want to put people in that situation. So know your remediation. If your remediation is easy, then roll it out, because you'll find the bug faster that way. And even in a system like this one, we had to pick and choose which features we wanted to do early and which ones we wanted to do later, after rolling out. And it was actually pretty hard to carefully pick and choose these, because a lot of it is carefully weighing what's your remediation, what's the cost if it goes wrong, and making decisions case by case. It was very hard to have a singular mindset. So that was the art: picking and choosing what we wanted to build into it.
so there was like i'm guessing a rollout framework like not like something in code but something in
a document where you said these are the set of expectations we need to have in order to roll out
to a certain percentage yeah i'll give you a pretty concrete example dropbox has a feature
called p2p where if you fetch a block from someone who's on the same network as you, it's faster because you'll fetch it from them directly instead of
from the server. That was a feature that we didn't roll out in the first version that we rolled out
to people. That was an example where we're like, the failure mode is that it's a bit slower.
And the remediation is that it goes to the server and fetches the block. So it's okay.
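A sketch of that failure mode and remediation, with made-up types rather than Dropbox's actual block-fetching API, might look like this: try same-network peers first, and fall back to the server if that fails for any reason.

    // Invented types for illustration; not Dropbox's actual API.
    type BlockHash = [u8; 32];

    trait BlockSource {
        fn fetch(&self, hash: &BlockHash) -> Option<Vec<u8>>;
    }

    // Try same-network peers first (the P2P fast path); if that fails, the
    // remediation is simply to fetch the block from the server.
    fn fetch_block(
        hash: &BlockHash,
        peers: &[Box<dyn BlockSource>],
        server: &dyn BlockSource,
    ) -> Option<Vec<u8>> {
        for peer in peers {
            if let Some(block) = peer.fetch(hash) {
                return Some(block); // fast path
            }
        }
        // Worst case without P2P: a bit slower, but still correct.
        server.fetch(hash)
    }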
But for our biggest customers, that's a huge pain point, especially ones in Australia, since our servers are in California and they're in Australia. So it can be painful.
For those customers, we're like, actually, this is a killer feature for them.
So those are the ones we shouldn't roll it out to.
So we had to make a lot of careful decisions like that. One example: large customers, business customers, came later, and business customers had a larger set of features. And within the business customers, the ones in, say, Australia would have to wait until we finished something like P2P, whereas for some of the other ones we would just roll it out without finishing P2P and add it later. And those are all just judgment calls we had to make case by case.
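Purely as an illustration of that kind of per-cohort judgment call, and not Dropbox's real rollout system, the gating might be sketched like this:

    // Hypothetical rollout gating; the cohort attributes are invented.
    struct Cohort {
        is_business: bool,
        far_from_servers: bool, // e.g. Australia vs. California datacenters
    }

    // Features the new engine has implemented so far.
    struct EngineFeatures {
        p2p: bool,
    }

    // A business customer far from the servers leans hard on P2P, so they
    // wait until that feature lands; other cohorts can get the rewrite earlier.
    fn eligible_for_new_engine(cohort: &Cohort, features: &EngineFeatures) -> bool {
        if cohort.is_business && cohort.far_from_servers {
            features.p2p
        } else {
            true
        }
    }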
Okay. And you mentioned that, you know, both your code and data need to be correct.
And for your data, you had things like verifiers or like consistency checkers that would make
sure that it's not extremely terrible.
But I'm still a little confused about how you make sure you're confident, code-wise, to roll out a new version of the sync engine when there are so many edge cases you could be dealing with. What made you confident to turn it on for, you know, 0.1 percent of people?
Yep, this is also a great question. So we had lots of layers of testing, and we didn't think of testing as, you should write tests because it's good to know if your thing is working. We thought: how do we validate that this thing will work when we replace it? How do we answer that question? And whatever tools are at our disposal to answer that, we're happy to use them. So we went through all the tools: you can write unit tests,
you can write integration tests, you can write randomized tests, you can roll out to a fraction
of the population, you can ask for beta testers, you can roll it out to
internal employees, you can run it in the background and compare results. We went through a whole gamut
of things. And we tried to pick and choose the ones that would help us the most. One of the things I'm
most proud of is that we built a randomized testing framework. It's kind of like a fuzzer, where you're throwing random inputs in, but I think one of the key things here is that, because it's a stateful system, one of the inputs is the state. And so we had this component called Canopy.
Canopy was the trees within Dropbox.
So that we had a local tree and a remote tree and something we called the
synced tree, which was the base between the two.
And the three trees together was called the canopy. This is a fun name. But we had this randomized checker called
canopy check. And it would randomly generate state of the three trees, and not totally random,
like it would really try to be random, but hit weird cases a lot. So random with some bias towards weird cases. For example, like,
if you rename a file locally, then that file is going to exist on the local side and on the remote
side. And we want to make sure we hit that. If the remote side puts B inside of A and the local
side puts A inside of B, we want a randomized checker to find that and figure something out.
Like, essentially, what it would do is it would create three random trees for like the initial state, and then it would just run the
sync engine and make sure that something reasonable happened. In a lot of cases, it's pretty obvious
what the reasonable thing is. But in a lot of cases, like the A inside B, B inside A, or maybe
even a three cycle, A goes in B, B goes in C, C goes in A. Like all three of those operations just got put in at the same time. What
the heck do you do? Concretely, maybe three remotes put those operations at the same time,
something like that. So if something like that is going on, just do something reasonable,
like any reasonable operation. And so the art to randomized testing, in my opinion, is making your validation step at the end of the randomized test as simple as possible. Otherwise the ratio of issues in the validator versus issues in the code gets too high, and you just start to lose faith in it. So the key is to make it as simple as possible and watch the ratio when canopy check fails: what's the probability that it's a real bug versus a validator bug? You want that to be really high.
So the earliest validator we had was just make sure it's synced.
Make sure all three trees are the same. That's not good enough on its own, because you could delete all the files from everywhere and that would be synced. Congrats, you've synced all the files, but pissed off all your customers. So we would add a few more checks on top of that, but you can get arbitrarily complicated, all the way to the level of writing a sync engine again in your validator. So we tried to be pretty simple. Like, if a file is there at the beginning,
make sure it's there at the end, unless it was deleted. So we started with something like that.
And then the "unless it was deleted" part got harder and harder to implement, so we tried to be careful about it. For example, when we built the selective sync feature: if you selectively sync something out, it would not be on the local side, but it would be on the remote side, and it wasn't deleted. And so we had to extend the validator: if it's not deleted or selectively synced out. And so
that's the art, right? I think that check
was cheap enough that it was worth it,
but some checks are too expensive.
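Putting those pieces together, here is a very rough sketch of what a canopy-check-style harness and its simple validators could look like. The types and helper functions are invented stand-ins, not the real canopy check; a real generator, engine hookup, and validator are far more involved.

    use std::collections::{HashMap, HashSet};

    // A tree is just a map from path to content, which is enough for a sketch.
    type Tree = HashMap<String, Vec<u8>>;

    // The three trees the engine reconciles: local, remote, and the synced base.
    struct Canopy {
        local: Tree,
        remote: Tree,
        synced: Tree,
    }

    // Stand-ins for the real pieces. A real generator is biased toward weird
    // cases: same names on both sides, A moved into B while B moves into A,
    // rename cycles, and so on.
    fn random_canopy(_seed: u64) -> Canopy { unimplemented!() }
    fn run_sync(_canopy: &mut Canopy) { unimplemented!() } // engine under test
    fn was_deleted(_path: &str, _after: &Canopy) -> bool { unimplemented!() }
    fn selectively_synced_out(_path: &str, _after: &Canopy) -> bool { unimplemented!() }

    fn canopy_check(seed: u64) {
        let mut canopy = random_canopy(seed);
        let original_paths: HashSet<String> = canopy
            .local
            .keys()
            .chain(canopy.remote.keys())
            .cloned()
            .collect();

        run_sync(&mut canopy);

        // Validator 1: everything converged; the three trees agree.
        assert_eq!(canopy.local, canopy.remote, "did not converge (seed {seed})");
        assert_eq!(canopy.local, canopy.synced, "base out of date (seed {seed})");

        // Validator 2: "converged" alone isn't enough (deleting everything
        // would pass), so also check that files present at the start are still
        // present, unless they were deleted or selectively synced out.
        for path in &original_paths {
            assert!(
                canopy.remote.contains_key(path)
                    || was_deleted(path, &canopy)
                    || selectively_synced_out(path, &canopy),
                "file {path} went missing (seed {seed})"
            );
        }
    }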
One thing that randomized testing
really does, which I love about it,
is that it makes asserts in your code get superpowers.
Asserts suddenly become amazing
because for every assert you write,
that could have been a unit test.
But when you have randomized testing,
suddenly these asserts cover things
that you didn't anticipate.
You can even have expensive asserts.
You can just have them like turned on
in your randomized testing framework. You could have an assert that like consistency checks an
entire data structure or does something that would be unreasonable to do at runtime. So that kind of
stuff is very cool. Asserts get superpowers. That's so interesting. And like, you can just
check for something like does the sync engine crash when trying to sync stuff?
And the simple answer is like,
the sync engine should never crash.
Yeah.
Right?
Yeah.
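To illustrate what an "expensive assert" could look like, here is a hypothetical example; the feature flag and the data structure are made up, but the idea is a whole-structure consistency check that would be unreasonable in production yet cheap enough to run after every step under a randomized tester.

    use std::collections::HashMap;

    // A toy index that keeps two maps which must always mirror each other.
    #[derive(Default)]
    struct FileIndex {
        by_id: HashMap<u64, String>,   // file id -> path
        by_path: HashMap<String, u64>, // path -> file id
    }

    impl FileIndex {
        // Walks both maps to confirm they mirror each other exactly.
        fn check_consistency(&self) {
            assert_eq!(self.by_id.len(), self.by_path.len());
            for (id, path) in &self.by_id {
                assert_eq!(self.by_path.get(path), Some(id));
            }
        }

        fn insert(&mut self, id: u64, path: String) {
            self.by_id.insert(id, path.clone());
            self.by_path.insert(path, id);
            // Under the randomized tester (a made-up cfg flag here), verify the
            // whole structure after every mutation; in production this would be
            // unreasonably slow, so it compiles away.
            #[cfg(feature = "randomized-testing")]
            self.check_consistency();
        }
    }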
And that gives a lot of power to engineers. But do you feel like engineers got lazy when they had this in place? Or do you think that was just one layer in the list of things you needed to do in order to test things? And how often did the randomized checker actually catch stuff?
I would say that it got disproportionate value upfront. Like when we first wrote it,
it caught tons and tons of stuff. And then over time, less and less, and fewer and further between. And sometimes it would find something really obscure, like it found one where the file name was longer than a certain length. Or it might
find one where like the same file has two of the same block in it. And we tried to make these
obscure situations more common, just like by tweaking the way that the random trees were
generated. But that stuff is, it's hard to do because you can only cover what you anticipate
there. But I think inherently randomized testing covers stuff that you didn't anticipate. And I think
that's a wonderful thing about it. It's never going to cover everything. And it's always
important to have validation afterwards as well. So we also ran stuff in the background and did other things like that. Let me answer your question about what it did for engineers.
I think it made the barrier to entry to the code base,
or sorry, it made the barrier to landing your first diff
a little bit higher,
but I think it made the barrier to entry a little bit lower.
And this is kind of a weird concept
because it meant that it took longer to land your first diff.
So it felt like the barrier was higher,
but I compare it to my experience on Sync Engine Classic. On Sync Engine Classic it was easy to land your first diff, but it was also really likely that your first diff caused a giant bug. You couldn't assign a new engineer a thing that was in a risky area of the code; it was just too dangerous. If something went wrong, the effort it took to fix the thing that went wrong was 100x higher than actually doing the change. So it was just super dangerous to let new engineers near anything important.
And I think one of the cool things is that randomized testing just flipped that,
like new engineers, you could let them near something important. I actually thought, and I've had this corroborated by a few people, that canopy check taught people how the code worked. You would go to make a change where maybe you added something to CanopyCheck,
some kind of validation.
Let me think of a concrete example.
We added randomized xattrs (extended attributes) on files to the randomized tester.
And there was a new engineer who I was mentoring, and that was one of their intro projects.
When they added it, CanopyCheck started failing.
And it was like, hmm. And they started
investigating and part of it was because they added it in the wrong place in the code. But in
order to figure that out, they had to understand the code and understand the validator. And so
they just started learning about the system, the system would tell them what was wrong.
And then finally, it turned out that they actually caught a real bug, where I think if an xattr had two of the same blocks in it, a super obscure condition, it would crash the engine or something. We found a real bug and we fixed it. So I thought it was really illustrative in
that way. And then also, not for the new engineer but for the medium engineer taking their first major project: maybe someone hasn't added a feature before.
They've only done bug fixes, and they're tasked with adding a feature.
That's much harder.
Something like CanopyCheck, you go to add the feature, you do your first diff.
It's kind of a big diff that touches a bunch of areas.
You think you got it, but maybe you missed something.
You run CanopyCheck.
It fails.
The failure tells you a test case that you failed on. You read the logs of that test case really carefully, you start to understand exactly what's going on and why this is here, and then you can fix it. So I thought in that way it was really enabling. But I would not pretend to say that it makes it easy to land your first diff; it's complicated. That's why I think you want to use it in systems
where it brings a lot of value.
You wouldn't want to use it in systems
where the cost of moving slower isn't worth the benefit of being more correct.
Yeah, that makes a lot of sense.
And it reminds me of the value
of a really good integration test
in a world where it makes sense to write one of those,
whereas with the sync engine, it totally makes sense that you can't integration test for
everything; that just doesn't fly.
People can have all sorts of weird file systems and just weird file system setups.
Yeah.
And it's not even that. I mean, there were also a lot of weird file systems.
I'm not going to lie about that.
But even just like weird things the user can do.
Like I just talked about A going in B, B going in C, C going in A.
For every feature Dropbox has, there's some weird thing like that,
that you're like, who the hell would do that?
But then you think about it for a moment.
You're like, oh, there's actually like 10 people simultaneously working on this folder.
Like this could totally happen.
And so that kind of stuff you end up thinking about, particularly in company context,
if you have 10 people on a team, or like 100 people who are all in a shared folder together,
even if it's an obscure action that happens very rarely, if one person does it, it affects everyone.
And so sometimes it can be a big deal. Like if your obscure bug accidentally clears the permissions on a folder, suddenly everyone doesn't have permissions on the folder. It's a big deal. So often things like, I don't know, not that many
customers use Linux. So maybe this obscure bug is okay to leave on Linux. That was often not a good
enough answer because if you're at a company with one Linux user and they hit the
obscure bug, it affects everyone. And so the calculus of that changes a bit.
That makes sense. I think this is like a pretty good stopping point. Do you have something for
listeners, you know, just venturing out to their first code base, or engineers just starting off? Do you have any advice for listeners like that?
Yeah, I think don't be afraid of making mistakes. And honestly, this is a really hard one to internalize.
Like, I think everyone's going to tell you this,
but it's really hard to internalize
because you're afraid of making mistakes.
Like, it's just in you.
Like, it's going to be there.
Like, you're going to go try to put your code in, but you're like afraid that the code is going to mess
something up. It's okay. A good mentor will reassure you that this is the case, but often
as a mentor, it can be pretty hard to remember what that fear is like.
And so I think it's just really important, as a new engineer: don't be afraid to make mistakes. If you make a mistake, this is a good thing. Just remember, this is how you learn, and the mistake is not your fault, even if there's something you could have done to prevent it. Software is not about blaming the person who wrote the code that made the mistake. That's just a bad way to live life, honestly.
But yeah, try to make the mistake if you can, right? How do I put it? You don't want to do it on purpose, but what you want to do is put yourself in situations where you think, I'm not good enough to understand what's going on here. And then you make the mistake, and then you become good enough to understand what's going on there. And the mistake is not your fault. It's probably honestly fine. Probably everyone around you is having a good time debugging it together. It's okay.
Yeah, that's a good tip.
Yeah, you should basically stretch yourself and not be scared of making large changes, because that's the best way to learn. You learn through code review and debugging things and going on call and putting out a fire from time to time.
Yeah. And maybe the converse advice: if you're mentoring a new engineer, it can be really hard to remember what it was like being in that situation. Really, really encourage mistake-making, encourage taking risks, encourage putting in that diff. Like, you see a bug, you're not sure why this line of code is here, you think it might not need to be here, but you're afraid of deleting it. For someone like a new engineer, I feel like taking that risk, that's how you learn. Maybe something breaks, maybe you feel bad for a moment, but just reassure them.
It's okay.
That's why we do this stuff.
Cool.
Well, thank you, Nipun, for being a guest.
I think this was a lot of fun
and people are going to learn a bunch
from all of your experiences.
Thanks for having me.