Software at Scale - Software at Scale 32 - Derrick Stolee: Principal Software Engineer, GitHub
Episode Date: September 15, 2021

Derrick Stolee is a Principal Software Engineer at GitHub, where he focuses on the client experience of large Git repositories.

Apple Podcasts | Spotify | Google Podcasts

Subscribers might be aware that I’ve done some work on client-side Git in the past, so I was pretty excited for this episode. We discuss the Microsoft Windows and Office repositories’ migrations to Git, recent performance improvements to Git for large monorepos, and more.

Highlights (lightly edited)

[06:00] Utsav: How and why did you transition from academia to software engineering?

Derrick Stolee: I was teaching and doing research at a high level and working with really great people. And I found myself not finding the time to do the work I was doing as a graduate student. I wasn't finding time to do the programming and do these really deep projects. I found that the only time I could find to do that was in the evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then I had a child, and suddenly my evenings and weekends weren't available for that anymore.

And so the individual things I was doing just for myself, the things that were more programming oriented, fell by the wayside. I found myself a lot less happy with that career. And so I decided, you know what, there are two approaches I could take here. One is I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours. Or I could find another job, and I was going to set out to do that.

And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live. This is where Azure DevOps was being built, and they needed someone to help solve some graph problems. So it was really nice that it happened to work out that way. I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry.

[21:00] Utsav: What drove the decision to migrate Windows to Git?

The Windows repository moving to Git was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Previously, Windows used a source control system called Source Depot, which was a fork of Perforce. No one knew how to use this version control system until they got there and learned on the job, and that caused some friction in terms of onboarding people.

But also, if you have people working in the Windows codebase for a long time, they only learn this version control system. They don't know Git, and they don't know what everyone else is using. And so they feel like they're falling behind, and they're not speaking the same language when they talk to somebody who's working with commonly used version control tools. So they saw this as a way to not only update their source control to a more modern tool, but specifically to allow a more free exchange of ideas and understanding. The Windows Git repository is going to be big and have some little tweaks here and there, but at the end of the day, you're just running Git commands and you can go look at Stack Overflow to solve questions, as opposed to needing to talk to specific people within the Windows organization about how to use this version control tool.

Transcript

Utsav Shah: Welcome to another episode of the Software at Scale podcast. Joining me today is Derrick Stolee, who is a Principal Software Engineer at GitHub.
Previously, he was a principal software engineer at Microsoft, and he has a Ph.D. in Mathematics and Computer Science from the University of Nebraska. Welcome.

Derrick Stolee: Thanks, happy to be here.

Utsav Shah: A lot of the work that you do on Git, from my understanding, is similar to the work you did in your Ph.D. around graph theory and related topics. So maybe you can just walk through the beginning: what got you interested in graphs and math in general?

Derrick Stolee: My love of graph theory came from my first algorithms class in college, my sophomore year, just doing simple things like path-finding algorithms. And I got so excited about it, I started clicking around Wikipedia constantly; I just read every single article I could find on graph theory. So I learned about the four-color theorem, and I learned about different things like cliques, and all sorts of different graphs, the Petersen graph, and I just kept on discovering more. I thought, this is interesting to me, it works well with the way my brain works, and I could just model these things while [unclear 01:32]. And as I kept on doing more — for instance, graph theory and combinatorics my junior year for my math major — I decided I wanted to pursue this. Instead of going into software, as I had planned with my undergraduate degree, I decided to pursue a Ph.D., first in math, then I moved over to the joint math and CS program, and I worked on very theoretical math problems, but I would always pair that with the fact that I had this programming and algorithmic background. So I was solving pure math problems using programming, creating these computational experiments; the thing I called it was computational combinatorics. I would write these algorithms to help me solve problems that were hard to reason about, because the cases just became too complicated to hold in your head. But if you could quickly write a program, then over the course of a day of computation you could discover lots of small examples that can either answer the question for you or just give you a more intuitive understanding of the problem you're trying to solve. That was my specialty as I was working in academia.

Utsav Shah: You hear a lot about proofs that are computer-assisted today. Could you walk us through — I'm guessing listeners are not math experts — why that is becoming a thing, and what you did in your thesis, in super layman terms?

Derrick Stolee: There are two very different things you can mean when you say you have an automated proof. There are systems like Coq, which do completely automated formal logic proofs: you specify all the different axioms and the different things you know to be true, and the statement you want to prove, and it constructs the sequence of proof steps. What I was focused on was taking a combinatorial problem — for instance, do graphs with certain substructures exist — and trying to discover those examples using an algorithm finely tuned to find them. One problem was called uniquely K_r-saturated graphs. A K_r is essentially a set of r vertices where every single pair is adjacent, and to be saturated means I don't have one inside my graph, but if I add any missing edge, I'll get one. And then the uniquely part means I'll get exactly one. So now we're at this fine line of: do these things even exist, and can I find some interesting examples?
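For readers who want the object pinned down, here is one way to write the definition Stolee gives above, with r denoting the clique size:

    % A graph G is uniquely K_r-saturated when it contains no r-clique,
    % but completing any missing edge creates exactly one r-clique.
    G \text{ is uniquely } K_r\text{-saturated} \iff
    K_r \nsubseteq G
    \;\text{ and }\;
    \forall\, uv \notin E(G):\;
    \#\{\text{copies of } K_r \text{ in } G + uv\} = 1.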
Derrick Stolee: And so you can just [unclear 04:03] generate every graph of a certain size, but that blows up: you can get to maybe 12 vertices, and every graph up to 12 vertices or so you can enumerate and test. But to get beyond that, and find the interesting examples, you have to zoom in on the search space, to focus on the examples you're looking for. So I wrote an algorithm that said, well, I know I'm not going to have every edge, so let's fix one pair and say this is not an edge. Then we find r minus two other vertices and put all the other edges in, and that's the one unique completion of that missing edge. And then let's continue building in that way, building up all the possible ways you can create those substructures — because they need to exist — as opposed to just generating random little bits. That focused the search space enough that we could get to 20 or 21 vertices and see these interesting shapes show up. From those examples, we found some infinite families and then used regular old-school math to prove that these families were infinite, once we had those small examples to start from.

Utsav Shah: That makes a lot of sense, and that tells me a little bit about how someone might use this in a computer science way. When would I need to use this in, let's say, not my day job, but — what computer science problems would I solve given something like that?

Derrick Stolee: That's the classic thing: asking a mathematician what the applications of the theoretical work are. But I find that whenever you see yourself dealing with a finite problem, and you want to know in what different ways the data can appear, or whether something is possible under some constraints — a lot of things I was running into were similar to problems like integer programming. Trying to find solutions to an integer program is a very general thing, and having those types of tools in your back pocket to solve these problems is extremely beneficial. It also helps to know that integer programming is still NP-hard: if your data has the wrong shape, it will take an exponential amount of time, even though there are a lot of tools that solve most cases, when your data isn't structured in a way that triggers that exponential blowup. So knowing where those data shapes can arise, and how to take a different approach, can be beneficial.

Utsav Shah: And you've had a fairly diverse career after this. I'm curious, what was the transition from doing this stuff to Git, or developer tools in general? How did that end up happening?

Derrick Stolee: I was lucky enough that after my Ph.D. was complete, I landed a tenure-track job in a math and computer science department, where I was teaching and doing research at a high level and working with great people. I had the best possible research group I could ask for, doing interesting stuff, working with graduate students. And I found myself not finding the time to do the work I was doing as a graduate student; I wasn't finding time to do the programming and do these deep projects I wanted. I had a lot of interesting math projects, I was collaborating with a lot of people, I was doing a lot of teaching. But I was finding that the only time I could do that deeper work was in evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then I had a child, and suddenly my evenings and weekends weren't available for that anymore.
And so the individual things I was doing just for myself, the ones that were more programming oriented, fell by the wayside, and I found myself a lot less happy with that career. So I decided there were two approaches I could take: one, I could spend the next year or two winding down my collaborations and spinning up more of this time to work on my own during regular work hours, or I could find another job. And I was going to set out to do that, but, let's face it, my spouse is also an academic, and she had an opportunity to move to a new institution, and that happened soon after I made this decision. So I said, great, let's not do the two-body problem anymore; you take this job, and we'll move right between semesters, during the Christmas break, and I will go and try to find a programming job — hopefully someone will be interested. And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live, and it happened to be the place where what is now known as Azure DevOps was being built. They needed someone to help solve some graph theory problems in the Git space. So it was nice that it happened to work out that way, and I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry. I just said, I did academics, so I'm smart, and I did programming as part of my job, but it was always for myself. So I came in with a lot of humility, saying, I know I'm going to have to learn to work with a team in a professional setting — I did teamwork as an undergrad, but it's been a while — so I'll come in trying to learn as much as I can, as quickly as I can, and contribute in this very specific area you want me to go into. It turns out the area they needed was to revamp the way Azure Repos computed Git commit history, which is a graph theory problem. The interesting thing is that the previous solution did everything in SQL: when you created a new commit, it would say, what is your parent, let me take its commit history out of SQL, add this new commit, and then put that back into SQL. It took essentially a SQL table of commit IDs and squashed it into a varbinary(max) column of a table, which ended up growing quadratically. And also, if you had a merge commit, it would have to take both parents and interleave their histories, in a way that never matched what git log was saying. So it was technically interesting that they were able to do this at all in SQL before I came by. But we needed to have the graph data structure available; we needed to compute dynamically, by walking commits and finding out how they relate, which led to creating a serialized commit-graph that had the topological relationships encoded in concise data — a data file that would be read into memory so we could operate on it very quickly and do things like topological sorting. And we could do interesting file history operations on that instead of the database, and by deleting these database entries that were growing quadratically, we saved something like 83 gigabytes just on the one server that was hosting the Azure DevOps code. So it was great to see that come to fruition.
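The serialized commit-graph described here later became a regular Git feature. As a rough client-side illustration of the same idea (not the Azure DevOps server code), you can generate and inspect one in any repository with a reasonably recent Git:

    # Write a commit-graph file covering all reachable commits. Parents are
    # stored as small integer positions into the file, along with extra
    # metadata such as generation numbers, so history walks can avoid
    # decompressing and parsing each commit object out of the pack files.
    git commit-graph write --reachable

    # The serialized file sits next to the object database.
    ls -lh .git/objects/info/commit-graph

    # Check that the file is consistent with the object store.
    git commit-graph verify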
Utsav Shah: First of all, that's such an inspiring story, that you could get into this and that they gave you a chance as well. Did you reach out to a manager? Did you apply online? I'm just curious how that ended up working.

Derrick Stolee: I do need to say I had a lot of luck and privilege going into this, because I applied and waited a month and didn't hear anything. I had applied to this same group and said, here's my cover letter — and I heard nothing. But I have a friend from undergrad who was one of the first people I knew to work at Microsoft, and I knew he worked on the Visual Studio client editor. I said, well, this thing that's now Azure DevOps was called Visual Studio Online at the time — do you know anybody from this Visual Studio Online group? I've applied there, haven't heard anything, and I'd love it if you could get my resume to the top of the list. It turns out he had worked with somebody who had done the Git integration in Visual Studio, who happened to be located at this office, and who then got my name to the top of the pile. That got me to the point where I was having a conversation with the person who would be my skip-level manager, and he honestly had a conversation with me to try to suss out: am I going to be a good team player? There's not a great history of PhDs working well with engineers, probably because they just want to do their academic work and stay in their own space. I remember one particular question: sometimes we ship software, and before we do that we all get together, and everyone spends an entire day trying to find bugs, and then we spend a couple of weeks trying to fix them — they call it a bug bash — is that something you're interested in doing? And I'm 100% wanting to be a good citizen, a good team member, I am up for that; if that's what it takes to be a good software engineer, I will do it. I could sense the hesitation and the trepidation, the looking at me more closely, but overall, once I got into the interview — they were still doing whiteboard interviews at that time — it felt almost unfair, because my phone screen interview was a problem I had assigned my C programming students as homework. So it's like, sure, you want to ask me this? I have a little bit of experience with problems like this. So I was eager to show up and prove myself. I know I made some very junior mistakes at the beginning, just learning what it's like to work on a team — what it's like to check in a change, complete that pull request at 5 pm, then get in your car and go home, and realize once you're out there that you had a problem and you've caused the build to go red. Oh no, don't do that. So I had those mistakes, but I only needed to learn them once.

Utsav Shah: That's amazing. And going back to your earlier point about [inaudible 14:17] Git commit history and storing all of that in SQL — we had to deal with an extremely similar problem, because we maintain a custom CI server, and we tried doing Git [inaudible 14:26] and tried to implement that on our own, and that did not turn out well. So maybe you can walk listeners through: why is that so tricky? Why is it so tricky to say, is this commit before another commit, is it after another commit, what's the parent of this commit? What's going on, I guess?

Derrick Stolee: Yes, the thing to keep in mind is that each commit has a list of parents — one parent, or multiple in the case of a merge — and that just tells you what happened immediately before this commit. But if you have to go back weeks or months, you're going to be traversing hundreds or thousands of commits, and these merge commits are branching.
And so not only are we going deep in time — think about the first-parent history, which is all the pull requests that have merged in that time — but imagine you're also traversing all of the commits that were in the topic branches of those merges. So you go both deep and wide when you're doing this search. And by default, Git stores all of these commits as plain-text objects in its object database: you look one up by its commit SHA, then you find its location in a pack file, decompress it, parse the text to find the different information about it — what's its author date, its committer date, what are its parents — and then you go find those and keep iterating. That's a very expensive operation over that many commits, and especially when the answer is "no, it's not reachable," you have to walk every single reachable commit before you can say no. Both of those things cause significant delays in answering these questions, which was part of the reason for the commit-graph file. It started when I was doing the Azure DevOps server work, but it's now a core Git client feature. First, it avoids going to the pack file and loading this plain-text document you have to decompress and parse, because it has well-structured information that tells me where in the commit-graph file the next commit is. So I don't have to store the whole object ID; I just have a little four-byte integer saying my parent is this entry in this table of data, and you can jump quickly between them. The other benefit is that we can store extra data that isn't native to the commit object itself — specifically, the generation number. The generation number says: if I don't have any parents, my generation number is one, so I'm at level one. But if I have parents, my number is one larger than the maximum of my parents'. So if I have one parent that is one, I'm two, and then three; and if I merge, and my parents are four and five, I'm going to be six. What that allows me to do is this: if I see two commits, and one has generation number 10 and one has 11, then the one with generation number 10 can't reach the one with 11, because that would mean an edge goes in the wrong direction. It also means that if I'm looking for the one with 11, and I started at 20, I can stop when I hit commits at generation 10. So this gives us extra ways of visiting fewer commits to answer these questions.

Utsav Shah: So maybe a basic question: why does the system care about what the parents of a commit are? Why does that end up mattering so much?

Derrick Stolee: Yes, it matters for a lot of reasons. One is if you just want to go through the history of what changes have happened to your repository, specifically file history: the way to get commits in order is not to say, give me all the commits that changed the file, and then sort them by date, because the commit date can be completely manufactured. And maybe something that was committed later merged earlier, or something else. By understanding those relationships of where the parents are, you can realize, this thing was committed earlier but it landed in the default branch later, and I can see that from the way the commits are structured through these parent relationships. And a lot of problems we see, with people saying, where did my change go, or what happened here — it's because somebody did a weird merge. And you can only find that out by doing some interesting things with git log to say, this merge caused a problem and caused your file history to get mixed up; somebody resolved the merge incorrectly, and that caused this problem where somebody's change got erased — and you need to use these parent relationships to discover that.
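As a small sketch of how the parent relationships and generation numbers pay off on the client (the commit names and path below are placeholders, and this assumes a commit-graph has been written as in the earlier example):

    # "Is A an ancestor of B?" -- exit code 0 means yes, 1 means no.
    # Generation numbers let Git stop walking early instead of visiting
    # essentially every reachable commit before answering "no".
    git merge-base --is-ancestor <commit-A> <commit-B> && echo "reachable"

    # Topological walks, such as a graph log or a path-limited history,
    # benefit from the same data.
    git log --graph --oneline -20
    git log --oneline -- path/to/file.c   # hypothetical path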
Utsav Shah: Should everybody just be using rebase versus merge? What's your opinion?

Derrick Stolee: My opinion is that you should use rebase to make sure that the commits you are trying to get reviewed by your coworkers are as clear as possible. Present a story: make your commits good, tell me in the messages why you're making this one small change, and how the sequence of commits creates a beautiful story that tells me how I get from point A to point B. And then you merge it into your branch with everyone else's, and then those commits are locked: you can't change them anymore, you do not rebase them, you do not edit them. Now they're locked in, and the benefit of doing that is that I can present this best story, which is not only good for the people reviewing it at the moment, but also when I go back in history and ask, why did I change it that way? You've got all the reasoning right there. And then you can also do things like git log --first-parent to show me only which pull requests were merged into this branch. That's it — I don't see people's individual commits, I see this one was merged, this one was merged, this one was merged, and I can see the sequence of those events, and that's the most valuable thing to see.

Utsav Shah: Interesting. And then a lot of GitHub workflows just squash all of your commits into one, which I think is the default, or at least a lot of people use it. Any opinions on that? Because the Git project's own workflow keeps the separate commits and then merges them — do you have an opinion on that?

Derrick Stolee: Squash merges can be beneficial. The thing to keep in mind is that they're typically beneficial for people who don't know how to do an interactive rebase, so their topic branch looks like a lot of random commits that don't make a lot of sense: I tried this, then I took a break, then I fixed a bug, and I kept going forward, I'm responding to feedback — that's what it looks like. If those commits aren't going to be helpful to you in the future to diagnose what's going on, and you'd rather just say this pull request is the unit of change, a squash merge is fine; it's fine to do that. The thing I find problematic is that new users also don't realize that they need to restart their branch from that squash merge before they continue working. Otherwise, they'll bring in those commits again, and their pull request will look very strange. So there are some unnatural bits to using squash merges that require people to say, let me just start over from the main branch again for my next piece of work. And if you don't remember to do that, it's confusing.
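A minimal sketch of the workflow being described, assuming a hypothetical topic branch called my-topic and a default branch called main:

    # Before review: rewrite the topic branch so the commit series tells a
    # clear story (reorder, squash fixups, reword messages).
    git fetch origin
    git rebase -i origin/main my-topic

    # After the pull request merges, read history one pull request at a
    # time by following only the first parent of each merge commit.
    git log --first-parent --oneline origin/main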
Utsav Shah: Yes, that makes a lot of sense. So going back to your story: you started working on improving Git interactions in Azure DevOps. When did the whole idea of "let's move the Windows repository to Git" begin, and how did that evolve?

Derrick Stolee: Well, the biggest thing is that the Windows repository moving to Git was decided before I came; it was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Windows was using this source control system called Source Depot, which was a literal fork of Perforce, and no one knew how to use it until they got there and learned on the job. That caused some friction: onboarding people is difficult. But also, if you have people working in the Windows codebase for a long time, they only learn this version control system. They don't know what everyone else is using, and so they feel like they're falling behind, and they're not speaking the same language when they talk to somebody who's working in the version control system most people are using these days. So they saw this as a way to not only update the way their source control works to a more modern tool, but specifically Git, because it allowed a more free exchange of ideas and understanding. It's going to be a monorepo, it's going to be big, it's going to have some little tweaks here and there, but at the end of the day you're just running Git commands, and you can go look at Stack Overflow for how to solve your Git questions, as opposed to needing to talk to specific people within the Windows organization about how to use this tool. That, as far as I understand, was a big part of the motivation to get it working. When I joined the team, we were in the middle of making sure that our Git implementation scales, and the thing that's special about Azure DevOps is that it doesn't use the core Git codebase: it has a complete reimplementation of the server side of Git in C#. So it was rebuilding a lot of things just to be able to do the core features, but in a way that worked in its deployment environment, and it had done a pretty good job of handling scale. But the issue was that the Linux repo was still a challenge to host. At that time, it had half a million commits, maybe 700,000, and its number of files is rather small. We were struggling, especially with the commit history being so deep, to handle that; and even the [inaudible 24:24] Azure DevOps repo, with maybe 200 or 300 engineers working on it in their daily work, was moving at a pace that was difficult to keep up with. So those scale targets were things we were dealing with daily, handling and working to improve, and we could see that improvement in our own daily lives as we were moving forward.

Utsav Shah: So how did you tackle the problem? You're on this team now, and you know that you want to improve the scale of this, because 2,000 developers are going to be using this repository; you have two or three hundred people now, and it's already not perfect. My first impression is you sit down, you start profiling code, and you understand what's going wrong. What did you all do?

Derrick Stolee: You're right about the profiler. We had a tool — I forget what it's called — that would run on every tenth request, selected at random; it would run a .NET profiler and save those traces into a place where we could download them. So we could say, you know what, Git commit history is slow, and now that we've written it in C# as opposed to SQL, it must be the C# code's fault. Let's go see what's going on there and see if we can identify the hotspots. You pull a few of those traces down and see what's identified, and a lot of it was chasing that: I made this change, let's make sure the timings improve; I see some outliers over here, they're still problematic; we find those traces and are able to go and identify the core parts to change.
Some of them were more philosophical: we need to change data structures, we need to introduce things like generation numbers, we need to introduce things like Bloom filters for file history, in order to speed that up, because we're spending too much time parsing commits and trees. And once we got that far, it was time to assess whether or not we could handle the Windows repo. I think it would have been January or February 2017. My team was tasked with doing scale testing in production. They had the full Azure DevOps server ready to go with the Windows source code in it — it didn't have developers using it, but it was a copy of the Windows source code — and they were already using that same server for work item tracking; they had already transitioned their work tracking to Azure Boards. And they said, go and see if you can make this fall over in production; that's the only way to tell if it's going to work or not. So a few of us got together and built a bunch of things to drive the REST API. We were pretty confident that the Git operations were going to work, because we had a caching layer in front of the server that was going to absorb that load. So we went with the idea of: let's hammer the REST API, make a few changes, create a pull request and merge it, and go through that cycle. We started by measuring how often developers do that in Azure DevOps itself, then scaled it up to see where it would go, and we crashed the job agents, because we found a bottleneck. It turns out we were using libgit2 to do merges, and that required going into native code, because it's a C library, and we couldn't have too many of those running, because they each took a gig of memory. Once that native code was running out of memory, things were crashing, so we ended up having to put a limit on how many could run at once. But that was the only fallout, and we could then say, we're ready to bring it on and start transitioning people over. When users are in the product and they think certain things are rough or difficult, we can address them, but right now they're not going to cause a server problem. So let's bring it on. And I think it was a few months later that they started bringing developers from Source Depot into Git.

Utsav Shah: So it sounds like there was some server work to make sure that the server doesn't crash, but the majority of the work you had to focus on was inside Git. Does that sound accurate?

Derrick Stolee: In parallel with, and before, my time was the creation of what's now called VFS for Git. It was GVFS at the time — we realized, don't let engineers name things, they won't do it well — so we renamed it to VFS for Git; it's a virtual file system for Git. A lot of [inaudible 28:44], because the Source Depot version that Windows was using had a virtualized file system in it, to allow people to download only the portion of the working tree that they needed. They could build whatever part they were in, and it would dynamically discover what files you need to run that build. So we did the same thing on the Git side: let's make the Git client — modified in some slight ways, using our fork of Git — think that all the files are there.
And then when a file is [inaudible 29:26], we hook into it through a file system event; it communicates with the .NET process, which says, you want that file, goes and downloads it from the Git server, puts it on disk, and tells you what its contents are, and now you can open it. So it's dynamically downloading objects. This required a version of the protocol that we call the GVFS protocol, which is essentially an early version of what's now called partial clone in Git: you can go get the commits and trees — that's what you need to be able to do most of your work — but when you need the file contents, the blob of a file, we can download it as necessary and populate it on your disk. The other piece is the virtualization, the idea that if you just run ls at the root directory, it looks like all the files are there. And that causes some problems if you're not used to it. For instance, if you open VS Code at the root of your Windows source code, it will populate everything, because VS Code starts crawling and trying to figure out, I want to do searching and indexing, I want to find out what's there. Windows developers were used to this — they already had this problem, so they were used to using tools that didn't do that — but we found out about it when we started saying, VFS for Git is this thing that Windows is using, maybe you could use it too. We'd hear: well, this was working great, then I opened VS Code, or I ran grep, or some other tool came in and decided to scan everything, and now I'm slow again, because I have absolutely every file in my monorepo in my working directory for real. So that led to some concerns that it wasn't necessarily the best way to go. But it did, specifically with that GVFS protocol, solve a lot of the scale issues, because we could stick another layer of servers closely located to the developers. For instance, say you have a lab of build machines: put one of these cache servers in there, so the build machines all fetch from it, and you get quick throughput and small latency, and they don't have to bug the origin server for anything but the refs. You do the same thing near the developers, and that solved a lot of our scale problems, because you don't have these thundering herds of machines coming in and asking for all the data all at once.

Utsav Shah: We had a super similar concept of repository mirrors that would be listening to a change stream; every time anything changed on a repo, it would run a fetch on all the servers. So it's remarkable how similar the problems we're thinking about are. One thing I was wondering about: VFS for Git makes sense — what's the origin of the FS Monitor story? For listeners, FS Monitor is the file system monitor in Git that decides whether files have changed or not without running [inaudible 32:08] that lists every single file. How did that come about?

Derrick Stolee: There are two sides to the story. One is that as we were building all these features custom for VFS for Git, we were doing it inside the microsoft/git fork on GitHub, working in the open — so you can see all the changes we're making, it's all GPL — but we were making changes in ways that let us go fast, and we were not contributing them upstream to core Git.
Because of the way VFS for Git works, we have this process that's always running, watching the file system and getting all of its events, so it made sense to say, well, we can speed up certain Git operations, because we don't need to go looking for things. We don't want to run a bunch of lstat calls, because that will trigger the download of objects. So we can defer to that process to tell us what files have been updated and what's new, and that created the idea of what's now called FS Monitor. The people who had built that tool for VFS for Git contributed a version of it upstream that used Facebook's Watchman tool through a hook. So it created this hook, the fsmonitor hook: it would say, tell me what's been updated since the last time I checked, and Watchman, or whatever tool was on the other side, would say, here's the small list of files that have been modified — you don't have to go walking all of the hundreds of thousands of files, because you just changed these [inaudible 0:33:34]. And the Git command could store that and be fast at doing things like git status and git add. So that was something contributed mostly out of the goodness of their hearts: we have this idea, it worked well in VFS for Git, we think it can work well for other people in regular Git, so here we go, contributing it and getting it in. It became much more important to us when we started supporting the Office monorepo, because they had a similar situation — they were moving from their version of Source Depot into Git, and they thought VFS for Git was just going to work.

The issue is that Office also has tools that they build for iOS and macOS, so they have developers on macOS, and the team had started by building a similar file system virtualization for macOS using kernel extensions. It was very far along in the process when Apple said, we're deprecating kernel extensions, you can't do that anymore; if you're someone like Dropbox, go use this thing, or use this other thing. We tried both of those things, and none of them work in this scenario: they're either too slow, or they're not consistent enough. For instance, if you're in Dropbox, and you say, I want to populate my files dynamically as people ask for them — the way Dropbox or OneDrive does that now, the operating system can decide, I'm going to delete this content because the disk is getting too big; you don't need it, because you can just get it from the remote again. That inconsistency was something we couldn't handle, because we needed to know that content, once downloaded, was there. So we were at a crossroads, not knowing where to go. But then we decided to take an alternative approach: let's look at how the Office monorepo is different from the Windows monorepo. It turns out they had a very componentized build system, where if you wanted to build Word, you knew what you needed to build Word: you didn't need the Excel code, you didn't need the PowerPoint code, you needed the Word code and some common bits shared by all the clients of Microsoft Office. This was ingrained in their project system. So if you know that in advance, could you just tell Git, these are the files I need to do my work and to do my build? And that's what they were doing in their version of Source Depot: they weren't using a virtualized file system there, they were just enlisting in the projects they care about.
So when some of them were moving to Git with VFS for Git, they were confused: why do I see so many directories? I don't need them. So we decided to make a new way of taking all the good bits from VFS for Git, like the GVFS protocol that allowed us to do the reduced downloads, but instead of a virtualized file system, to use sparse checkout. That's a Git feature that allows you to say: tell Git, only give me the files within these directories, and ignore everything outside. And that gives us the same benefit of working in a smaller working directory than the whole thing, without needing a virtualized file system. But now we need that file system monitor hook we added earlier, because if I still have 200,000 files on my disk, and I edit a dozen, I don't want to walk all 200,000 to find those dozen. So the file system monitor became top of mind for us, particularly because we want to support Windows developers, and Windows process creation is expensive, especially compared to Linux — Linux process creation is super fast. So having a hook run that does some shell-script work to communicate with another process and then comes back — just that, even if it doesn't have to do anything, was expensive enough to say we should remove the hook from this equation. And also, there are some things Watchman does that we don't like and that aren't specific enough to Git, so let's make a version of the file system monitor that is built into Git. That's what my colleague Jeff Hostetler is working on right now. It's being reviewed in the core Git client, and it's available in Git for Windows if you want to try it, because the Git for Windows maintainer is also on my team, so we can get an early version in there. But we want to make sure this is available to all Git users: there's an implementation for Windows and macOS, and it's possible to build one for Linux, we just haven't included it in this first version. Our target is to remove that overhead. I know that you at Dropbox had a blog post where you got a huge speedup just by replacing the Perl script hook with a Rust hook, is that correct?

Utsav Shah: It was the Go hook at first, yes, but eventually we replaced it with the Rust one.

Derrick Stolee: Excellent. And you also did some contributions to help make this hook system a little bit better and fix a few bugs.

Utsav Shah: I think yes, one or two bugs, and it took me a few months of digging to figure out what exactly was going wrong. It turned out there's this one environment variable, which you or somebody else added, to skip process creation, and we just had to make sure Git wrote its untracked caches. We forced that environment variable to be true to make sure we cache every time you run git status, so subsequent git statuses are not slow, and things worked out great. We ended up shipping a wrapper that turned on the environment variable, and things worked amazingly well. So, that was so long ago.
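For reference, on Git versions new enough to ship the built-in file system monitor (it was still being upstreamed when this conversation was recorded), the setup looks roughly like this:

    # Enable the built-in file system monitor daemon and the untracked
    # cache, so `git status` only inspects paths reported as changed
    # instead of walking the whole working tree.
    git config core.fsmonitor true
    git config core.untrackedCache true

    # Confirm the daemon is watching this working directory.
    git fsmonitor--daemon status

    # Subsequent status calls should touch far fewer files.
    git status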
Utsav Shah: How long does this process creation take on Windows? I guess that's one question I've had for you for a while — why did we skip writing that cache? Do you know what was slow about creating processes on Windows?

Derrick Stolee: Well, I know that there are a bunch of permission things that Windows does; it has a lot of checks about whether you can create a process of this kind and exactly what elevation privileges you have. And there are a lot of things like that that have built up, because Windows is very much about maintaining backward compatibility with a lot of these security measures. So I don't know all the details, but I do know it's something on the order of 100 milliseconds. So it's not something to scoff at, and it's also something Git for Windows, in particular, has difficulty with, because it has to go through a bunch of translation layers to take this tool that was built for a Unix environment, with dependencies on things like shell and Python and Perl, and make sure it can work in that environment. That's an extra cost Git for Windows has to pay on top of even a normal Windows process.

Utsav Shah: Yes, that makes a lot of sense. And maybe some numbers — I don't know how much you can share — like, how big were the Windows and the Office monorepos when they decided to move from Source Depot to Git? What are we talking about here?

Derrick Stolee: The biggest numbers we think about are: how many files do I have if I do nothing but check out the default branch and ask, how many files are there? I believe the Windows repository was somewhere around 3 million, and the uncompressed data was something like 300 gigabytes — those 3 million files take up that much. I don't know the full size for Office, but it is 2 million files at head. So definitely a large project. They did their homework in terms of removing large binaries from the repository, so they're not big because of that — it's not as if Git LFS is going to be the solution for them. They have mostly source code and small files, and those are not the reason for their growth. The reason for their growth is that they have so many files, and so many developers moving that code around, adding commits, and collaborating, that it's just going to get big no matter what you do. At one point, the Windows monorepo had 110 million Git objects, and I think over 12 million of those were commits, partly because they had some build machinery that would commit 40 times during its build. So they reined that in, and they did a history cut and started from scratch, and now it's not moving nearly as quickly, but it's still a very similar size, so they've got more runway.

Utsav Shah: Yes. Maybe just for comparison for listeners: the numbers I remember from 2018, the biggest open-source repositories that had people contributing to Git performance were things like Chromium. I remember Chromium being roughly 300,000 files, and there were a couple of Chromium engineers contributing to Git performance. So this is an order of magnitude bigger than that — 3 million files. I don't think there are a lot of people moving such a large repository around, especially with that kind of history, with 12 million commits; it's just a lot. What was the reaction, I guess, of the open-source community, the maintainers of Git, when you decided to help out? Did you have a conversation to start with, or were they just super excited when you reached out on the mailing list? What happened?

Derrick Stolee: So for full context, I switched over to working on the client side and contributing to upstream Git kind of after all of VFS for Git was announced and released as open-source software.
And so I can only gauge what I saw from people afterward, and from people I've come to know since then, but the general reaction was: yes, it's great that you can do this, but if you had contributed to Git, everyone would benefit. And part of it was that the initial plan wasn't ever to open-source it; the goal was to make this work for Windows, and if that's the only group that ever uses it, that's a success. It turns out that "because we can host the Windows source code, we can handle your source code" was kind of a marketing point for Azure Repos, and that was a big push to put it out there in the world. But it also needed this custom thing that's only on Azure Repos, and we created it with our own opinions, in ways that wouldn't be up to snuff with the Git project. And so things like FS Monitor and partial clone are direct contributions from Microsoft engineers of that time, saying, here's a way to contribute the ideas that made VFS for Git work back into Git. That was an ongoing effort to bring those ideas back, but it kind of started after the fact: hey, we are going to contribute these ideas, but first we needed to ship something. So we shipped something without working with the community. But I think that over the last few years, especially with the way we've shifted our strategy to do sparse-checkout things with the Office monorepo, we've been much more able to align on the things we want to build: we can build them for upstream Git first, and then we can benefit from them, and we don't have to build things twice. And we don't have to do something special that's only for our internal teams — which, again, once they learn that thing, it's different from what everyone else is doing, and we have the same problem again. So right now the things Office depends on are sparse checkout; yes, they're using the GVFS protocol, but to them you can just call it partial clone and it's the same from their perspective. In fact, the way we've integrated it for them is that we've gone underneath the partial clone machinery from upstream Git and just taught it to speak the GVFS protocol. So we're much more aligned: because we know things are working for Office, upstream Git is much more suited to handle this kind of scale.

Utsav Shah: That makes a ton of sense, and given that, it seems like the community wanted you to contribute these features back, and that's just so refreshing — they wanted to help. I don't know if you've heard those stories of people trying to contribute to Git; Facebook has this famous story of trying to contribute to Git a long time ago, not being successful, and choosing to go with Mercurial. I'm happy to see that finally we could add all of these nice things to Git.

Derrick Stolee: And I should give credit to the maintainer, Junio Hamano, and people who are now my colleagues at GitHub, like Jeff King (Peff), and also other Git contributors at companies like Google, who took time out of their day to help us learn what it's like to be a Git contributor — and not just open source generally, because merging pull requests on GitHub is a completely different thing than working on the Git mailing list and contributing patch sets via email.
So there was learning how to do that, and also, the level of quality expected is so high — how can we navigate that space as new contributors who have a lot of ideas and are motivated to do this good work? We needed to get over the hump of getting into this community and establishing ourselves as good citizens trying to do the right thing.

Utsav Shah: And maybe one more selfish question from my side. One thing that I think Git could use is some kind of plugin system, because today, if somebody checks PII into our repository on the main branch, from my understanding it's extremely hard to get rid of that without doing a full rewrite — some kind of plugin for companies where they can rewrite or hide things on the servers. Does GitHub have something like that?

Derrick Stolee: I'm not aware of anything on the GitHub or Microsoft side for that. We generally try to avoid it with pre-receive hooks: when you push, we will reject it for some reason if we can; otherwise, it's on you to clean up the data. Part of that is because we want to make sure that we are maintaining repositories that are still valid, that are not going to be missing objects. I know that Google's source control tool, Gerrit, has a way to obliterate these objects, and I'm not exactly sure how it works when Git clients are fetching and cloning — they'll say, I don't have this object, and complain — I don't know how they get around that. And with the distributed nature of Git, it's hard to say that the Git project should take on something like that, because it centralizes things to such a degree: you have to say, yes, you didn't send me all the objects you said you were going to, but I'll trust you anyway. That trust boundary is something Git is cautious about violating.

Utsav Shah: Yes, that makes sense. And now to the non-selfish questions: maybe you can walk listeners through, why does Git need a Bloom filter internally?

Derrick Stolee: Sure. Let's think specifically about commit history. Say you're in a Java repo — a repo that uses the Java programming language — and your directory structure mimics your namespace. So to get to your code, you go down five directories before you find your code file. Now, in Git, that's represented as: I have my commit, then I have my root tree, which describes the root of my working directory, and then for each of those directories I have another tree object, and another tree object, and then finally my file. So when we want to do a history query — say, which commits changed this file — I go to my first commit and say, let's compare it to its parent. I go to the root trees; well, they're different, okay. Let me open them up, find which tree object they have at the first portion of the path, and see if those are different. They're different, let me keep going. You go all the way down these five levels, and you've opened up ten trees in this diff, just to parse these things, and if those trees are big, that's expensive to do. And at the end, you might find out, wait a minute, the blobs are identical way down here, but I had to do all that work to find that out. Now multiply that by a million. To find out that this file was changed 10 times in a history of a million commits, you have to do a ton of work to parse all of those trees.
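You can see the nesting being described by printing the objects yourself; the deep path below is hypothetical:

    # A commit object records its root tree, parents, author, and message.
    git cat-file -p HEAD

    # Each directory level is a separate tree object that must be loaded
    # and parsed; diffing two commits at a deep path repeats this at every
    # level for both sides of the comparison.
    git cat-file -p 'HEAD^{tree}'                    # root directory
    git cat-file -p HEAD:src                         # one level down
    git cat-file -p HEAD:src/main/java/com/example   # several levels down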
Derrick Stolee: So the Bloom filters come in as a way to say: can we guarantee, most of the time, that a commit did not change that path? We expect that most commits did not change the path you're looking for. What we do is inject the filters into the commit-graph file, because that gives us a quick way to index them: I'm at a commit at a position in the commit-graph file, and I can find where its Bloom filter data is. The Bloom filter stores which paths were changed by that commit, and a Bloom filter is what's called a probabilistic data structure. It doesn't list those paths, which would be expensive: if I actually listed every single path that changed at every commit, I would have that sort of quadratic growth again, and the data would be in the gigabytes even for a small repo. With the Bloom filter, I only need about 10 bits per path, so it's compact. The thing we sacrifice is that sometimes it says yes for a path where the answer is actually no; but the critical thing is, if it says no, you can be sure it's no, and its false-positive rate is 2% at the compression settings we're using. So if I think about the history of my million commits, for 98% of them the Bloom filter will say no, it didn't change, and I can immediately go to my next parent and say this commit isn't important, let's move on. For the other 2%, I still have to go and parse the trees, and for the 10 commits that actually changed it, the filter will say yes, so I'll parse them and get the right answer. But we've significantly reduced the amount of work we had to do to answer that query. And it's important when you're in these big monorepos, because you have so many commits that didn't touch the file, and you need to be able to skip them.

Utsav Shah: At what point — at what number of files in a repository — do I have to start thinking about this? Because for the size-of-files thing you mentioned, you can just use LFS; that should solve a lot of problems. The number of files is the real problem. At what number of files do I have to start thinking, okay, I want to use these Git features like sparse checkout and the commit-graph and so on? Have you noticed a tipping point like that?

Derrick Stolee: Yes, there are some tipping points, but it's all about whether you can take advantage of the different features. To start, I can tell you that if you have a recent version of Git, say from the last year, you can go to whatever repository you want and run git maintenance start. Just do that in every [inaudible 52:48] of moderate size, and that's going to enable background maintenance. It turns off auto-GC, because it's going to run maintenance on a regular schedule; it'll do things like fetch for you in the background, so that when you run git fetch it just updates the refs and is really fast; and it also keeps your commit-graph up to date. Now, by default that doesn't contain the Bloom filters, because the Bloom filters are an extra data cost and most clients don't need them — you're not doing the deep queries that you need to do at web scale, like the GitHub server. The GitHub server does generate those Bloom filters, so when you do a file history query on GitHub, it's fast. But background maintenance does give you the commit-graph, so you can do things like git log --graph fast: the topological sorting it has to do for that can use the generation numbers to be quick, whereas before it would take six seconds just to show 10 commits, because it had to walk all of them — now you get that for free.
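Locally, the changed-path Bloom filters can be added to the commit-graph when you do want fast path-limited history; the path below is a placeholder:

    # Include changed-path Bloom filters in the commit-graph so that a
    # history walk can skip the large majority of commits whose filters
    # say "this path definitely did not change here".
    git commit-graph write --reachable --changed-paths

    # A path-limited log is the kind of query that benefits.
    git log --oneline -- src/main/java/com/example/Widget.java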
Derrick Stolee: So whatever size your repo is, you can just run that command and you're good to go; it's the only time you have to think about it — run it once, and your posture is going to be good for a long time. The next level, I would say, is: can I reduce the amount of data I download during my clones and fetches? That's what partial clone is for, and the flavor I prefer is blobless clones: you run git clone --filter=blob:none. I know it's complicated, but it's what we have, and it just says, okay, filter out all the blobs and just give me the commits and trees that are reachable from the refs. Then, when I do a checkout or a history query, I'll download the blobs I need on demand. So don't just get on a plane and try to do checkouts and expect it to work — that's the one thing you have to understand. But as long as you have a network connection relatively frequently, you can operate as if it's a normal Git repo, and that can make your fetch times and clone times fast and your disk usage a lot smaller. That's the next level of boosting your scale, and it works a lot like LFS: LFS says, I'm only going to pull down these big LFS objects when you do a checkout; this uses a different mechanism, because these are your regular Git blobs. And then the next level is: okay, I'm only getting the blobs I need, but can I use even fewer? This is the idea of using sparse checkout to scope your working directory down. I like to say that beyond 100,000 files is where you can start thinking about using it — I start seeing Git chug along when you get to 100,000 to 200,000 files. So if you can max out at that level, preferably less, that would be great, and sparse checkout is a way to do that. The issue we're seeing right now is that you need a connection between your build system and sparse checkout, to say: hey, I work in this part of the code, what files do I need? Now, if that's relatively stable, and you can identify, you know what, all the web services are in this directory, that's all I care about, and all the client code is over there, I don't need it — then a static sparse-checkout will work: you just run git sparse-checkout set with whatever directories you need, and you're good to go. The issue is if you want to be precise and say, I'm only going to get this one project I need — but it depends on these other directories, and those dependencies might change, and their dependencies might change; that's when you need to build that connection. So Office has a tool they call scooper that connects their project dependency system to sparse checkout and helps them do that automatically. But if your dependencies are relatively stable, you can manually run git sparse-checkout, and that's going to greatly reduce the size of your working directory, which means Git is doing less when it runs checkout, and that can help out.
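Putting those layers together, a minimal setup for a very large repository might look like the following; the URL and directory names are placeholders:

    # 1. Blobless partial clone: download commits and trees up front and
    #    fetch file contents on demand; --sparse starts with a minimal
    #    checkout instead of the whole tree.
    git clone --filter=blob:none --sparse https://example.com/big/monorepo.git
    cd monorepo

    # 2. Cone-mode sparse checkout: only materialize the directories you
    #    actually build in.
    git sparse-checkout init --cone
    git sparse-checkout set services/web common/libs

    # 3. Background maintenance: scheduled prefetch, commit-graph updates,
    #    and repacking instead of blocking auto-GC.
    git maintenance start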
Utsav Shah: That's a great incentive for developers to keep their code clean and modular, so you're not checking out the world, and eventually it helps you in all these different ways. Maybe for a final question here: what are you working on right now? What should we be excited about in the next few versions of Git?

Derek Stolee: I'm working on a project this whole calendar year — and I'm not going to be done with it by the time the calendar year is done — called the sparse index. It's related to sparse checkout, but it's about dealing with the index file. If you go into your Git repository and look at .git/index, that file is a copy of what Git thinks should be at HEAD and what it thinks is in your working directory. So when you run git status, it has walked all those files and recorded, this is the last time each one was modified, or when I expect it was modified, and any difference between the index and what's actually in your working tree is something Git has to do work to sync up. Normally that's fast because the index isn't that big, but when you have millions of files, every single file at HEAD has an entry in the index. Even worse, if you have a sparse checkout — even if only 100,000 of those two million files are in your working directory — the index still has two million entries in it; most of them are just marked with what's called the skip-worktree bit that says, don't write this file to disk. So for the Office monorepo, this file is 180 megabytes, which means every single git status has to read 180 megabytes from disk, and with the FS Monitor hook in play, it has to rewrite the file to store the latest FS Monitor token, so it writes those 180 megabytes back out too. So a git status takes five seconds, even though it barely reports anything; you just have to load this thing up and write it back down.

The sparse index says: because we're using sparse checkout in a specific mode called cone mode, which is directory-based rather than file-path-based, once I get to a certain directory I know that none of the files inside it matter. So let's store that directory and its tree object in the index instead. It's a kind of placeholder that says, I could recover all the files that would be in this directory by parsing trees, but I don't want them in my index; there's no reason for that. I'm not manipulating those files when I run git add, I'm not manipulating them when I do git commit, and even if I do a git checkout, I don't care — I just want to replace that tree with whatever the checkout says the tree should be. It doesn't matter for the work I'm doing. For a typical developer in the Office monorepo, this reduces the index size to 10 megabytes, so it's a huge shrink, and it's unlocking so much potential in terms of performance: our git status times are now around 300 milliseconds on Windows, and on Linux and macOS, which are also platforms we support for the Office monorepo, it's even faster.
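To make the entry-count difference tangible, here is a deliberately simplified Python model — my own illustration, not Git's index format — of how directories outside the sparse-checkout cone collapse from one entry per file into a single tree entry.

```python
# Simplified model of the sparse index idea: paths outside the cone collapse
# into one "tree" entry per out-of-cone top-level directory instead of one
# entry per file. Illustration only; Git's real index layout is more involved.

ALL_FILES = [
    "word/app.c", "word/render.c",
    "shared/text/unicode.c",
    "excel/engine.c", "excel/charts.c",
    "powerpoint/slides.c",
]
SPARSE_CONE = {"word", "shared/text"}

def in_cone(path):
    return any(path.startswith(d + "/") for d in SPARSE_CONE)

def full_index(files):
    # Classic index: one entry per file at HEAD, even for files that are
    # marked skip-worktree and never materialized on disk.
    return sorted(files)

def sparse_index(files):
    entries = set()
    for path in files:
        if in_cone(path):
            entries.add(path)                         # normal file entry
        else:
            entries.add(path.split("/", 1)[0] + "/")  # collapsed tree entry
    return sorted(entries)

print(len(full_index(ALL_FILES)), "entries in the full index")      # 6
print(len(sparse_index(ALL_FILES)), "entries in the sparse index")  # 5 in this toy case
```

The toy numbers are small, but at monorepo scale the same collapse is what turns millions of mostly skip-worktree file entries into something closer to the size of the cone, which is the 180 MB to 10 MB difference described above.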
So that's what I'm working on. The issue is that there are a lot of places in Git that care about the index, and they treat it as a flat array of entries where every entry is expected to be a file name. All of those places in the Git codebase need to be updated to say, well, what happens if I have a directory here — what should I do? The sparse index format itself has already been released over the last two versions of Git, along with a protection that says, if I have a sparse index on disk but I'm in a command that hasn't been integrated yet, let me parse those trees and expand it to a full index before I continue, and then at the end I'll write a sparse index back out instead of a full one. What we've been going through is integrating the other commands: things like status, add, commit, and checkout are all integrated, and we have more on the way, like merge, cherry-pick, and rebase. Each of these needs its own special care to make it work, but it's unlocking this idea that once this is done, when you're in the Office monorepo working on a small slice of the repo, it's going to feel like a small repo. And that is going to feel awesome. I'm just so excited for developers to be able to experience that. We have a few more integrations we want to get in there, so that we can release it and feel confident that users are going to be happy. The catch is that expanding to a full index is more expensive than just reading the 180 megabytes from disk; if I already have the data in that format, reading it is faster than having to parse trees. So we want to make sure we have enough integrations that most scenarios users hit are a lot faster, and only a few that they use occasionally get a little slower. Once we have that, we can be confident that developers are going to be excited about the experience.

Utsav Shah: That sounds amazing. The index already has so many features, like the split index and the shared index — I still remember trying to open a Git index in an editor to make sense of it, and it just shows you it's a binary format. Do you think, at some point, if you had all the time and a team of a hundred people, you'd want to rewrite Git in a way that was aware of all these different features, layered so that the individual commands didn't have to think about these operations — where Git presented a view of the index rather than every command dealing with all of this individually?

Derek Stolee: The index, because it's a sorted list of files, and people want to do things like replace a few entries or scan them in a certain order, would benefit from being replaced by some sort of database — even just SQLite would be enough. People have brought that idea up, but this idea of a flat array of in-memory entries is so ingrained in the Git codebase that it's just not possible. Doing the work to layer an API on top that allows compatibility between the flat array and something like SQL just isn't feasible; it would disrupt users, it would probably never get done, and it would just cause bugs. So I don't think that's a realistic thing to do. But if we were to redesign it from scratch, and we weren't in a rush to get something out fast, we would be able to take that approach. For instance, with the sparse index, if I update one file, we still write the whole index; that's something I have to do, it's just smaller now. If I had something like a database, we could just replace that one entry, and that would be a better operation to do — but it's just not built for that right now.
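As a purely hypothetical sketch of the "index as a database" idea mentioned above — not anything Git actually does — here is what updating a single staged entry could look like if the index lived in SQLite. The table name and columns are invented for illustration.

```python
# Hypothetical only: Git does NOT store its index in SQLite. This sketches why
# a database-backed index could update one entry instead of rewriting the file.
import sqlite3

conn = sqlite3.connect("toy-index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS index_entries (
        path          TEXT PRIMARY KEY,
        object_id     TEXT NOT NULL,
        mtime_ns      INTEGER NOT NULL,
        skip_worktree INTEGER NOT NULL DEFAULT 0
    )
""")

def stage_file(path, object_id, mtime_ns):
    # Touches a single row; the flat-file index rewrites every entry instead.
    conn.execute(
        "INSERT INTO index_entries (path, object_id, mtime_ns) VALUES (?, ?, ?) "
        "ON CONFLICT(path) DO UPDATE SET "
        "object_id = excluded.object_id, mtime_ns = excluded.mtime_ns",
        (path, object_id, mtime_ns),
    )
    conn.commit()

stage_file("word/app.c", "0123abcd...", 1_700_000_000_000_000_000)
```

The point of the sketch is only the contrast Derek draws: a keyed store can replace one entry in place, whereas the flat on-disk array has to be rewritten end to end on every update.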
Utsav Shah: Okay. And if you could change one thing about Git's architecture — the code architecture — what would you change?

Derek Stolee: I think there are some areas where we could add some pluggability, which would be great. The code structure is flat — most of the files are just C files in the root directory — and it would be nice if it were componentized a little better, if we had API layers we could operate against. Then we could more easily swap out how refs are stored, or swap out how objects are stored, and be less coupled across the built-ins and everything else. But I think the Git project is extremely successful for its rather humble beginnings. It started as Linus Torvalds creating a version control system for the Linux kernel in a couple of weekends, or however long a break he took to do that. Then people just got excited about it and started contributing, and you can tell, looking at the commit messages from 2005 and 2006, that this was the Wild West: people were fast to replace code and build new things. It didn't take very long — definitely by 2010 or 2011 — for the Git codebase to become much more solid in its form and composition, and the expectations on contributors to write good commit messages and make small changes here and there were already established a decade ago. So Git is solid, very mature software at this point, and making big, drastic changes is hard to do. But I'm not going to fault it for that at all; it's good to operate slowly and methodically when you're building and improving something that's used by millions of people. You've just got to treat it with the respect and care it deserves.

Utsav Shah: If you think about software today, you run into bugs in so many different things, but Git is something that pretty much all developers use, and you don't even think of Git as having bugs. You think, okay, I messed up using Git — you don't think that Git itself did something wrong. If it turned out that Git had all sorts of bugs that people ran into, I don't even know what that experience would be like; they'd just get frustrated and stop programming or something. But yes, thank you for being a guest. I think I learned a lot on this show, I hope listeners appreciate that as well, and thank you for being a guest.

Derek Stolee: Thank you so much, it was great to have these chats. I'm always happy to talk about Git, especially at scale; it's been the thing I've been focusing on for the last five years, and I'm happy to share the love.

Utsav Shah: I might ask you for another episode in a year or so, once sparse indexes are out.

Derek Stolee: Excellent. Yeah, I'm sure we'll have lots of new features and directions to talk about.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Discussion (0)
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me today is Derek Stolle, who is a Principal software engineer at GitHub. Previously, he was a principal software engineer
at Microsoft, and he has a PhD in mathematics and computer science from the University of Nebraska.
Welcome. Thanks. Happy to be here. Yeah. So a lot of work that you do on Git,
from my understanding, it's like similar to the work you did in your PhD around like graph theory
and stuff. So maybe you can just walk through the initial like what got you interested in graphs and math
in general yeah I you know my love of graph theory really came from my first algorithms class
in in college my sophomore year just doing simple things like pathfinding algorithms
and I got so excited about I, I actually started clicking it on
Wikipedia constantly. I just read every single article I could find on graph theory. So I learned
about the four color theorem, and I learned about different things like cliques and all sorts of
different graphs, the Peterson graph. And I just kept on discovering it more and more. I thought,
this is really interesting to me. It really works well with the way my brain works.
And I could just really model these things well in my head. And as I kept on doing more,
for instance, I took graph theory and combinatorics my junior year for my math major. And I was like,
you know what? I really want to pursue this. Instead of going into software like I had planned
with my undergraduate degree, I decided to pursue a PhD
in first math. Then I split over to the joint math and CS program and just really worked on
very theoretical math problems. But I also would always pair it with the fact that I had this
programming background and algorithmic background. So I was solving pure math problems using
programming and creating
these computational experiments. The thing I called it was computational combinatorics,
because I would write these algorithms to help me solve these problems that
were hard to reason about because the cases just became too complicated to hold in your head.
But if you could really quickly write a program to then, you know, over the course of maybe a day of computation, discover lots of small examples that can either answer it for you or even just give you more intuitive understanding of the problem you're trying to solve.
And that was really my specialty as I was working in academia.
Yeah, you hear a lot about proofs that are just computer assisted today.
And maybe you could just walk us through like i'm
guessing like listeners are not math experts so like why is that becoming a thing and maybe even
just walk through your thesis right like in super layman terms like what did you do right uh you
know there's two different there's very different ways what you can mean when you say i have an
automated proofs there's some things like coke which are completely automated formal logic proofs, which is I specify all the different axioms and the different things I know to be true and the statement I want to prove.
And it constructs a sequence of proof steps. For instance, do graphs with certain substructures exist? And trying to discover those examples using an algorithm that was finely tuned to solve those sorts of things.
So one problem was called uniquely KR-saturated graphs. A KR was essentially a set of R vertices where every single pair was adjacent to each other. And to be saturated means I don't
have one inside my graph, but if I add any missing edge, I'll get one. And then the uniquely part was
I'll get exactly one. And now we're at this really, really fine line of do these things even exist?
And can I find some interesting examples? And so you can just do, please generate every graph
of a certain size, but that blows up in size.
And so you end up where you can get maybe to 12 vertices, every graph of up to 12 vertices,
you can just enumerate and test. But in order to get beyond that and find the really interesting
examples, you have to be zooming in on the search space to really focus on the examples you're
looking for. And so I generated an algorithm that said, well, how about, you know, I know to start, I know I'm not going to have every edge. So let's
fix one pair and say, this isn't an edge. And then we find R minus two other vertices and put all the
other edges in. And that's the one unique completion of that missing edge. And then let's
continue building in that way by building up all the possible ways you can create those substructures because you know they need to exist as opposed to just generating random little bits.
And that focused the search space enough that we could get to 20 or 21 vertices and see these really interesting shapes show up.
And from those examples, we found some infinite families and then used, you know, regular old school math to prove that these families were infinite.
Once we had those small examples to start from.
No, that makes a lot of sense.
And tell me a little bit about, you know, how might someone use this in like a computer science way?
Like when would I need to use this in, let's say, not my day job,
but just what kind of computer science problems would I solve given something like that?
Oh, boy. It's always asking a mathematician what the applications of the theoretical work are.
But I find whenever you see yourself dealing with a finite problem and you want to know
what different ways can this data
appear? Is it possible with some constraints? So a lot of things I was running into were similar
problems to things like integer programming. Trying to find solutions of an integer program
is a very general thing. And having those types of tools in your back pocket to solve these problems
is extremely beneficial. And also knowing, you know, integer programming is still NP hard.
So you're, you know, if you have the right data shape, it will take exponential amount
of time to work, even though there's a lot of tools to solve most cases when your data
looks really, isn't really particularly structured to have that exponential blow up.
So knowing where those data shapes can arise and how to kind of maybe take a different
approach can be beneficial.
Okay.
And you've had a fairly diverse career after this.
I'm curious, what was the difference or what was the transition from doing this stuff to
Git or developer tools?
How did that end up happening?
I was lucky enough that after my PhD was complete,
I actually landed a tenure-track job
in a math and computer science department
where I was teaching and doing research at a high level
and working with really great people.
I had the best possible combinatorics group I could ask for,
doing really interesting stuff, working with graduate students.
And I found myself not finding the time to do the work I was doing as a graduate student.
I wasn't finding time to do the programming and do these really deep projects I wanted.
I had a lot of interesting math projects that I was collaborating on with a lot of people.
I was doing a lot of teaching, but I was finding that the only time I could find to do that
was in evenings and weekends,
because that's when other people weren't working who could collaborate with me on their projects
and move those projects forward.
And then I had a child.
And suddenly, my evenings and weekends aren't available for that anymore.
And so the individual things I was doing just for myself and that were more know, that were more programming oriented fell by the wayside.
And I found myself a lot less happy with that career.
And so I decided,
you know,
there's two approaches I could take here.
One is I could spend the next year or two winding down my collaborations and
spinning up more of this time to be working on my own during regular work
hours,
or I could find another job.
And I was going to set out, but it
turns out my spouse is also an academic. And she had an opportunity to move to a new institution.
And that happened to be really soon after I had made this decision. And so I said, okay, great.
Let's not do the two body problem anymore. you take this job and we'll move like we
move right in between semesters like during the over christmas break and i said i will find my
own job i will go and i will try to find a programming job hopefully someone will be
interested and um i lucked out that you know microsoft has an office here in Raleigh, North Carolina, where we now live.
And they happen to be the place where what is now known as Azure DevOps was being built.
And they needed someone to help solve some graph theory problems in the Git space. So it was a
really, really nice that it happened to work out that way. So, and I know for a fact that they took a chance on me
because of their particular need.
I didn't have significant professional experience
in the industry.
I just said, hey, I did a lot of academics,
so I'm kind of smart.
And I did programming as part of my job,
but it was always by myself.
So I came with it with a lot of humility saying,
I know I'm going to need to learn to work with the team
in a professional setting.
I did teamwork with undergrad, but it's been a while.
So I'm going to just come in here and try to learn as much as I can
as quickly as I can and contribute in this very specific area
you really want me to go into.
And it turns out that area they needed was to revamp the way
Azure Repo is computed, get commit history, which is definitely a graph theory problem.
The thing that was interesting about that is the previous solution is that they did everything in SQL. And they'd actually, when you created a new commit, it would say, what is your parent?
Let me take its commit history out of SQL and then add this new commit and then put that back into SQL.
And it took essentially a SQL table of commit IDs and squashed it into a var binary max column of this table, which ended up growing quadratically. And also, if you had a merge commit, it would have to take both parents and merge them in
an interesting way, in a way that never really matched what Git log was saying.
And so it was technically really interesting that they were able to do this at all with
SQL before I came by.
But really what needed was we need to have the graph data structure available.
We need to dynamically compute by walking commits and finding out how
these things work, which led to creating a serialized commit graph, which had that
topological relationships encoded in a concise data, yeah, into data that was like a data file
that would be read into memory. And very quickly we could operate on it and do things like
topologically sort it very quickly.
And we could do interesting file history operations on that instead of the database.
And by deleting these database entries that were growing quadratically, we saved something like 83 gigabytes just on the one server that was hosting the Azure DevOps code.
And so it was really great to see that come into fruition.
Yeah. First of all all that's such an
inspiring story that you could get into this and then they give you a chance as well like did you
reach out to like a manager did you apply online like i'm just curious how that ended up working
yeah i you know i need to i do need to say i i have a lot of luck and privilege going into this
because uh i i applied and waited a month and didn't hear anything, right? I had
applied to the same group and said, you know what? I'm here. Here's my cover letter. Heard nothing.
But then I have a friend who was from undergrad, who was one of the first people I knew to work
at Microsoft. And I knew he worked at Visual Studio on the Visual Studio Client Editor.
And I said, well, this thing that's now Azure DevOps was called Visual Studio Online at the time.
It's like, do you know anybody
from this Visual Studio Online group?
I've applied there, haven't heard anything.
I'd love if you could get my resume on the top list.
And it turns out that he had worked with somebody
who had done the Git integration in Visual Studio
who happened to be located at this office
who then got my name on the top of the pile.
And then that got me to the point
where I was having a conversation with who would be my skip level manager. And honestly had a
conversation with me to try to suss out, am I going to be a good team player? There's not a
really good history of PhDs working well with engineers, probably because they just want to
do their own academic work and work in their own space. I remember one particular question. He was like, you know, sometimes we ship software.
And before we do that, we all get together and everyone spends an entire day trying to
find bugs.
And then we spend a couple of weeks trying to fix them.
We call it a bug bash.
Is that something you're interested in doing?
Absolutely.
I 100% wanted to be a good citizen, good team member.
I am up for that. That's what it takes to be a good citizen, good team member. I am up for that.
That's what it takes to be a good software engineer.
I will do it.
And so there was definitely, I could sense the hesitation and the trepidation about looking at me more closely.
But it was overall, once I got into the interview, they were still doing Blackboard interviews
at that time.
And I felt unfair because my phone screen interview was a problem I had assigned my C programming students as homework.
So it's kind of like, are you sure you want to ask me this?
I have a little bit of experience doing problems like this, but okay.
So I really was eager to show up and prove myself.
I know I made some very junior mistakes at the beginning.
Just, you know, what's it like to work on a team?
What's it like to check in a change, you know,
and commit that pull request at 5 p.m.
and then go and get in your car and go home
and realize when you're out there
that you actually had a problem
and you caused the build to go red.
Oh no, don't do that.
So I had those kinds of mistakes,
but I only needed to learn them once.
And then, yeah.
Yeah, that's amazing.
And going to your second point around Git commit history and storing all of that in SQL.
Eons ago, we had to deal with an extremely similar problem because we maintained a custom CI server.
And we tried doing Git branch dash dash contains and try to implement that on
our own. And that's that did not turn out well. So maybe you can walk through like listeners
through like, why is that so tricky? Like, why is it so tricky to say, is this commit before
another commit? Is it after another commit? What's the parent of this commit? Like what,
what's going on, I guess? Yeah, the thing to keep in mind is that each commit
has a list of a parent
or multiple parents
in the case of a merge.
And that just tells you
what happened immediately before this.
But if you have to go back
weeks or months of time,
you're going to be traversing
hundreds or thousands of commits.
And these merge commits are branching.
So not only are you going deep in time
in terms of if you just think about the first parent history
is all the pull requests that have merged in that time.
But imagine that you're also traversing all of the commits
that were in the topic branches of those merges.
And so you go both deep and wide when you're doing this search.
And by default, Git is storing all of these commits as just plain
text objects in their object database. You look it up by its commit SHA, and then you go find that
location in a pack file, you decompress it, you go parse the text file to find out the different
information about, okay, what's its author date, committer date, what are its parents, and then go find them again and keep iterating through that. And it's a very
expensive operation on these orders of commits. And especially when it says the answer is no,
it's not reachable. You have to walk every single possible commit that is reachable before you can
say no. And both of those things cause significant delays in trying to answer these questions.
That was part of the reason for the commit graph file, which first, again, it was started by when
I was doing Azure DevOps server work, but it's now something that's actually a get client feature.
And first, it avoids that going through to the pack file and loading this plain text document
you have to decompress
and parse by just saying, I've got a really well-structured information that tells me where
in the commit graph file is the next one. So I don't have to store the whole object ID. I just
have a little four byte integer. My parent is this one in this table of data. And you can jump
really quickly between them.
And then the other benefit is we can store extra data that's not actually native to the commit object itself.
And specifically, this is called generation number.
The generation number is saying,
if I don't have any parents, my generation number is one.
So I'm just, I'm at level one.
But if I have parents,
I'm going to have a one larger number
than the maximum of those parents. So if I just, my parent, I have one parent and it's one, I'm
now two and then three. If I merge and I've got four and five, I'm going to be six. And what that
allows me to do is that if I see two commits and one is generation number 10 and one is 11,
then the one with generation number 10 can't reach the one with 11 because
that would go in the wrong.
That means an edge would go in the wrong direction.
It also means that if I'm looking for the one with 11 and I started at 20, I can stop
when I hit commits that hit are at 10.
So this gives us extra ways of visiting fewer commits in order to solve these questions.
So maybe a basic question, why does a system care about what the parents of a commit are?
Why does that end up mattering so much?
Oh, yeah, it matters for lots of reasons.
One is, if you just want to go through the history of what changes have happened in my
repository, and specifically file history, the way to find, the way to get them in order
is not just to say,
give me all the commits that changed
and then we sort them by date,
because the commit date
can actually be
completely manufactured.
And maybe something
that was committed later
was actually merged earlier
than something else.
And so by understanding
those relationships
of where the parents are,
you can realize, oh, while this thing was committed earlier, it landed in the default branch later.
And I can see that by the way that the commits are structured through these parent relationships.
And a lot of problems we see with people saying, where did my change go?
Or what happened here? It's because somebody did a weird merge and you can only find it out by doing some really interesting things with Git log
to say, hey, this merge caused a problem
and actually caused your file history to get mixed up.
And somebody resolved the merging correctly
to cause this problem
where somebody's change got erased.
And you need to use these types of relationships
to discover that.
Should everybody just be using rebase versus merge?
What's your opinion on that?
My opinion is that you should absolutely use rebase
to make sure that your commits
that you are trying to get reviewed by your coworkers
is as really clear as possible, right?
Present a story.
Tell me that your commits are really good.
Like tell me in the commit messages why you're trying to do this one small change and how the sequence of commits creates
a really beautiful story that tells me how I get from point A to point B. And then you merge it
into your branch with everyone else's. And then those commits are locked. You can't change them
anymore. Do not rebase them. Do not edit them. Now they're locked in. And the benefit of doing that is, well, I can present this best story that not only
is good for the people who are reviewing it in the moment, but also when I go back in
history and say, hey, why did I change it that way?
You've got all the reasoning right there.
But then also you can do things like go down, do get log dash dash first parent to just
show me which pull requests are merged against this branch.
And that's it.
You don't,
I don't see people's individual commits.
I see this one was merged.
This one was merged.
This one was merged.
And I can see the sequence of those events.
And that's a valuable thing to see.
Interesting.
Yeah.
And then a lot of get GitHub workflows,
just squash all of your commits into one,
which I think is the default,
or at least a lot of people use that.
Any opinions on that?
Because I know the Git workflow for development on Git
does the whole separate by commits and then merge all of them together.
Do you have an opinion just on that?
Squash merges can be beneficial.
The thing to keep in mind is that it's typically beneficial
for people who don't know
how to do interactive rebase. So their topic branch looks like a lot of random commits that
don't make a lot of sense. And they're just kind of, well, I tried this and then I had a break.
So I fixed a bug and I kept on going forward. I'm responding to feedback and that's what it
looks like. That's if those commits aren't going to be helpful to you in the future to diagnose
what's going on. And you'd rather just say this pull request is the
unit of change, then squash merge is fine. It's fine to do that. The thing I find out that is
problematic is that new users also then don't realize that they need to change their branch
to be based on that squash merge before they continue working. Otherwise, they'll bring in
those commits again and their pull request will look very strange. So there is some unnatural bits to using squash merge that requires people to like, let me just start over from the main branch again, to do my next work. And if you don't remember to your story, so you started working on improving Git interactions
in Azure DevOps. When did the whole idea of let's move the Windows repository to Git begin?
And how did that evolve? Well, the biggest thing is that the Windows repository moving to Git
was decided before I came. It was definitely a big project by Brian Harry,
who was the CVP of Azure DevOps at the time.
And it was also motivated by things like
Windows was using this source control system
called Source Depot,
which was a literal fork of Perforce.
And no one knew how to use it
until they got there and learned on the job. And that caused some friction in terms of, well,
onboarding people is difficult. But also, if you have people working in the Windows code base for
a long time, they learn this version control system. They don't know Git. They don't know
what everyone else is using. And so they're feeling like they're
falling behind and they're not speaking the same language as when they talk to somebody else who's
working in the version control that most people are using these days so they saw this as a way to
you know not only update their the way their source control works to a more modern tool but
but specifically get because it allowed
this more free exchange of ideas
and understanding it's going to be a monorepo,
it's going to be big,
it's going to have some little tweaks here and there,
but at the end of the day,
you're just running Git commands.
And you can go look at Stack Overflow
about how to solve your Git questions
as opposed to needing to talk to specific people
within the Windows organization
and how to use this tool.
So that, as far as I understand,
was a big part of the motivation to get it working.
When I joined the team,
we were in the swing of,
let's make sure that our Git implementation scales.
And the thing that's special about Azure DevOps
is that it doesn't use the core Git code base.
It has a complete reimplementation
of the service side of Git in C Sharp.
So it was rebuilding a lot of things
to just be able to do the core features,
but in its own way that worked in its deployment environment.
And it had done a pretty good job of handling scale,
but the issue is that the Linux repo
was still like a challenge to host.
At that time, it had half a million commits,
maybe 700,000 commits.
And its actual site number of files is rather small.
But we were struggling,
especially with the commit history being so deep to do that.
But also even with the Azure DevOps repo
with maybe 200 or 300 engineers working on it in their daily work, that was moving at a pace that was difficult to keep up with.
So those scale targets were kind of things we were on a daily basis dealing with and handling and working to improve.
And we could see that improvement in our daily lives as we were moving forward.
So how do you tackle the problem, right? Like you're on this team now and you know that,
okay, we want to improve the scale of this because like 2000 developers are going to be
using this repository. We have two or 300 people now, and it's already not like perfect.
My first impression is like you sit and you start profiling code and you understand what's
going wrong. What did you all do? Right. You're absolutely right about the profiler. We had
a tool, I forget what
it's called, but it would run
on every 10th request
selected at random, it would actually run
a.NET profiler.
And it would save those traces
into a place where we could download them.
And so we could say, hey, you know what?
Git commit history is kind of slow.
And now that we've written it in C Sharp as opposed to SQL, it's actually the C Sharp fault.
Let's go see what's going on there and see if we can identify what are the hotspots.
We go pull a few of those traces down and see what's identified.
And a lot of it was kind of chasing that, like, oh, I made this change.
Let's make sure that the timings are improvement.
Oh, I see some outliers over here.
They're still problematic.
Let me find some of those traces and be able to go and identify that the core parts to change some of them are
more philosophical like we need to change data structures we need to introduce things like
generation numbers um we need to introduce things like bloom filters for filed history in order to
speed that up because we're spending too much time parsing commits and trees.
And once we get to the idea that once we're that far, it was time to essentially say,
let's assess whether or not we can handle the Windows repo.
And I think it would have been January, February 2017, my team was tasked with doing scale testing in production
they had the
full Azure DevOps server ready to go
that had the Windows source code
in it, didn't have developers using it but it was a
copy of the Windows source code
but they were using that same
server for work item tracking
they had already transitioned their work item tracking to using
Azure boards.
And they said, go and see if you can make this fall over
in production.
It's the only way to tell if it's going to work or not.
And so a few of us got together
and we created a bunch of things to like use the REST API.
And we were pretty confident
that the Git operations are going to work
because we had a caching layer in front of the server
that was going to avoid that.
And so we just kind of went through the idea
of like, let's have through the REST API,
make a few changes and create a pull request and merge it.
Go through that cycle.
And we started by measuring
how often developers would do that,
for instance, in the Azure DevOps,
and then kind of scale it up and see where we're going.
And we ended up actually, yeah,
we crashed the job agents because we found a
bottleneck turns out that we were using libgit2 to do merges and that required going into native code
because it's a c library and we couldn't have too many of those running because they each took like
a gig of memory and so once these native code was running out things were crashing and so we ended
up having to put a limit on how that, but it was like, that was the only
fallout.
And we could then say, you know what, we're ready.
Like bring it on, start transitioning people over.
And when users are actually in the product and they think certain things are rough or
difficult, we can address them.
But right now they're not going to cause a server problem.
So let's bring it on.
And so I think it was that few months later that they started bringing in developers from
Source Depot into Git.
So it sounds like there was some server work to make sure that the server doesn't crash,
but a majority of work that you had to focus on was actually Git client side.
Does that sound accurate?
Right.
Before my time and parallel with my time was the creation of what's now called VFS
for Git. It was GVFS at the time.
Realized that don't let engineers name
things. They won't do it right.
So we've renamed it to VFS for Git.
It's a virtual file system for Git.
A lot of this is motivated because
the source depot version that
Windows was using actually had a virtualized file system in it to allow people to only download a portion of the working tree that they needed.
And they could build whatever part they were in, and it would dynamically discover what files do you actually need to run that build. And so we did the same thing on the Git side, which was let's make the Git client,
let's modify it in some slight ways using our own fork of Git
to think that all the files are there.
And then when a file is actually,
we look through it through a file system event,
it communicates to this.NET process that says,
oh, you actually want that file?
Let me go download it from the Git server,
put it on disk and tell you what its contents are,
and now you can place it. And so it was kind of dynamically downloading objects.
This required a version, a protocol that we call the GVFS protocol, which is essentially an early
version of what's now called Git partial clone to say, oh, you can go get the commits and trees.
That's what you need to be able to do most of your work. But when you need the file contents, the blob of a file, we can download that as necessary and populate it on
your disk. The thing that's really different is that virtualized thing, the idea that if you just
run ls at the root directory, it looks like all the files are there. And that causes some problems
if you're not used to it.
Like for instance, if you open VS Code
in the root of your Windows source code,
it will populate everything
because VS Code starts crawling
and trying to figure out,
I want to do searching and indexing
and I want to actually find out what's there.
So that, you know,
but Windows users were used to this,
the Windows developers,
they had this already as a problem.
So they were used to using tools that didn't do that.
But we found that out when we started saying,
hey, VFS forget is this thing that Windows is using.
Maybe you could use it too.
And they're like, well, this was working great.
Then I opened VS Code or I ran grep
or some other tool came in and decided to scan everything.
And now I'm slow again because I have absolutely every file
in my monorepo in my working directory
for real.
So that led to some
concerns that that wasn't really
necessarily the best way to go.
But it did, specifically with that
GFS protocol, it solved a lot of the scale
issues because we could stick another
layer of servers that
were closely located to the
developers. Like, for instance, you got a lab of build machines.
Let's take one of these cache servers in there.
So the build machines all fetch from there.
And there you have really quick throughput, really small latency, and they don't have
to bug the origin server for anything but the refs.
And you can do the same thing around the developers.
So that's solved a lot of our scale problems because we don't have these thundering herds
of machines coming in
and asking for all the data all at once.
Yeah.
We have a super similar concept of
repository mirrors that would be
listening to some change stream
every time anything changed on an origin.
It would run Git pull
and then all the build servers. So it's remarkable
how similar the kind of problems
that we're thinking about.
Maybe one thing that I was thinking about over there was,
so VFS for Git makes sense.
So how come, what's the origin of the FS monitor story?
So for listeners, FS monitor is the file system monitor in Git
that decides whether files have changed or not
without running syscalls that list every single file.
How did that come about?
Right. There's kind of two sides of the story.
One is that as we're building all these features custom for VFS for Git,
we're doing it inside the Microsoft slash Git fork on GitHub.
It's working in the open, so you can see all the changes we're making.
It's all GPL, but we're making changes in ways that are going really fast and we're
not really contributing to upstream Git, the core Git feature.
And people who are doing, because of the way VFS for Git works, we have this process that's
always running that is watching the file system and getting all of its events.
It made sense to say, well, we can speed up certain Git operations because we don't need
to go looking for things.
In fact, we don't want to run a bunch of LStats because that will trigger the download of objects.
So we need to refer to that process to tell me what files have updated, what's new.
And I created the idea of what's now called FS Monitor.
And people who had built that tool for VFS for Git contributed a version of it upstream that used Facebook's Watchmen tool and threw a hook.
So it created this hook called the FSMonitorHook.
It would say, hey, tell me what's been updated
since the last time I checked.
The Watchmen or whatever tools on the other side would say,
oh, here's the small list of files that have been modified.
You don't have to go walking all of the hundreds of thousands of files
because you just changed these dozen.
And the git command could store that
and be really fast to do things like git status
or git add.
So that was something that was contributed
just mostly out of the goodness of their heart.
Like, hey, we want to have this idea.
This worked really well in BFS for git.
We think it could be working well
for all the people in regular git.
Here we go in contributing and getting it in.
It became much more important to us in particular
when we started supporting the Office Mono repo
because they had a similar situation
when they were moving from their version of Source Depot into Git,
and they thought VFS for Git is just going to work.
The issue is that Office also has
tools that they build for
iOS and macOS. So they have
developers who are on macOS.
And the team
had just started by building a similar
file system
virtualization for macOS using
kernel extensions and
was very far along in the
process when Apple said, you know what?
We're deprecating kernel extensions.
You can't do that anymore.
If you're someone like Dropbox, go use this thing.
If you're someone, if you use this other thing, and we tried both of those things and none
of them work in this scenario.
They're either too slow or they're not consistent enough.
For instance, like if you're in Dropbox and you say, oh, I'm going to populate my files
dynamically as people ask for them,
the way that Dropbox and
OneNote or OneDrive
now do that,
the operating system can just say, you know what, I'm going to delete
this content because the disk is getting too
big. You don't need it because you can just get it from
the remote again. That
kind of inconsistency was something we couldn't
handle because we really needed to know that that content, inconsistency was something we couldn't handle because we really
needed to know that that content once downloaded was actually there.
And so it was like we were kind of at the crossroads of not knowing where to go.
But then we decided, like, let's do an alternative approach.
Let's look at what the Office monorepo is different from the Windows monorepo.
And it turns out that they had a very componentized build system
where if you wanted to build Word, you knew what you needed to build Word.
You didn't need the Excel code.
You didn't need the PowerPoint code.
You needed the Word code and some common bits
for all the clients of Microsoft Office.
And this was really, really ingrained in their project system.
And it's like, well, if you know that in advance, could you just tell Git, these are the files I need to do my work and to do my build.
And in fact, that's what they were doing in their version of Source Depot. They weren't using
a virtualized file system in their version of Source Depot. They were just enlisting in,
these are the projects I care about. So in fact, when some of them were moving to Git with VFS
for Git, they were confused. Like, why do I see so many directories? I don't need them.
So what we did is we decided to make a new way of taking all the good bits from VFS for Git,
like the GVFS protocol that allowed us to do the reduced downloads,
but instead of a virtualized file system to use sparse checkout as a Git feature.
And that allows us, you can say, tell Git, hey, only give me within these directories,
the files and ignore everything outside. And that gives us the same benefits of working in a smaller
working directory than the whole thing without needing to have this virtualized file system.
But now we really need that file system monitor hook that we added earlier, because if I still
have 200,000 files in my disk and I edit a dozen,
I don't want to walk all 200,000 to find those dozen. And so the file system monitor became
top of mind for us. And particularly because we really want to support Windows developers
and on Windows process creation is expensive, especially compared to Linux. Linux process
creation is super fast. So having a hook you run that then does some shell script stuff to communicate to another
process and then come back, just that process, even if it didn't said nothing like, hey,
you don't have to do anything. That was expensive enough to say we should remove the hook from this
equation. And also there's some things that Watchman does that we don't like and
aren't specific enough to get.
Let's make a version of the file system monitor that is in entrenched to
get.
And that's what my,
my colleague Jeff Hostetler is working on right now and getting reviewed
in the core get client right now.
It is available on get for windows.
If you want to try it because the get for windows maintainer is also on
my team.
And so we were able to get an early version in there,
but we really want to make sure this is available
to all Git users.
There's an implementation for Windows and for macOS,
and it's possible to build one for Linux.
We just haven't included it in this first version.
And that's kind of our target,
is to really remove that overhead.
I know that you at Dropbox had a blog post
where you had a huge speed up
just by replacing the Perl script hook
with a Rust hook.
Is that correct?
With the Go hook.
Go hook.
Yeah, but eventually we replaced it
with the Rust one.
Excellent.
And also you did some contributions
to get, I saw,
to help make this hook system
a little bit better
and not and have fewer bugs i think just yeah one or two bugs and it took me a few months of like
digging and figuring out what exactly is going wrong it turned out that there's this one environment
variable which you added to skip process creation so we just had to make sure like get force untracked
cache or something i think you or somebody else added it and we just forced that
environment variable to be true to
make sure we actually cache every time
you run git status. So
subsequent git statuses are not
slow and things worked out great.
So we just ended up shipping a wrapper that turned
out an environment variable and things
worked amazingly well.
So yeah, but that was
so long ago.
Yeah.
How long does process creation take on Windows?
I guess that's one question that I had for you since a while.
Like, why did we skip writing that cache?
Like, do you know what's so slow about creating processes on Windows?
Well, I know that there's a bunch of permission things that windows does it
has much back holes about can you create a process of this kind and what elevation privileges do you
exactly have and there's a lot of things like there that build up because windows is very much
about maintaining backwards compatibility with a lot of these security sorts of things so i don't
know all the details i do know that it's something around the order of
100 milliseconds. So it's not something to scoff at. And it's also another thing that
Git for Windows in particular has difficulty because it has to do a bunch of translation
layers in order to take this tool that was built for a Unix environment and has dependencies on
things like Shell and Python,
Pearl,
and how to make it sure that it can work in that environment.
That is an extra cost that get windows needs to pay over even a normal
windows process.
Yeah.
That makes a lot of sense.
And maybe some numbers on,
I don't know how much you can share,
like how big was the windows monorepo and the office monorepo when you
decided to move from Source Depot to Git?
What are we talking about here?
Yeah, the biggest numbers we think about is how many files do I have?
If I didn't do anything and I just checked out the default branch it had
and I said, how many files are there?
And I believe the Windows repository is somewhere around 3 million.
And that uncompressed data was something like 300 gigabytes.
Those 3 million files taking up that long.
I don't know what the full size is for the Office Monitor repo,
but it is definitely 2 million files it had.
So definitely really, really large projects.
They did their homework in terms of removing large binaries from the repository, so they're not big because of that.
It's not like Git LFS isn't going to be the solution for them.
They have mostly source code and small files.
That's not the reason for their growth.
The reason for their growth is they have so many files
and they have so many developers moving that code around
and adding commits and collaborating
that it's just going to get big no matter what you do.
And at one point, Windows, the Windows monorepo
had 110 million Git objects.
And I think over 12 million of those were commits.
Partly because they had some build machinery
that would actually commit 40 times
during the course of its build.
So they reined that in, and we decided to do a history cut and start from scratch.
And now it's moving a lot.
It's not moving nearly as quickly, but it's still a very similar size.
So they've got more runway.
Yeah, maybe just for comparison to listeners, like the numbers I remember in 2018, the highest,
the biggest repositories that were
open source that had people contributing to git for were chromium i remember chromium being roughly
like 300 000 files and then and there were like a couple of chromium engineers contributing to
git for performance so this is just one order of magnitude but bigger than that like three million
files i don't think there's a lot of people
moving those large like such a large repository around especially with the kind of history with
like 12 million objects that's just a lot um what was the reaction i guess of the open source
community you know like the maintainers of getting stuff when you decided to help out like did you
have a conversation to start off were they just super excited when you reached out on the mailing list?
What happened?
Yeah, so for full context,
I switched over to working on the client side
and contributing to Upstream Git
kind of after all of the DFS for Git was announced
and released as open source software.
And so I can only gauge what I saw from people afterwards and people I've become
to know since then.
But the, the general reaction was, yeah, it's great that you can do this, but I wish you
would actually contribute it to get like, you know, why, if you had contributed to this
get to get everyone would benefit.
And, you know, part of the things were, uh, the initial plan wasn't ever to open source
it or even to like, the goal was to make this work for Windows.
If that's the only group that ever uses it, that was a success.
And it turns out we can maybe try to say it because we can host the Windows source code.
We can handle your source code was kind of like a marketing point for Azure repos.
And that was a big push to put this out there and stay in the world. But to say like,
well, it also needs this custom thing that's only on Azure repos. And we created it with our own
opinions that wouldn't be up to snuff with the Git project. And so things like FS Monitor
and Partial Clone are direct contributions from Microsoft engineers at the time that were saying, here's a way to contribute the ideas that made VFS forget work to Git.
And that was an ongoing effort to try to bring that back.
But it kind of started after the fact, kind of, hey, we are going to contribute these ideas.
But at first, we needed
to ship something. So we shipped something without really working with the community. But I think that
over the last few years, especially with the way that we've shifted our stance within our strategy
to do sparse checkout things with the Office Mono repo, we've much more
been able to align with the things we want to build. We can build them for upstream Git first,
and then we can benefit from them. And then we don't have to build it twice. And then we don't
have to do something special. That's only for our internal teams that, again, once they learn that thing, it's different
from what everyone else is doing. And we have that same problem again. And so right now the things
that office is depending on are sparse checkout. Yeah. They're using the GBFS protocol, but to them,
you could just, you call it partial clone and it's going to be the same to, as from,
from their perspective. And in fact, the way we've integrated it for them is that we've gone underneath the partial clone machinery
from upstream Git and just taught it to do the GVFS protocol.
So we're much more aligned with,
because we know things are working for Office,
upstream Git is much more suited
to be able to handle this kind of scale.
That makes a ton of sense.
And given that it seems like the community wanted you to contribute
these features back and that's just so refreshing like you want to help out someone because I don't
know if you've heard of those stories where there were people trying to contribute to get like
Facebook has like this famous story of contributing trying to contribute to get a long time ago and
not being successful and choosing to go with Mercurial.
I'm happy to see that finally, like we could add all of these nice things to Git.
Yep.
And I should definitely give credit to, you know, the maintainer, Junio Hamano, and people
who are now my colleagues at GitHub, like Peff (Jeff King), and also other Git contributors
at companies like Google, who really took time out of their day to help us learn what it's like to be a Git contributor.
And not just open source in general, because merging pull requests on GitHub is a completely different thing than working on the Git mailing list and contributing patch sets via email.
So there was learning how to do that, and also the level of quality expected is so high. How
can we navigate that space as new contributors who have a lot of ideas and are really motivated to do
this good work? We really needed to get over the hump of getting into this community and establishing
ourselves as really good citizens who are trying to do the right thing.
Mm-hmm. Yeah. And maybe one more selfish question from my side.
One thing that I think Git could use is some kind of plugin system,
because today, if somebody checks PII into our repository,
into the main branch, from my understanding
it's extremely hard to get rid of that without doing a full rewrite.
So some kind of plugin for companies where they can rewrite stuff
or hide stuff on the servers.
Does Microsoft have something like that?
Does GitHub have something like that?
I'm not aware of anything on the GitHub or Microsoft side for that.
We generally try to avoid it by doing pre-receive hooks
or when you push,
we'll reject it for some reason
if we can.
Otherwise, it's on you
to clear up the data.
Part of that is because
we really want to make sure
that we are maintaining repositories
that are still valid,
that are not going to be missing objects.
I know that Google's source control tool, Gerrit,
has a way to obliterate these objects.
And I'm not exactly sure how it works when Git clients are fetching and cloning
and they say, oh, I don't have this object.
It'll complain, but I don't know how they get around that.
So with the distributed nature of Git,
it's really hard to say that the Git project should take on something like that, because it is absolutely centralizing things to such a degree that you have to say: yeah, you didn't send me all the objects you said you were going to, but I'll trust you anyway.
That trust boundary is something Git is really cautious about violating.
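As a rough illustration of the pre-receive hook approach he mentions, here is a minimal sketch that rejects a push when it touches files matching a blocked pattern; the patterns and the empty-tree handling are illustrative assumptions, not GitHub's or Microsoft's actual hook.
```sh
#!/bin/sh
# A pre-receive hook runs on the server and reads "old new ref" lines on stdin.
# Sketch only: the blocked patterns below are hypothetical examples.
zero="0000000000000000000000000000000000000000"
while read oldrev newrev refname; do
  # Skip branch deletions; for brand-new branches, diff against the empty tree.
  [ "$newrev" = "$zero" ] && continue
  if [ "$oldrev" = "$zero" ]; then
    oldrev=$(git hash-object -t tree /dev/null)
  fi
  for file in $(git diff --name-only "$oldrev" "$newrev"); do
    case "$file" in
      *.pem|*credentials*)
        echo "Rejected: '$file' looks like sensitive data." >&2
        exit 1
        ;;
    esac
  done
done
exit 0
```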
Yeah, that makes sense.
And now to the non-selfish questions.
Maybe you can walk listeners through why Git needs a Bloom filter internally?
Sure.
Yeah, so let's think about commit history.
And specifically when, say you're in a Java repo, a repo that
uses the Java programming language and your directory structure mimics your namespace,
right? So if you want to get to your actual code, you go down five directories before you
find your code file. Now in Git, that's represented as I have my commit, then I have my root tree,
which describes the root of my working directory. And then I go down for each of those directories,
I have another tree object, tree object, tree object, tree object.
And then finally my file.
And so when we want to do a history query, say, what commits have changed this file, I
go to my first commit and I say, let's compare it to its parent.
And I'm going to go to the root trees.
Well, they're different.
Okay.
They're different.
Let me open them up, find out which tree object they have at that first portion of the path and see if those are different. Oh, they're different.
Let me keep going. And you go all the way down these five things. You've opened up 10 trees in
this diff in order to parse these things. And if those trees are big, that's expensive to do.
And at the end, you might find out, oh, wait a minute. The blobs are actually identical way
down here, but I had to do all that work to find out. Now,
multiply that by a million commits: to find out that this file was changed only 10 times in the history of a million commits, you have to do a ton of work to parse all of those trees.
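You can see those nested tree objects for yourself with a couple of plumbing commands; the Java-style paths below are made-up examples.
```sh
# Each commit points to a root tree, and every directory level is another tree object.
git cat-file -p 'HEAD^{tree}'             # entries of the root tree
git ls-tree HEAD com/                      # the tree entry for the first path component
git ls-tree HEAD com/example/service/      # ...and so on, one tree per directory

# A file-history query is what forces Git to walk those trees commit by commit:
git log --oneline -- com/example/service/Main.java
```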
So the Bloom filters come in as a way to say: can we guarantee, in most cases, that these commits did not change that path? We really expect
that most commits did not change the path you're looking for. So what we do is we actually injected
it in the commit graph file, because that gives us a really quick way to index, oh, I'm at a commit
in a position in this commit graph file, I can understand where this Bloom filter data is. And the Bloom filter is storing which
paths were changed by that commit. And the Bloom filter is what's called a probabilistic data
structure. So it doesn't actually list those paths, which would be expensive, right? If I just
actually listed every single path that changed at every commit, I would have this sort of quadratic
growth again, and my data would be in the gigabytes, even for a really small repo. But with the Bloom filter, I only need 10 bits per path.
So it's really, really compact. The thing we sacrifice is that sometimes it says yes
to a path when the answer is really no. But the really critical thing is, if it says no,
you can be sure it's no. And its false positive rate is about 2% at the compression settings we're using.
So, you know, if I think about the history of my million commits, for 98% of them the
Bloom filter will say, no, it didn't change that path.
So I can immediately go to my next parent and I can say, this commit isn't important.
Let's move on.
Didn't parse any trees.
2% of them, I still have to go and parse them.
And definitely the 10 that actually changed it,
they will say yes.
So I'll parse them.
I'll get the right answer.
But we've significantly reduced
the amount of work we had to do to answer that query.
And it's really, really important
when you're in these big monorepos
because you have so many commits
that didn't touch the file.
You really need to be able to isolate them.
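If you want to try this yourself, the changed-path Bloom filters can be written into the commit-graph file locally; the options shown exist in recent Git releases, and the path is a made-up example.
```sh
# Write a commit-graph that includes changed-path Bloom filters.
git commit-graph write --reachable --changed-paths

# File-history queries can now skip most commits without parsing their trees.
git log --oneline -- com/example/service/Main.java
```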
At what point, or at what number of files in a repository? Because for the size of files, I think you mentioned you can just use LFS and that should solve
a lot of the problems.
But it's the number of files that's the problem, right?
At what number of files do I have to start thinking about, okay, I want to use these
Git features like sparse checkout
and commit graphs and stuff?
Have you noticed a tipping point like that?
Yeah, there's some tipping points,
but it's all about can you take advantage of the different features?
So to start, I can tell you that if you have a recent version of Git,
say from the last year or so,
you can go to whatever repository you want and run git maintenance start.
Just do that in every repository of even a moderate size.
And that's going to enable background maintenance.
So it's going to turn off auto GC because it's going to run maintenance on a regular schedule.
It'll do things like fetch for you in the background.
So that way, when you run git fetch, it just updates the refs and it's really fast.
But it does also keep your commit graph up to date.
Now, by default, that doesn't contain the Bloom filters, because the Bloom filters are an extra chunk of data.
And most clients don't need it, because you're not doing the really deep queries that you need to do at web scale, like the GitHub server does.
The GitHub server does generate those Bloom filters.
So when you do a file history query on GitHub, it's really fast.
But it does give you that commit-graph file.
So you can do things like git log --graph really fast;
the topological sort it has to do for that
can use the generation numbers
to be really, really quick,
as opposed to before.
For instance, it would take six seconds to do that
just to show 10 commits on the Linux repo
because I had to walk all of them.
So now you can get that for free.
So whatever size repo you have,
you can just run that command and you're good to go.
And it's the only time you have to think about it.
Run it once, and now your repository
is going to be good for a long time.
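Here is the command he mentions, plus a couple of the individual maintenance tasks it schedules, which you can also run on demand; exact behavior varies a bit by Git version, so treat this as a sketch.
```sh
# Enable background maintenance for the current repository.
git maintenance start

# The individual tasks can also be run on demand, for example:
git maintenance run --task=commit-graph   # keep the commit-graph file up to date
git maintenance run --task=prefetch       # fetch from remotes in the background
```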
The next level I would say is,
can I reduce the amount of data I download
during my clones and fetches?
And the partial clone is really good for this.
I prefer blobless clones.
So you go: git clone --filter=blob:none.
I know it's complicated, but it's what we have.
And it just says, okay, filter out all the blobs
and just give me the commits and trees
that are reachable from the refs.
And when I do a checkout or when I do a history query,
I'll download the blobs I need on demand.
So yeah, don't just get on a plane
and try to do checkouts and things
and expect it to work.
That's the one thing
you have to be understanding about.
But as long as you are,
you know, relatively frequently
having a network connection,
you can operate as if it's a normal Git repo.
And that can make your fetch times
and your clone times really, really fast
and your disk space a lot less.
So that's kind of like the next level
of boosting up your scale.
And it works a lot like LFS.
LFS says, I'm only going to pull down
these big LFS objects
when you actually do a checkout.
but it uses
a different mechanism to do that.
This is actually,
you've got your regular Git blobs in there.
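For reference, here is the blobless clone in written form; the repository URL and file name below are placeholders.
```sh
# Blobless clone: download commits and trees now, fetch blobs on demand.
git clone --filter=blob:none https://example.com/big-monorepo.git
cd big-monorepo

# Operations that need file contents (checkout, diff with patches, blame) download
# the missing blobs from the server as they run, so they need a network connection.
git blame README.md
```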
And then the next level is, okay, I am only getting the blobs I need,
but can I use even fewer?
And this is the idea of using sparse checkout
to scope your working directory down.
And I like to say that beyond 100,000 files
is where you can start thinking about using it.
I really start seeing Git start to chug along
when you get to 100,000, 200,000 files. So if you can at least max out at that level, preferably
less, but if you can max out at that level, that would be great. Sparse checkout is a way to do
that. The issue right now that we're seeing is you need to have a connection between your build
system and sparse checkout in order to say, hey, I work in this part of the code.
What files do I need?
Now, if that's relatively stable and you can identify, you know what?
All the web services are in this directory.
That's all I care about.
And all the client code is over there.
I don't need it.
Then a static git sparse-checkout will work.
You can just run git sparse-checkout set with whatever directories you need, and you're good to go. The issue is if you want to be really precise and say, oh, I'm only going to
get this one project I need, but then it depends on these other directories and those dependencies
might change and their dependencies might change. That's when you need that build that connection.
So Office has a tool they call Scoper that connects their project dependency system to
sparse checkout and will help them do that automatically. But if your dependencies are relatively stable,
you can manually run git sparse-checkout, and that's going to greatly reduce the size of your
working directory, which means Git's doing less when it runs status, when it runs checkout,
and that can really help out. That's a great incentive for developers to keep your code
clean and modular. So you're not checking out the world.
And eventually it's going to help you out in all these different ways.
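A static cone-mode sparse checkout like the one described here looks roughly like this; the directory names are made-up examples.
```sh
# Restrict the working directory to the directories you actually build.
git sparse-checkout init --cone
git sparse-checkout set services/web shared/utils

# Inspect or widen the cone later:
git sparse-checkout list
git sparse-checkout add services/billing
```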
And maybe for a final question here, what are you working on right now?
What should we be excited about in the next few versions of Git?
Yeah, I've been working on a project this whole calendar year, and I'm not going to
be done with it till the calendar year is done, called the sparse index.
So it's related to sparse checkout, but it's about dealing with the index file.
The index file is, you know, if you go into your Git repository, you go to .git/index.
That file is Git's copy of what it thinks should be at HEAD and also what it thinks is in your working directory.
So when it does a git status, it has walked all those files and said, oh,
this is the last time it was modified, or when I expect it was modified,
and any difference between the index and
what's actually in your working tree,
Git needs to do some work to sync them up.
And normally this is just really fast. It's not that big.
But when you have millions of files,
every single file at HEAD has an entry in the index. Even worse, if you have a sparse checkout,
even if you have only a hundred thousand of those two million files in your working directory, the index
itself still has two million entries in it; most of them are just marked with what's called the skip-worktree
bit that says, don't actually write this to disk. So for the Office monorepo, this file is 180 megabytes,
which means that every single git status needs to read 180 megabytes from disk.
And with the FS Monitor going on, it actually has to go rewrite it
to have the latest token from the FS Monitor.
So it has to rewrite it to disk.
This takes five seconds to run a
git status, even though it didn't say much; you just have to load this thing up and write it
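If you want to see what he's describing in your own repository, the index file and its entry count are easy to inspect.
```sh
# The index lives at .git/index; its size grows with the number of tracked files.
ls -lh .git/index

# One entry per tracked file, even for files excluded by a sparse checkout
# (those are just marked with the skip-worktree bit).
git ls-files | wc -l
```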
back down. So the sparse index says, well, because we're using sparse checkout in a specific way
called cone mode, which is directory-based, not file-based, it can say, well, once I get to a
certain directory, I know that none of its files inside
of it matter.
So let's store that directory and its tree object in the index instead.
So it's a kind of a placeholder to say, yeah, I could recover all the data and all the files
that would be in this directory by parsing trees, but I don't want it in my index.
There's no reason for that.
I'm not manipulating those files when I run a Git add.
I'm not manipulating them when I do Git commit.
And even if I do a Git checkout, I don't even care.
I want to replace that tree with whatever I'm checking out,
what it thinks the tree should be.
It doesn't matter for the work I'm doing.
And for a typical developer in the Office monorepo,
this reduces the index size to like 10 megabytes.
So it's a huge shrinking of the size and it's unlocking so much potential in terms of our performance, right?
Our git status times are now like 300 milliseconds on Windows.
On Linux and Mac, which are also platforms we support for the Office monorepo, it's even faster.
So that's what I'm working on.
The issue here is that there's a lot of things in Git
that care about the index.
And they explore the index as a flat array of entries.
And they're always expecting those to be file names.
So there's all these places around the Git code base
that actually need to be updated to say,
well, what happens if I have a directory here?
What's the thing I should do? And so all of the ideas of what the sparse index
format is have already been released in two versions of Git. And then there's also some
protections that say, well, if I have a sparse index on disk, but I'm in a command that hasn't
been integrated yet, let me parse those trees to expand it to a
full index before I continue. And then at the end, I'll write a sparse index instead of writing a
full index. And what we've been going through is like, let's actually integrate these other
commands. We've got things like status, add, commit, checkout. Those things are all integrated.
We've got more on the way, like merge, cherry pick, rebase. And these things all need different special care
in order to make them work.
But it's really unlocking this idea
that when you're in the Office monorepo
after this is done,
and you're working on a small slice of the repo,
it's going to feel like a small repo.
And that is going to really feel awesome.
I'm just so excited for developers
to be able to actually explore that.
We have a few more integrations we want to get in there
so that way we can release it
and feel really confident that users are going to be happy.
The issue being that expanding to a full index
is actually more expensive
than just reading the 180 megabytes from disk, right?
If I just already have it in the format,
it's faster than needing to parse it.
So we want to make sure that we have enough integrations that most scenarios users do are a lot faster.
And only a few that they use occasionally get a little slower.
And once we have that, we can be very confident that developers are going to be really excited about the experience.
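Once those integrations land, opting in looks roughly like this on a cone-mode sparse checkout; availability and defaults depend on your Git version, and the directory name is a made-up example, so treat this as a sketch.
```sh
# The sparse index builds on cone-mode sparse checkout.
git sparse-checkout init --cone
git sparse-checkout set services/web

# Opt in to writing a sparse index; in newer Git versions this can also be done
# with `git sparse-checkout init --sparse-index`.
git config index.sparse true
```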
That sounds amazing.
The index already has so many features, like the split index and the shared index. I still remember that Vim, I think, has special handling: it understands when you're trying to read a Git index and shows you the right format. Yeah, this is great. And do you think at some point, if you had all the time and a team of 100 people, you'd want to rewrite Git in a way that it was aware of all of these different features and layered so that all of these different commands did not have to think about these different operations, since they get a presented view of the index rather than having to deal with all of these things individually?
I think the index, because it's a list of files,
and it's a sorted list of files, and people want to do things like replace a few entries,
or scan them in a certain order, that it would benefit from being replaced by some sort of
database, like even just SQLite would be enough. And people have brought that idea up,
but because this idea of a flat array
of in-memory entries is so ingrained
in the Git code base,
that that's just not possible.
To do the work to layer an API on top
that allows compatibility
between the flat layout
and something like SQL,
it's just not feasible to do.
It would just disrupt users. You would probably never get done and just cause bugs.
So I don't think that that's a realistic thing to do, but I think if we were to redesign it from scratch and we weren't in a rush to get something out really fast, that we would be able to take
that approach. And for instance, even with sparse index, so I update one file, I have to rewrite the whole index. That is something I'll have to do. It's just that it's
smaller now. But if I had something like a database, we could just replace that entry in
the database. And that would be a better operation to do. But it's just not built for that right now.
Okay. And if you had one thing that you would change about Git's architecture,
like the code architecture, what would you change?
I think there are definitely some areas where we could do some pluggability,
which would be great.
The code structure is really flat.
Most of the files are just C files in the root directory.
And it'd be nice if they were componentized a little bit better,
if we had API layers that we could operate through
so we could do things like swap out
how refs are stored more easily or
swap out how the objects are stored
and really
be less coupled
to a lot of the things across
the built-ins and other things.
But I think the Git project
is extremely successful
for its rather humble beginnings, right?
It started as Linus Torvalds creating a version control system for the Linux kernel
in a couple of weekends, or however long he took a break to do that.
And then people just got really excited about it and started contributing to it.
And you can tell looking at the commit messages from 2005, 2006, that this was really the wild west.
People were just really fast in replacing code
and building new things.
And it didn't take very long,
definitely by 2010, 2011,
for the code base to get much more solid
in its form and composition.
And the expectations of contributors
to write really good commit messages
and to do really small changes here and there had already been established by that time, a decade ago.
So Git is really, really solid software at this point, and it's very mature.
So making these big drastic changes is really hard to do, but I'm not going to fault it for that at all. And it's good to be able to operate slowly and methodically, to build and improve something that's used by millions of people.
You just really got to treat it with the respect and care it deserves.
Yeah, with most software today, you run into bugs and so many different things.
But Git is probably the tool that almost all developers use the most.
And you don't even think of Git having bugs.
You think, okay, I messed up using Git.
You don't think, oh, Git did something interesting.
And if it turned out that Git had all sorts of bugs that people would run into, I don't
even know what their experience would be like.
They'd just get frustrated and stop programming or something.
But yeah, well, thank you for being a guest.
I think I learned a lot of stuff on this show.
I hope listeners appreciate that as well.
And thank you for being a guest.
Well, thank you so much.
It was great to have these chats.
I'm always happy to talk about Git,
especially at scale.
And it's really been a thing I've been focusing on
for the last five years.
And I'm happy to share the love.
Yeah, I might ask you for like another episode in a year or so once like sparse
indexes are out and people are starting to use them.
Excellent. Yeah.
I'm sure we'll have lots of new features to add and interesting directions.
Thank you.