Signals and Threads - Build systems with Andrey Mokhov
Episode Date: September 16, 2020. Most software engineers only think about their build system when it breaks; and yet, this often unloved piece of software forms the backbone of every serious project. This week, Ron has a conversation with Andrey Mokhov about build systems, from the venerable Make to Bazel and beyond. Andrey has a lot of experience in this field, including significant contributions to the replacement for the Glasgow Haskell Compiler’s Make-based system and Build Systems à la Carte, a paper that untangles the complex ecosystem of existing build systems. Ron and Andrey muse on questions like why every language community seems to have its own purpose-built system and, closer to home, where Andrey and the rest of the build systems team at Jane Street are focusing their efforts. You can find the transcript for this episode along with links to related work on our website.
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street.
I'm Ron Minsky.
Today, I'm going to have a conversation with Andrey Mokhov about build systems.
Build systems are an important, but I think poorly understood and often unloved part of programming.
Developers often end up with only a hazy understanding of
what's going on with their build system, learning just enough to figure out what arcane invocation
they need to get the damn thing working, and then stop thinking about it at that point.
And that's a shame because build systems matter a lot to our experience as developers.
A lot of what underlies a good developer experience really comes out of the build
system that you use. And also there's a lot of beautiful ideas and structure inside of build systems.
Sadly, a lot of that beauty is obscured by a complex thicket of messy systems of different
kinds and a complicated ecosystem of different build systems for different purposes.
And I'm hoping that Andrey can help us see through some of that thicket down to
some of the elegance of all of this underneath.
Andrey is a great person to have that conversation with because he's had a long and interesting
experience with build systems.
Having done some work on Hadrian, which is the build system for the Glasgow Haskell compiler,
and having written a few important papers that try and analyze and break down the ecosystem
of build systems and help you understand kind of
what's going on fundamentally underneath. And finally, in the last few years, Andrey has been working here, working on Jane Street's build systems. So thanks, Andrey, for joining me.
And maybe to start with, you can start by explaining to us a little bit about
what a build system is. Hi, Ron. It's great to be on the podcast.
So what is a build system? Build systems automate the execution of various tasks for individual developers like me and for whole companies like Jane Street. So as a software developer, what are the kinds of tasks that I need to do every day?
I need to compile and recompile source files, I need to run some tests, I need to keep my
documentation up to date. There are lots of tasks like this. Every task is pretty simple and
straightforward on its own, but there are a lot of them and they're tedious and I need to remember to do them in the right order. So it's great to
have some automation that does it for me. In the old days, like how would you do this automation?
You would just write a script and the script was your build system. And that's how I used to work
myself when I used to work on projects with just a few files. And it works perfectly fine in those
settings. Beyond just using simple scripts,
what's the next step up available to developers in terms of automation?
Well, before, I don't know, 1976, when Make was developed, there was nothing else. And people were just writing scripts. And projects were getting bigger and bigger. It became difficult.
And at some point, Make was developed to automate this. So Make was a huge breakthrough. It's very
simple. It allows
you to write down the tasks that you need to execute. And each task has some inputs and some
outputs and the task itself. So what kind of command line you need to run to execute this task.
And this very simple approach worked very well. And it's still being used these days. So these
days, I would say that Make is still the most popular build system out there. If you go to a random open source project, you will find a Make file there.
Which on some level is kind of astonishing, just given how old Make is.
I'm amazed at how successful Make was and still is.
What was the thing that made Make such a big step forward?
Well, it was easy to use, right? You just write a few rules and the format for specifying those
rules is pretty straightforward. It's just a text file and make is available to you on pretty much any system. You type make and it works. I guess it's
that conceptual simplicity which attracted a lot of people to it. And of course, also the pain that
they had experienced before when they had to run all those tasks manually. They kept forgetting to
run things in the right order. And I think this was one of the direct motivations for creating Make.
So you talk about having a simple conceptual model. And what is that model? On some level, when you configure a job in Make, what you're doing is specifying a collection of rules.
Each rule says what you do, what's the command you run to execute that rule,
what it depends on, and what it produces. And now Make has what is essentially a graph-like model
representing what are the things you need to do and what are the dependencies.
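To make that concrete, here is a minimal sketch in OCaml (purely illustrative, not how Make itself is written) of the kind of data that model boils down to:

```ocaml
(* A purely illustrative sketch of the model Ron describes: a build is a set
   of rules, each with the files it depends on, the files it produces, and
   the command to run; together they form a dependency graph. *)
type rule = {
  targets : string list;   (* files this rule produces            *)
  deps    : string list;   (* files this rule depends on          *)
  command : string;        (* shell command that builds [targets] *)
}

(* For example, a small C project might be described roughly like this: *)
let rules = [
  { targets = [ "foo.o" ]; deps = [ "foo.c"; "foo.h" ]; command = "cc -c foo.c" };
  { targets = [ "app" ];   deps = [ "foo.o"; "main.o" ]; command = "cc -o app foo.o main.o" };
]
```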
So given that simple model, what are the services that Make now provides to you that you didn't have before?
The main benefit of using Make compared to a script is that Make only reruns the tasks that need to be rerun.
So how does it figure it out?
So when you have, like, 1,000 files in a project and you edit a single source file, if you run the script, it will basically recompile all 1,000 files, and then maybe relink some executables and rerun all the tests. And obviously, that's very wasteful. If you've only edited one file, only one file needs to be recompiled. So Make figures it out by just looking at timestamps. It looks at the timestamp of the file that you've just edited, and if it's newer than the files that depend on it, then those files need to be recompiled, and then their timestamps get updated in turn. And so this wave of changes propagates through the build graph that you have. And at the end, for every task that you have in your description, you have the simple invariant that all the inputs of a task were produced before the outputs of the task. So this before/after relationship is enforced everywhere in the build graph.
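As a rough sketch of that check (not Make's real implementation, which is written in C and handles many more cases), the core comparison looks something like this; it needs OCaml's unix library:

```ocaml
(* A rough sketch of Make's staleness check: a target is out of date if it is
   missing, or if any dependency has a newer modification time than it does.
   Missing dependencies are silently ignored here, which a real build system
   would treat as an error. *)
let mtime path =
  if Sys.file_exists path then Some (Unix.stat path).Unix.st_mtime else None

let needs_rebuild ~targets ~deps =
  let target_times = List.map mtime targets in
  if List.exists Option.is_none target_times then true (* a target is missing *)
  else
    let oldest_target =
      List.fold_left min infinity (List.filter_map Fun.id target_times)
    in
    List.exists (fun dep_time -> dep_time > oldest_target) (List.filter_map mtime deps)
```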
Right. So service number one that's provided is incrementality. Small changes require
small amounts of work in response to them. That's right. Yeah. Right. And this is important to
developers kind of for obvious reasons, which is to say a key thing you want from your tools is responsiveness. You want to do something,
see the results, think about whether that was the right thing to do and go back and make other
changes. And so having tools that are responsive is enormously helpful. There's a couple of other
things that come to mind for me in terms of what Make helps out with. One of them is it helps get the build right. One of the tricks with rebuilding your software is making sure that you've done all the things that you need to do. Like you could imagine that you changed a file
and then you remember to rebuild that, but you didn't remember to rebuild everything that
depended on it. And so having a simple language where you can express the dependencies, and then
once you get it right, you have a system that takes care of running things in the right order.
There's just a whole class of errors that you can make, either in manual construction of a script or manual kicking off of build tasks that
just go away. I guess another thing is parallelism. This is the incrementality of only having to do a
small amount of work, but having this graph-structured representation of the build
also helps you kick off things in parallel. Yeah, absolutely. Trying to maintain this
parallelism by hand in a build script would be pretty annoying,
right?
You add a new file and suddenly you have a new dependency and you need to rearrange the
parallelism.
So having some automation here is great as well.
Okay, so this all sounds like a very short story.
We had this problem.
It was how to build things.
A nice system like Make came out.
It has a clean way of expressing dependencies and giving you some services off the back
of that specification.
All done, problem solved.
What's missing?
Why is the story of make from 1976 not enough?
So there are a few things that are missing in make.
One of them, as I mentioned, the way make determines that a task needs to be rebuilt
is by looking at timestamps.
But very often this is too limiting.
Let me give you a common example.
You edit a source file by just adding a comment.
And as a result, you recompile that file and the result doesn't change.
But the timestamp does change anyway.
It means that everything that's dependent on that file needs to be recompiled if you're
using Make.
So this is one of the problems that basically sometimes you want what we call early cutoff
in build systems.
So if you recompile a file and the result is the same, you don't want to
recompile everything else that depends on it. So that is one feature that was missing.
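A hedged sketch of what that extra bookkeeping could look like (illustrative only; Digest is OCaml's stdlib MD5 module, and the table below stands in for whatever persistent store a real build system would keep):

```ocaml
(* A sketch of early cutoff: record a hash of each task's output, and only
   mark its dependents dirty if that hash actually changed. *)
let previous_output_hashes : (string, Digest.t) Hashtbl.t = Hashtbl.create 16

let rebuild_and_check ~target ~run =
  run ();                                       (* produce the new output file *)
  let new_hash = Digest.file target in
  let changed =
    match Hashtbl.find_opt previous_output_hashes target with
    | Some old_hash -> not (Digest.equal old_hash new_hash)
    | None -> true
  in
  Hashtbl.replace previous_output_hashes target new_hash;
  changed   (* if [changed] is false, everything downstream of [target] can be skipped *)
```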
Right. So essentially the problem is that the heuristic that Make uses for figuring out what
needs to be redone is overly conservative, right? It will refire when it doesn't, strictly speaking,
need to refire. That's right. It's overly conservative. And at the same time, I like it a
lot because it basically just repurposes existing information. It's just the information that is out there.
The file system records modification times.
So make itself doesn't store anything at all, which is a very nice property to have.
But if you want to support things like early cutoff, you now need to start storing some information.
Right.
I guess in addition, Make's timestamp-based recomputation is, as you say, overly conservative.
It's also under-conservative,
isn't it? I.e. it will fail to rebuild in some cases where it should.
Yeah. And it's fairly common, for example, that various backup software would mess up modification times, which would basically cause your project not to compile
would mess up modification times, which would basically cause your project not to compile
correctly. And then you would have to go for the so-called clean build to just get around
that limitation. So yes, make's reliance on timestamps is problematic.
Something goes wrong with make.
The answer to this is never to figure out what happened.
It is always just throw away all your state,
throw away all incrementality and restart it.
It's like on a computer, like something doesn't work.
Why don't you reboot?
The build system equivalent of this is make clean.
So that's one class of problems.
What else is missing in make?
So make works very well for an individual developer. But one example where it starts to
become too slow is when many people develop software. So in a company like Jane Street,
we have hundreds of developers, and they all work essentially on the same code base.
What it means is that when a person checks out the sources from the central repository,
we need to run Make to build everything. And then another developer does it,
and then another developer needs to build everything. And then another developer does it, and then
another developer needs to build everything again. You start realizing that all of them are just
building exactly the same files all the time. And it makes sense to somehow share this work.
And the way modern build systems do it is by uploading build results to the central
cloud repository, where they can be downloaded and reused by other developers. It means that
you aim for the very same task
not to be executed more than once. You compile it on one machine and then you share the results
with everybody else. This is something that Make can't do and many modern build systems have to do.
In some sense, you're pointing at a scalability limitation with Make. It's a pretty deep limitation: Make is limited to just the things on your machine. And that limits how effective
it can be because there's sharing that you could have between different people on different
machines. But I think Make has more scalability problems than that. Doing large builds in Make can be quite expensive for a few reasons. One of them is just the thing we talked about before: make clean is a serious problem because of the fact that Make doesn't always get it right.
And sometimes you have to be like, oh, that didn't work. I'm just going to throw away my
state and redo everything. Redoing everything in a really large code base is a disaster,
right? It can be a huge amount of time. It'd be like, oh, now I have to wait 40 minutes
for this thing to build. I think there are also scalability problems in Make in the direction of complexity. What are the limitations of Make when trying to specify large, complex builds? Yeah, and this is something that I learned about five years ago when
I started working on the Hadrian build system for the Glasgow Haskell compiler. GHC was built using Make,
and there were a lot of Make files. So the very first problem that I faced, GHC is an old project,
and it means that the build rules are complicated. Just to give us a sense,
roughly how big is GHC? GHC, well, I think it's a million lines of code, so it's pretty big. The makefiles themselves are pretty big as well. There are at least 20-30 biggish makefiles in GHC that take care of compiling it, and a lot of smaller ones scattered around the code base.
And once you start going beyond a single make file,
let's say, I don't know, 100 lines long, you start facing the limitations of the programming
language itself. So make, like the language of specifying dependencies, you can also think of it
as some form of a programming language because it has variables, it has some conditional structures.
So you're developing a collection of build rules using this language, and this
language is not very good. It's a single namespace of string variables. They're all mutable. The way
that most complex bits of logic have to be done is macros. So you have to splice strings into
other strings. You have to take care of not using the special characters in splicing, because
otherwise things go terribly wrong. So this is not a very good programming model to use for large
projects. I think the problem you're pointing out here is essentially
that make isn't great from a composability perspective. We want to be able to take a piece
of code and reason about it locally and from that be able to predict how that code will behave when
it's embedded in some larger context. But the fact that make has a single global namespace gets in
the way of that. Since now any two parts of my build specification that just happen to use the same name for a variable for different purposes, well, those two parts of the
build specification are going to interfere with each other. And that more or less destroys the
ability to reason locally about the build. The other thing you talk about being a problem is
macros, which is interesting because macros are also supposed to be a solution to a different
problem, which is the problem that Make's build language is not terribly expressive. So Make's core build language is relatively simple. It's got
the simple dependencies that you express with some set of inputs and some set of outputs and a
command that you invoke to construct the output from the inputs. But that simplicity means that
when you have a really complex build, you can end up having to write really quite a lot of these rules. And a way to avoid this is to extend the expressiveness of the language. And
macros are a way of doing that. Macros are essentially code that generates more code,
or in the context of Make, build rules that generate more build rules, which really does
help. It lets you write down your intent more compactly. But macros can be incredibly hard
to reason about. And it gets really bad when you
have macros that generate macros that generate macros. You mentioned generating macros. And this
is one other limitation of Make is that Make fundamentally requires you to specify your build
tasks upfront, to specify inputs and outputs and all the build tasks. And if for some reason,
some of the tasks are not known initially, for example, maybe you might want to generate some files, and then maybe those programs that are built are going to generate more files.
And then as soon as you start having these multiple stages of file generation, make files
start to crumble in a new way, right? So now, on top of macros, you have to generate those macros.
And when I was looking at the code base of GHC, I was coming across lines that were maybe 50 characters long, but they contained, like, 20 dollar signs in them. And all those dollars were indirection because of macros.
Completely impenetrable.
And nobody knew how they worked, but they worked and somehow had to be migrated to a
new build system.
And that was my goal, which is still ongoing.
So five years later, that project is still not finished.
The product here is, again, Hadrian,
which is a new build system for GHC.
What actually is the status of Hadrian?
Is it functional at all?
Is it deployed in any way?
How far along has it gotten?
It is functional,
and I'm happy to see that many people are using it.
But the difficulty in switching
from one build system to another
is that the build system
that has been alive for many years
accumulates a lot of cruft that supports various unusual use cases. So maybe five years ago,
somebody added a special conditional that takes care of compiling GHC on that particular
architecture. And just to understand how that works requires perhaps talking to that person
who added it because nobody else understands why that is needed. If you just blindly convert it to
the new build system,
you might miss out some special tricks that you need to do. So it's just taking care of all of
this long tail of various features that have been added to the build system over the years
just takes a lot of effort. It requires involving many people into this communication.
I think of this as an example of a more general problem of dealing with legacy systems,
which is I feel like the kind of enthusiastic, idealistic programmer perspective is, there's a problem that needs to be solved.
I'm going to think hard about it and understand what the right solution is. And then I'll write
a piece of software that solves the problem. There's a bad old build system. I'll write a
new one that does the right thing. But the problem is that the old software is not
someone's idea of the right answer converted into code. It is a knowledge base which has
accumulated 25 years of various different people's ideas about how to get this or that problem
solved. And migrating to a new thing isn't just about building the new thing. It's an archaeology
expedition where you have to go in and dig deep and extract that knowledge or resolve the problems
from scratch. And that's why these projects that seem relatively straightforward
can take years.
Archaeology is a very good analogy.
I felt that I was doing archaeology.
Five years ago,
it was just basically talking to people
and trying to figure out
why that line is here
and what does that file do.
And it's still going on.
And I feel like in some sense,
the best way to make progress
is just to switch over
and then maybe the whole project will go down for a few months until everybody figures out
how to use it and fix all the main bugs.
But otherwise, right now, we just maintain two build systems in GHC.
And this is, of course, far from ideal.
It sounds like you solved the first 90% of the problem.
And now you have to solve the next 90%.
And then maybe the 90% after that.
Hadrian was your first foray into working on build systems,
and you learned a lot about the problem from there, and you ended up writing a few interesting
papers about how build systems work more generally. Could you tell us a little bit
more about those papers and what the core lessons were from them?
Yeah, sure. The first paper was about the limitations that we encountered in Make,
and why we decided to write that build system using Shake. So Shake is a Haskell library for developing build systems.
So it's a modern tool.
So it supports various features that we need,
for example, early cutoff and sharing builds in the cloud
and dynamic dependencies.
So being able to generate new build rules as you go.
So it supports all that.
That's why we wanted to rewrite it.
And also on top of that,
it comes with a sensible programming model.
So you just write your build rules in Haskell.
And Haskell has a lot of abstractions that are missing in Make.
The first paper basically just goes through this project and describes it in detail
and shows some of the abstractions that we built on top of Shake.
So Shake is a low-level library, but we built a lot of high-level abstractions
for building Haskell packages.
We described these abstractions in the earlier paper,
which is called Non-Recursive Make Considered Harmful, which is a pun on an earlier paper which was called Recursive Make Considered Harmful. So in some sense we're saying, well, don't use Make for big projects at all. So after we finished that paper, we also wanted to look at a wider context of build systems. There are a lot of build systems out there, and in some sense it feels a bit sad that every community has to redevelop their own build system, and we were trying to figure out what were the differences and commonalities in all the major build systems out there.
So we wrote another paper called Build Systems à la Carte, where our goal was to look at these systems and distill the differences, at least the differences in the algorithmic core, to simple models, maybe 20-30 lines of code, so that they become comprehensible to a human being,
unlike looking at a million-line project. And by looking at these small models of every build
system that we looked at, we figured out that there are two main components in build systems.
One takes care of figuring out the right order in which the tasks should be executed,
and we call that component a scheduler. And the second component just takes care of executing a
single task.
So it's where all the caching happens.
It's where the communication with the cloud happens.
It's where the timestamp manipulation happens in Make.
The second component is called Rebuilder.
So these two components, scheduler and Rebuilder,
can be combined together to form a build system.
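Here is a very loose OCaml paraphrase of that split (the paper's actual models are written in Haskell and are more precise than this; the names below are just for illustration):

```ocaml
(* A task computes its value, requesting the values of its dependencies
   through the [fetch] callback it is given. *)
type ('key, 'value) task = fetch:('key -> 'value) -> 'value

(* A rebuilder decides, for one task, whether to rerun it or to reuse a
   previously recorded value (via timestamps, hashes, a shared cache, ...). *)
type ('key, 'value) rebuilder =
  'key -> previous:'value option -> ('key, 'value) task -> ('key, 'value) task

(* A scheduler decides the order in which tasks are brought up to date
   (topological, restarting, suspending, ...), using a rebuilder on each. *)
type ('key, 'value) scheduler =
  ('key, 'value) rebuilder -> ('key -> ('key, 'value) task option) -> 'key -> 'value

(* A build system is then just a scheduler applied to a rebuilder. *)
let build_system (scheduler : ('k, 'v) scheduler) (rebuilder : ('k, 'v) rebuilder) =
  scheduler rebuilder
```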
We discovered or rediscovered four different schedulers and three different Rebuilders.
And you can combine
and mix them in various ways. So this explains the title of the paper. So you have a menu and
you can pick a combination of these two things and out pops a build system. So some of these
combinations were known before, but some combinations were new and we were excited
because we thought that maybe this can lead to new build systems, to better understanding of the whole design space.
Can you give us an example of a combination that doesn't, or at least at the time, didn't exist?
So one such new combination is Cloud Shake.
This combination comes from combining Shake,
the build system that was developed by Neil Mitchell, who is the co-author of the paper,
and Bazel.
So Bazel provides the cloud build functionality.
So Neil wanted to extend Shake with cloud build functionality for a
long time, but it wasn't clear how to do it. And while writing this paper, we realized that the
cloud build functionality only touches the rebuilder part of the build system, and it can be
combined with the scheduler part that comes from Shake in a very clean and simple way. So, what the
paper does, it basically describes the scheduler of Shake,
it describes the rebuilder of Bazel, and it shows that by combining them, we get CloudShake,
which is a new point in this design space which wasn't known before. CloudShake uses the so-called
suspending scheduler. It supports dynamic dependencies, because essentially it just
suspends tasks as soon as a new dependency is discovered that needs to be built. So it's a bit
more complicated than just the depth-first traversal because we need to take into account the parallelism.
So you cannot just go ahead and build a single task all the way down
because it would be too slow.
There is some extra complexity there.
So this is the scheduler of CloudShake.
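A toy, sequential version of a suspending scheduler might look like this in OCaml (it deliberately ignores the parallelism just mentioned, and the names are invented for illustration):

```ocaml
(* When a task asks for a dependency that hasn't been built yet, it is
   effectively suspended while we go and build that dependency first. *)
module String_map = Map.Make (String)

type task = fetch:(string -> string) -> string

let build ~(inputs : string String_map.t) ~(tasks : task String_map.t) target =
  let store : (string, string) Hashtbl.t = Hashtbl.create 16 in
  let rec fetch key =
    match Hashtbl.find_opt store key with
    | Some value -> value                        (* already built this run *)
    | None ->
      let value =
        match String_map.find_opt key tasks with
        | None -> String_map.find key inputs     (* a leaf input, e.g. a source file *)
        | Some task -> task ~fetch               (* "suspend" here: build deps on demand *)
      in
      Hashtbl.add store key value;
      value
  in
  fetch target
```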
And the rebuilder, well, how does CloudShake determine
whether a task needs to be rebuilt?
This rebuilder, we call it the Constructive Traces Rebuilder. This actually comes from Bazel. It has a task that needs to be executed. It looks at the inputs of the task
and the task itself, and it creates what's called the task hash or rule hash. It basically accumulates
everything there is about this task. And then once you have this rule hash, you can make a request
in the cloud storage and ask whether somebody else has
run that task before. Since this hash identifies the task uniquely, it allows you to make this
request and receive the result without doing the computation yourself. So if you don't have the
result, you don't have to execute the task yourself. You can just get the result from the
cloud and use the files directly. So what you're essentially depending on in this case is having some big lookup table in the sky, which you index by the hash of the rule that says what to do, and it pulls up the artifact. And that's like the key extra bit of infrastructure you need.
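As a hedged sketch of that lookup-table-in-the-sky idea (the Cloud_store below is a made-up in-process stand-in, not Bazel's or Shake's actual API; Digest is OCaml's stdlib MD5 module, used only for illustration):

```ocaml
(* A rough sketch of the constructive-traces idea: key a shared cache by a
   hash of everything that determines a task's output. *)
module Cloud_store = struct
  let table : (Digest.t, string) Hashtbl.t = Hashtbl.create 16
  let lookup key = Hashtbl.find_opt table key
  let upload key output = Hashtbl.replace table key output
end

(* The "rule hash" combines the command with the hashes of all its inputs. *)
let rule_hash ~command ~input_hashes =
  Digest.string (String.concat "\n" (command :: input_hashes))

let build_task ~command ~input_hashes ~run =
  let key = rule_hash ~command ~input_hashes in
  match Cloud_store.lookup key with
  | Some output -> output            (* somebody already ran this exact task: reuse it *)
  | None ->
    let output : string = run command in
    Cloud_store.upload key output;   (* share the result with everybody else *)
    output
```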
You just kind of plug that rebuilder in to the Shake style scheduler and poof,
you have a new kind of build system. That's right. Of course, it's much easier to say than do. So
in our models, it was just a combination of the scheduler and the rebuilder. But Neil Mitchell, who is the lead
developer of Shake, actually spent quite a lot of time to make it all work. Now it works. But at
least the paper gave him a very nice blueprint of what to do. So before writing that paper,
it was like a very vague task about how to do it wasn't entirely clear. But after studying all the
build systems and figuring out that this is exactly what Bazel does, it was much easier to follow through this blueprint.
One thing that strikes me about the world of build systems is just how many of them there are.
Not only are there lots of successor systems to make, but these systems are broken off into different organizations and communities.
Lots of companies have their own build systems.
There's Bazel, which is an open source build system, but is actually a successor to an older internal one from Google called Blaze. And then Twitter and Facebook built similar systems
called Pants and Buck. And there's a ton of different language-specific build systems.
So Rust has a build system integrated into Cargo, and Haskell has a build system
integrated into Cabal, and OCaml has several build systems, which I expect we'll talk more
about a little later. And there are even build systems that started out
as being specific to an individual open source project.
Ninja is a build system that was built
for Google's Chrome browser.
Again, totally separate from Blaze,
which was Google's internal build system at the time.
So I just wonder why,
why is there this wild collection of incompatible systems
where each different organization or community
has felt the need to create their own? Like the whole thing is not just a ridiculous amount of
work, but it also gets in the way of some natural things you'd like to do. For example, if you want
to build a project that has some Rust and some Haskell and say some Java in it, how do you do
that when each of those communities has their own build system? I guess this all comes from the fact
that different communities have different infrastructures, have different requirements in terms of the kind of
tasks that need to execute. And they also speak different languages, different programming
languages. For example, for developers in OCaml, it's much easier to describe all this build logic,
which gets fairly complex using OCaml because this is the language you speak. Of course,
we could have taken an existing build system like Bazel and written some build rules that are needed for OCaml in Java and Python. So it would be possible, but we would have to maintain a lot of code written in languages
that are not very common in the community, which would probably make it difficult to iterate and
improve those build rules, but it would be possible. I'm actually pretty happy that we
have a lot of build systems out there that keep exploring the design space. Every now and then,
some new idea comes up and I think it's good that many different people are
looking at the same problem from many different angles. I agree with that, but on some level the
fact that programming language is a thing that fractures the space of build systems seems really
regrettable. It's obviously good that there's freedom to create new build systems that solve
problems in new ways, but the idea that every language community ends up creating their own build system just so that they can write build rules in their
own language, that really seems like a shame. Have you thought about the question of whether
there's a way of architecting these systems to make them more language independent?
Some build systems were designed with that in mind. It's pretty common to come across
projects that instead of using a build system directly, they instead generate build scripts from higher level specifications.
For example, you might take a build system like Make or a build system like Ninja,
and you would generate these low-level descriptions of what needs to be built
from a higher-level description that can be written in your own language of choice.
This does happen, and I've seen a few examples like that.
I guess one problem that is not fully solved here is that, again, we're coming back to
the code generation problem.
So very often, you don't know all the build rules up front before you start your build.
So you want to be able to support some kind of back and forth communication between the
build system and the higher level build generating logic.
And the build systems that I know of, I don't think they
provide anything like this. They typically require you to generate build rules upfront,
and then the build system goes ahead and builds stuff. But there is no feedback mechanism which
allows you to say, please produce me results so far, and I will generate more rules after I see
them. Maybe there's scope for a kind of build engine, which is completely language agnostic, which only provides this interface where you can specify some initial
collection of rules, get some feedback, generate more rules, and iterate until you get to some
kind of a fixed point. I feel like on the face of it, it might seem like using one of these
language agnostic build systems like Ninja wouldn't be so bad in the sense that your
dependencies don't change nearly as often as your code does. So you might imagine you can get away without explicitly handling the dependencies
and having to do the occasional build and then build again in order to deal with it.
But I feel like in practice, that often turns out poorly because you end up compromising on
the correctness of the build because you don't always know when the build is up to date. And if
you do have a rule
for figuring out when you need to rebuild, that rule is often very conservative, which is to say
every time you discover that the rules aren't up to date, you're like, oh, now I have no idea what's
going on and you have to rebuild everything. And in a really large system, rebuilding everything
is a disaster. I think it's pretty bad for us at Jane Street, where our code base is, oh,
15 million lines or 20 million lines in our primary repository.
And the whole thing on a big parallel machine takes maybe an hour to build.
But if you look at a Google scale system, it would be madness, right?
They have billions of lines of code.
And I think, like, I have no idea even how long it would take to do a full rebuild, but I'm sure it's really quite a lot of CPU resources that would need to be thrown at it.
The very simple model of something like Ninja is in some ways very appealing, but it's just not powerful enough to provide the full set of services you want to make a build that really scales.
I've seen a few projects in this space where even trying to parse all the build descriptions in your repository takes so long that you actually want to incrementalize that step as well. So for example, I think there are some open source
build tools from Microsoft that have been released recently, like BuildXL and AnyBuild.
So they actually take some extra steps to make this kind of parsing of build descriptions
incremental. I'm sure there's a lot of development work and also research work to be done in this
area. So now let's switch gears and talk about what you've been doing most recently since in the last year you've joined Jane Street and are working on our
build systems team. Maybe you can start by giving us a thumbnail sketch of how Jane Street's build
system works now. So the current build system is called Jenga and it's been in production for a
long time. And it's a build system that is tailored for a monorepo. So we have a monorepository at Jane Street.
A monorepo is when you put all the projects that you work on in a single repository,
which really simplifies problems with versioning.
The opposite of monorepo is when every project lives in its own repository,
and then it's much easier to iterate on individual projects.
But then if your project depends on some other project,
now you start having problems because you might depend on
some particular version of the project, but that project might move ahead and there's a new version project
available and you might want to switch to that new project, but another dependency that you have
actually depends on an older version of the project and you start getting into the so-called
versioning dependency hell. The joy of the monorepo is you get to build your code without
ever having to say the word constraint solver. So when you say Jenga is oriented towards monorepos, what do you mean? In what way
is Jenga specialized towards that use case? One way it is specialized to this use case is that
the language that we use to describe the tasks that need to be built in the monorepo is not
versioned. So essentially, you have to describe all the rules of all the projects that you need
to build in exactly the same version of the language, which would be very difficult to do if you had multiple repositories, and each project would be
most likely using its own version of the configuration language, so it would be pretty
hard to synchronize. And the second build system that I'm working on is called Dune. This is also
an OCaml build system, and that build system is different in this respect. So in Dune, the
language we use to describe build goals is versioned, which allows you to have multiple
projects developed independently in the OCaml community. Each of these projects has its own
description of build tasks that need to be done. It uses its own particular version of this language.
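To make the versioning point concrete, a Dune project pins the version of the configuration language it speaks in its dune-project file, and its build description is written against that version. The files below are illustrative (the library names are made up), but the stanzas are real Dune syntax:

```
;; dune-project — the project declares which version of Dune's language it uses
(lang dune 3.0)

;; dune — a library description written against that version of the language
(library
 (name my_library)
 (libraries base stdio))
```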
And those projects can live together happily. And the Dune build system itself can develop
without fearing of breaking all
those projects that depend on it. So I want to talk more about Dune and in particular,
the kind of shocking and embarrassing question of how it is that Jane Street had the ill fate to
write not one but two build systems. But before we do, let's talk a little bit more about Jenga.
Can you give a sense of how you actually work with Jenga? Like, you talked a little bit about what the build language is, and before, you were talking about the build language being OCaml. So maybe you can explain what are the different ways in which you write build rules and how they are specified and all of that.
In Jenga, we have two different levels
at which things can be described.
At the front end,
developers describe build tasks
in a very limited, constrained,
and also simple language,
which allows somebody to describe the
name of the library that they want to build, the dependencies of the library, and a few various
configuration flags. But that's it. That is parsed and analyzed by the build system itself, which is
written in OCaml. It has access to all the powerful OCaml features, which allows us to
analyze those build specifications and execute them as efficiently
as possible.
So this front-end language is very limited because we want to be able to analyze it and
we also want to safeguard users from various errors that they can make.
If they had access to the low-level language that we are using, they could accidentally
make the build non-deterministic or make it slow or make it cyclic and various other things
that could go wrong if they had all
the freedom. But we constrain them so that they don't make those mistakes, and we can execute the
build descriptions as efficiently as possible. The basic structure here, where you have two
languages that you use for constructing build rules, is actually pretty common, right? So you
typically have some kind of user-facing language, which is simple and constrained, and that's what
the vast majority of users use to set up their builds. But then you can also write build rules in a full-fledged
programming language, typically the same language that the build system itself is written in.
Not every build system works this way. Make in particular does not. No one specifies their build
in Make by extending Make's implementation in C. You always just add build rules by writing them
in Make's build language. But that's actually a real problem. One of the things you run into with Make is that as you put more and more pressure on
Make's build language, it becomes clearer and clearer that it's just not up to the job. So if
you look at systems like Jenga or Hadrian or Bazel, you end up with this two-layer structure.
In Jenga, the inner language is OCaml and the outer language is this simple data-oriented
configuration language we've been discussing. I'm curious how Bazel compares. The inner language for Bazel is Java, since that's what Bazel is implemented in. And the outer language is this language called Skylark,
which is a language that on the surface looks a lot like Python, but considerably more limited
than Python. Can you say a little bit more about what the actual limitations on Skylark are?
Yeah, it is pretty limited. And you want it to be limited. Like, what are the reasons why you want it to be limited?
Because you want, for example, our builds to be deterministic.
You don't want to have a library that depends on a random number generator.
So you don't want random numbers to be available in your language
that you use to describe the top-level configuration build rules.
You also very often don't want to have any form of recursion in that language,
because that makes it difficult to reason about. You typically limit that language to just description
of data and perhaps occasionally some conditionals because you might want to say that on this
architecture I want to pass this flag which is unavailable on different architecture. Although
users often demand more power and if you look at the github repository of Bazel you will see a lot
of issues where users are requesting, I don't know, while loops to be added to the language or the ability to be able
to describe in some form dynamic dependencies right in that user-facing language.
So there is a constant battle between the users demanding more and more power and the
build system developers who try to restrict the users because one reason is because they
want to be able to schedule builds efficiently.
And if the language is restricted, it allows you a lot of static analysis opportunities so that you can actually produce very fast builds. But also, if you give users a lot of power,
they will undoubtedly start shooting themselves in the foot and you will basically have a lot
of broken builds because of non-determinism, for example, and you don't want that.
It's maybe not obvious, but determinism is really important for a bunch of the other
things we talked about before. If you don't have determinism,
then the whole notion of doing cloud builds is very compromised because you want to share
artifacts from different builds. You have to know when the artifacts are supposed to be equivalent.
And if you have non-deterministic things that are feeding into the construction of these build
artifacts, it's very hard to tell if anything is correct at that point. So can you also say something about where Jenga fits in, in the kind of build systems
a la carte taxonomy?
Yeah.
So right now I'm actually working on moving Jenga along one of the axes in the paper.
So I'm working on adding support for cloud builds.
So Jenga right now is still in the world where every software developer has to build everything
from scratch on their machine, which is of course not ideal. So my recent project was adding functionality where
it's possible for developers to exchange their build results via a central repository. And that
project is still ongoing. So Jenga will be in the same box as Bazel and as Shake, which will
support cloud builds. Or as we say in the paper, it will have this constructive trace rebuilder.
How about in terms of the dynamic versus static axis? So Jenga supports dynamic dependencies. It uses a suspending scheduler as well.
So basically, it kind of initiates build tasks.
And as they discover more and more dependencies,
these tasks can be suspended until those dependencies are built.
So it looks very much like Shake does, pre-cloud builds.
That's right.
That sounds like a nice build system to have.
Why does Jane Street have two build systems?
In some senses, this is an accident.
The OCaml ecosystem was using many different tools to build their projects. So Jane Street used Jenga internally, but external developers were using tools like ocamlbuild and Make to build their own projects. So we decided to automate the release of our projects to the external world by making it easy to build our projects. And the way we did it, so Jérémie Dimino, who is the
manager of the build systems team at Jane Street, wrote a very simple tool. That tool was simply
building a project from scratch without any incrementality or caching to produce, for example,
a library that everybody else can link against. A lot of people outside Jane Street started to use it to build their own projects.
So even without support for incremental rebuilds, it was so fast that many people started to use it,
which was very surprising. It was fast when you wanted to build your project from scratch,
but if you wanted an efficient incremental build, the initial version of Dune did not help you.
Yeah, that's right.
But it was a hell of a lot faster.
A lot of work was put into the efficiency of parallel builds.
Another thing that Dune got right from the get-go was portability.
But yeah, I think the primary feature that got people excited was speed. It was something like five to ten times faster than other available solutions.
Yeah, so as this build system got more and more popular, Jeremy started to
improve it bit by bit. For example, it added incrementality, it started supporting more and
more build rules, and everybody using Dune was contributing to making Dune better. And at some
point, it just got even more features than Jenga. So right now, for example, Dune has some rudimentary support for cloud builds, whereas Jenga is still working on that. So now our plan is to eventually switch to Dune.
So we want this new shiny build system
that everybody else uses.
It also happens to be faster than Jenga,
which is another reason why you want to switch.
So we are in the process of doing this.
And yeah, this project has already been underway
for a long time,
and it's still going to take some time.
Highlighting the point that migration is difficult.
I think it's getting close to being true that Dune has strictly more features than Jenga,
but it's not quite there yet, is it? I think there are a lot of features associated with doing things
at scale that aren't totally worked out. For example, it's not clear to me that Dune is yet
faster than Jenga on very large builds. Maybe that's changed recently, but at least for a while
that was true. Yeah, absolutely. And I think maybe for any two build systems, you just can never compare them. No build system is fully worse
than any other build system because there are always some features that a build system has that nobody else has. There are just, like, so many features, it's difficult to compare them.
So one of the hard parts about doing a migration like this, you're trying to change out the wheels
on a car while the car is driving. How do you guys make forward progress on the transition while not stopping the entire organization
from getting improvements as things move on? Whenever we implement a new feature in Jenga,
we know that we will actually have to implement something similar in Dune as well. So it means
we have to do double work, which slows down the process of migration. What helps is that we have
a lot of external contributors in Dune, so workload can be shared quite efficiently. So we have a very good community. Community helps a lot.
Is there anything that you guys are doing to try and reduce the amount of dual work going forward?
Right now, at least we would like to stage some of this work in the sense that we are splitting
Dune into two components, where we have the backend, which is the incremental computation
engine itself, and the frontend, which describes all the rules for compiling OCaml, and also
like various other rules which are specific to external projects. So we want to disentangle
these two parts, the frontend and the backend, and then once it's in place, we'll be able to,
for example, swap Jenga's frontend to use the Dune frontend, maybe keeping the backend still.
So we will be able to stage the process of switching
from one build system to the other
by just swapping various parts of these build systems one by one.
Got it.
And I guess one of the key pieces there
is getting it so that you don't have to maintain two sets of rules.
This goes back to your point about Hadrian,
is that one of the things that was hard about migrating
the makefiles over to Hadrian is
there's a huge amount of knowledge
about how GHC
needed to be built in all of these different contexts. The same exact thing is true about
Jane Street and about all of our build rules. There's a huge amount of knowledge about how
Jane Street operates on the software side that's essentially embedded in, I don't know, what are
those, 20,000, 40,000 lines of build rules? And so getting to the point where you can share those
between the two systems
seems like it's a pretty critical step to avoiding having to kind of redo things over and over on the
two sides. This is exactly what we are trying to do. We would like to start sharing some code as
soon as possible. And this sharing of this knowledge is probably the hardest part because
we keep generating this knowledge every day and maintaining it in two different systems is a
nightmare. At least the two systems are in the same language. So that helps a little bit.
That's right. So when you've talked about the ways in which build systems have evolved over time,
a lot of the points you're making are about scale. Scale in terms of the complexity of the build
to deal with like large projects and organizations, and scale in terms of performance, of just being able to do large builds with large numbers of people who are constantly interacting
with and changing the code. But there are other ways in which modern development environments
have changed that also play into this, which is there's more and more integration between
the development environments that we work in and the build tools. And it seems
like build systems play a pretty key part of that integration. Can you say a little bit
about how this plays out and how it influences the features that build systems need to provide?
Probably go back to that example of everyday tasks that a software developer needs to do.
As you work on a software project, you keep editing source files and you keep running the build system.
And, like, the simplest way of doing this is: you edit the file,
you save the file,
you go to the terminal,
you run the build,
it succeeds or it fails with an error.
Much more common case,
it fails with an error.
You look up the line where the error is
and you go back to your editor
and you change that line.
So this is a fairly long cycle.
It may take seconds just to switch between different windows and looking up the line. So a very
natural idea is to start integrating them somehow. So maybe you start calling the build system when
you hit Ctrl-S in your editor. So as you do it, the build system is called and the error message
is displayed right in the editor and you can click on it and immediately jump to the right line and start fixing that line.
So this shortens the iteration loop from multiple seconds to a second,
maybe even below a second, and this dramatically increases the productivity of the developers.
This is a very simple integration, it's just with the editor.
But there are various other integrations.
So you want to integrate your build system with the continuous integration system of the whole company, right?
As we push commits to the monorepo, we would like to make sure that the whole monorepo
still builds, that all tests still pass.
So you want to integrate your build system with that infrastructure too.
You also want to integrate a build system with various documentation and code search
infrastructure.
So if there's some front end where you can type a query about a function
you are trying to find, right? Of course, that index that you're going to search is probably
going to be generated by the build system because the build system will need to process all the
files and spit out some index representation of all the sources. So you start integrating
with all these various tools and this list keeps growing. I feel like some of those integrations
you describe are pretty different from others. So the integration of, oh, and we need to do
documentation generation, on some level, it just feels like another kind of build target and another
kind of rule that you need to generate. On the other hand, the kind of editor integration feels
significantly more fraught because there you want finer grained incrementality than you usually get. Like one of the great features of modern IDEs is as you type, you get auto-completion.
So you get suggestions based on where you are in your program, which requires the program
being partially compiled.
It might not even parse, but you still want to be able to get feedback from the IDE.
All of which means that your IDE becomes a kind of specialized pseudo-compiler that needs to be able to run something like the ordinary compilation on a file as it's being edited.
And to do that correctly, it needs configuration information about the build,
which means there's a new integration point where the build system needs some mechanism
for surfacing information to the IDE rather than to the ordinary compiler toolchain.
That's right.
So there's another kind of integration, which I haven't actually seen happen in a general
purpose-built system, but I'm wondering whether it's a direction that one should expect.
The example I'm thinking of is the compiler for a language called Hack.
So Hack is this programming language that's used at Facebook, which is essentially an
extended version of PHP with its own type system that was added on top
to make PHP a more reliable language to engineer in. And I remember already many years ago seeing
an awesome demo where the primary author of the system showed how the whole system had extremely
fast incremental compilation times because users demanded it. People who use PHP were not used to using a compiler.
So their expectation is you write your program
and then it runs and everything works immediately.
And they're not okay with the idea
that they write their program
and hit refresh on their browser
and they get anything other than the latest thing.
And so they wanted really fast updates.
And he showed me how he could do a Git rebase
and thousands of files would change.
And the compiler would just kind of keep up. And then from the time that Git finished doing the rebase,
it was 10 milliseconds until the compilation was complete.
And I was astonished.
It was a very impressive demo.
And this demo is done on the back of a custom compiler
and a whole parallel programming system
that keeps track of dependencies down to the function level,
so that if I modify something in a file in just one function, it will know just what depends on
that function and only recompile the bare minimum of things that need to be recompiled. To be fair,
it's only type checking rather than compilation, but still, there's an impressive
amount of engineering required to get that to happen. Have you seen any movement in the build
system world to provide this kind of
functionality that you have in a system like Hack, but on a more general purpose basis so that you
don't have to rebuild the entirety of that infrastructure for every language that supports
that kind of fine-grained incrementality? In some sense, we start moving from automating
tasks of developers to automating tasks of a compiler and maybe of some other tools. So a compiler also needs to do a lot of tasks. It needs to type check a lot of
functions and needs to generate code. So these tasks, in many cases, are exactly the same because
the file is changed in just a few lines, right, between different commits. So you don't want to
repeat type checking of all the functions because you just need to type check one single function
that changed. So why don't we use build systems to automate that too? It sounds like a
perfectly sensible approach. And I've seen already a few projects like this. So you mentioned Hack.
Yeah, that's a very good example. But I've seen, for example, the Shake build system also used in
this context for improving the incremental compilation of Haskell projects. And so what
needs to be done to support this is basically to provide some kind of APIs where various tools can integrate with the build system
by declaring tasks that are not only at file-system granularity. Maybe you want to declare
tasks whose granularity is just a single line of code that you need to compile, for example.
So we go down in terms of granularity of the build tasks,
and we need a way to describe these tasks in some way and present those tasks to the build system and get results back.
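As a purely hypothetical sketch (invented names, not any existing API), going below file granularity might mean making the keys in the build graph individual definitions rather than files:

```ocaml
(* Hypothetical: build keys at the granularity of individual definitions, so
   that editing one function only invalidates the tasks keyed by that function. *)
type key =
  | Parse_file of string                                  (* e.g. "foo.ml" *)
  | Type_check of { file : string; definition : string }
  | Generate_code of { file : string; definition : string }

(* The compiler would register one task per key with the build engine, which
   would then track dependencies between keys just as it tracks them between
   files today. *)
type 'value task = fetch:(key -> 'value) -> 'value
```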
So I've seen a few examples where the build system is integrated with a compiler in this way,
where you would like to automate some tasks of the compiler.
But I haven't seen a complete well-thought-through API
that lists everything that there is to do for the compiler and presents it in a way that's kind of extensible for
other tools.
So basically right now, I think we are at a stage where multiple groups are trying to
do the same thing.
They are doing it in their own ad hoc way, but the general approach is not yet thought
through.
Yeah.
I also think one of the challenges there is that as you try and drive the granularity of the operations down, you're going to have to abandon some of the standard
structure that's pervasive amongst pretty much all build systems, which is to say build systems
normally assume that they can do their tasks by invoking a program and having it run kind of like
a function. You present some inputs, you get some outputs, poof, you're done. But that just doesn't work when you're talking about very
fine-grained parallelism and fine-grained incrementality, because the tasks are so small
that the work it takes to deserialize all the state that you need in order to do the computation
is going to very soon start dwarfing the amount of work that's required to actually do the
computation.
And then you just can't profitably reduce the granularity anymore once you hit that boundary.
So I feel like build systems are going to have to come to grips with some kind of story around
like shared memory for keeping data structures around or persistent build servers or something.
So it's not just that you're going to need cooperation with a compiler, but you're going to need quite different models for how compilers operate.
So I think the technical challenges there are not small.
There's a lot to get, but I think it's going to take a while probably to figure out how that all shakes out.
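One way to picture the persistent-build-server direction Ron mentions: keep a long-lived process whose expensive state stays in memory between requests, so a tiny change doesn't pay to reload and re-deserialize the world. The OCaml sketch below is purely illustrative and not modeled on any real tool's worker protocol; it collapses the whole idea into a single in-process value.

(* An illustrative sketch of the "persistent build server" idea: a long-lived
   process keeps expensive state (here, parse results) warm between requests,
   so a small request doesn't pay to rebuild that state each time. *)

type request =
  | File_changed of string * string   (* file name, new contents *)
  | Typecheck of string               (* file name *)

type server = { parses : (string, string) Hashtbl.t }

let create () = { parses = Hashtbl.create 16 }

let expensive_parse file contents =
  (* Stand-in for parsing work we would rather not repeat. *)
  Printf.sprintf "ast(%s, %d bytes)" file (String.length contents)

let handle server = function
  | File_changed (file, contents) ->
    (* Only the changed file is re-parsed; everything else stays cached. *)
    Hashtbl.replace server.parses file (expensive_parse file contents);
    "ok"
  | Typecheck file ->
    (match Hashtbl.find_opt server.parses file with
     | Some ast -> "typechecked using " ^ ast
     | None -> "error: unknown file " ^ file)

let () =
  let server = create () in
  List.iter
    (fun r -> print_endline (handle server r))
    [ File_changed ("foo.ml", "let x = 1");
      Typecheck "foo.ml";
      File_changed ("foo.ml", "let x = 2");
      Typecheck "foo.ml" ]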
Yeah, for integrating with a compiler, I think at some point we will reach the point where the build system becomes just another standard component in the compiler.
Like we have a parser, we have a type checker, we have code generation.
All of these components are fairly well known.
There are like dozens or hundreds of papers written about each of them.
But the build system is often like an afterthought.
And you add it just to make things work, not because it's designed to be there from the very start.
And I think someday this is going to change
and we will start treating build system
as just another key component of a compiler.
Yeah, although the painful part about that
is I worry it will drive things
even more to be language specific, right?
We understand there's lots of good theory around parsers,
but we don't have cross-language parsers.
Every language implements its own little parser.
And this maybe sadly pushes even more
in the direction of every language building its own little build system, which is sort of sad for the amount of
duplication of work, but also makes the story when you do want to do cross-language development
even more complicated. It's sort of ironic that for this whole class of systems, a big part of whose job is to avoid duplicating work, that goal is achieved through an enormous amount of duplicated work.
So let me go back to Build Systems à la carte. One of the things that really struck me about the paper when I first saw it is that in addition to describing things that everyone would agree are clearly build systems, like Make and Bazel and Shake, it also describes Excel.
And one doesn't traditionally think of Excel as a build system. So can you
say a few words about how you think Excel fits into this build system model?
Yeah. So Excel superficially might look very different, but it's also an incremental
computation engine, right? You have formulas in the cells, and these formulas depend on values in other cells. So if you change a value in one cell, you want to recompute
all values that depend on it, and hopefully efficiently without recomputing every cell,
because that would just take too much time.
And at many big companies, especially in the financial world, there are Excel spreadsheets that are so huge that you wouldn't be able to recompute everything from scratch.
So you need a lot of incrementality there.
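As a toy illustration of the analogy, here is an OCaml sketch of spreadsheet-style recalculation: each cell is either an input or a formula over other cells, and changing an input recomputes only the cells that transitively depend on it. This is only a sketch of the idea, not how Excel is implemented; a real engine tracks dependencies far more cleverly than the linear scan used here.

(* A toy spreadsheet: a cell is a constant input or a formula over other
   cells; when an input changes, we recompute only its transitive dependents. *)

type cell =
  | Input of float
  | Formula of string list * (float list -> float)  (* dependencies, combiner *)

(* Cells are listed so that each formula appears after its dependencies. *)
let sheet : (string * cell) list =
  [ "a1", Input 1.0;
    "a2", Input 2.0;
    "b1", Formula (["a1"; "a2"], fun xs -> List.fold_left (+.) 0.0 xs);
    "c1", Formula (["b1"], fun xs -> 10.0 *. List.hd xs) ]

let values : (string, float) Hashtbl.t = Hashtbl.create 16

(* Does [name]'s definition transitively depend on [changed]? *)
let rec depends_on name changed =
  match List.assoc name sheet with
  | Input _ -> false
  | Formula (deps, _) ->
    List.exists (fun d -> d = changed || depends_on d changed) deps

let compute name =
  match List.assoc name sheet with
  | Input v -> (match Hashtbl.find_opt values name with Some v' -> v' | None -> v)
  | Formula (deps, f) -> f (List.map (Hashtbl.find values) deps)

let full_recalc () =
  List.iter (fun (name, _) -> Hashtbl.replace values name (compute name)) sheet

let set_input name v =
  Hashtbl.replace values name v;
  (* Recompute only the cells downstream of the changed input. *)
  List.iter
    (fun (n, _) ->
       if depends_on n name then Hashtbl.replace values n (compute n))
    sheet

let () =
  full_recalc ();
  Printf.printf "c1 = %g\n" (Hashtbl.find values "c1");  (* 30 *)
  set_input "a2" 5.0;
  Printf.printf "c1 = %g\n" (Hashtbl.find values "c1")   (* 60 *)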
I should say that this for sure includes us.
We have a ton of massive spreadsheets
and recomputing everything from scratch would be a total non-starter.
It's an especially big deal because in a financial context, you often use spreadsheets as effectively
monitoring tools. So you have data that's streaming in, and your inputs are constantly changing.
And if you redo the entire computation every time, you're just going to lock your computer up.
It's not going to work at all.
So incrementality is really critical in the way that we use spreadsheets and in general,
the way that financial firms tend to use them.
So I guess the main difference is that Excel and other incremental computation systems like it don't operate on the level of files the way build systems do, and there is a lot of complexity in dealing with files and file systems and various architectures, which Excel doesn't need to think about. And of course, Excel has its own sources of complexity. For example, in Excel it's okay to describe cells that depend on each other in a circular way and tell Excel, okay, I know that there's a cycle here, please recompute this cycle up to 50 times until it converges to a fixed point. And this is something that typically build systems don't
need to deal with. So one of the interesting things about including Excel in that paper
is I feel like it opens a certain kind of Pandora's box because you've now said, well,
this thing, which isn't exactly a build system, we're going to kind of describe it in the same
language. But suddenly you're talking about a big and wild universe of incremental computation,
where there's a huge number of different kinds of systems that support incremental computation, lots of homegrown stuff, and lots of frameworks for making it easier
to construct incremental computations. So an example that we have used for a lot of things
internally at Jane Street, and in fact, have talked about and blogged about over the years,
is a library called Incremental, which lets you construct, using a language which is a little bit similar to the languages in things like Jenga and Shake, graph-structured
incremental computations. So that's kind of one direction you open up. And there are certain
kinds of things that those systems do and ways in which they operate, which are actually somewhat
different from how build systems tend to work. So that's one direction it opens up.
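For flavor, here is a heavily simplified OCaml sketch of that graph-structured style of incremental computation. It is not Incremental's actual API, and the names (var, set, map, observe) are purely illustrative: changing an input invalidates its transitive dependents, and reading an output recomputes only the invalidated part of the graph.

(* A much-simplified sketch of graph-structured incremental computation,
   in the spirit of, but not the API of, libraries like Incremental. *)

type 'a node = {
  mutable value : 'a option;                 (* None means "needs recompute" *)
  mutable compute : unit -> 'a;              (* how to (re)compute the node  *)
  mutable dependents : (unit -> unit) list;  (* invalidation callbacks       *)
}

let invalidate node =
  match node.value with
  | None -> ()  (* already invalid, so its dependents are too *)
  | Some _ ->
    node.value <- None;
    List.iter (fun k -> k ()) node.dependents

let var initial =
  { value = Some initial; compute = (fun () -> initial); dependents = [] }

let set node v =
  node.value <- Some v;
  node.compute <- (fun () -> v);
  List.iter (fun k -> k ()) node.dependents  (* invalidate everything downstream *)

let observe node =
  match node.value with
  | Some v -> v
  | None ->
    let v = node.compute () in
    node.value <- Some v;
    v

let map parent ~f =
  let child = { value = None; compute = (fun () -> f (observe parent)); dependents = [] } in
  parent.dependents <- (fun () -> invalidate child) :: parent.dependents;
  child

let map2 a b ~f =
  let child = { value = None; compute = (fun () -> f (observe a) (observe b)); dependents = [] } in
  let dirty () = invalidate child in
  a.dependents <- dirty :: a.dependents;
  b.dependents <- dirty :: b.dependents;
  child

let () =
  let x = var 1 and y = var 2 in
  let sum = map2 x y ~f:( + ) in
  let doubled = map sum ~f:(fun s -> 2 * s) in
  Printf.printf "%d\n" (observe doubled);  (* 6 *)
  set x 10;
  Printf.printf "%d\n" (observe doubled)   (* 24 *)

Replace the integers with files and commands and this is, in essence, the dependency graph a build system maintains.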
But another direction it opens up is there are other kinds of
incremental computation tasks that people have that don't look like builds.
One that we also run into at Jane Street is essentially scientific computation workloads,
where you have huge data sets and various stages of computations you want to do on those data sets.
Again, there are tasks, you'd like to have caching of the tasks so that if two people want the same computation, they don't have to redo it. You'd like to be able to dispatch
things in parallel. I'm wondering how you think about these kinds of distributed computation, you know, scientific workflows. By the way, we see this in a finance context, and it also shows up a lot in bioinformatics, where people are doing enormous multi-day genomic computations in similar ways. To what degree do you think this nestles in neatly with build systems? And to what degree is it really its own independent problem?
So I think it's easy to say that conceptually all these things are the same as we do in our paper.
But of course, there are also important practical differences that need
to be considered. If we talk about scientific computation pipelines, typically they operate
on huge datasets, as you say, and it's just not feasible to run them on a single machine, whereas
we typically run builds on a single machine. So typically, when you run a build, all your inputs are there, you just run a command and it produces the output on the same machine. This is just not
the case with many scientific computation workflows. You have one database here, another database maybe in a different city,
you need to somehow orchestrate the computation. It means you probably are not going to send a lot
of data between these two locations. You're probably going to exchange programs. These
programs are going to run in different locations. You're probably going to scale not in terms of
the jobs on a single processor,
but in terms of like how many hundreds of computers
you're going to use in the cloud.
And there's various questions like this
that you need to start answering,
which you don't need to answer
in the context of build systems.
And of course, there is a lot of complexity
because of that.
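To make the "ship the program to the data" idea concrete, here is a small, entirely hypothetical OCaml sketch of a pipeline description that carries location and size information, so an orchestrator can place each step next to its largest input rather than move the data. None of the names correspond to a real workflow system.

(* A hypothetical pipeline description with enough placement information to
   run each step where its biggest input already lives. *)

type location = NewYork | London | Cloud of string

type dataset = { name : string; where : location; size_gb : int }

type step = {
  step_name : string;
  inputs : dataset list;
  command : string;  (* the program, which is cheap to ship anywhere *)
}

(* Place each step next to its largest input: moving the program is cheap,
   moving the data is not. *)
let place step =
  match List.sort (fun a b -> compare b.size_gb a.size_gb) step.inputs with
  | biggest :: _ -> biggest.where
  | [] -> Cloud "any"

let string_of_location = function
  | NewYork -> "new-york"
  | London -> "london"
  | Cloud pool -> "cloud:" ^ pool

let () =
  let trades = { name = "trades"; where = NewYork; size_gb = 5000 } in
  let reference = { name = "reference"; where = London; size_gb = 2 } in
  let step =
    { step_name = "aggregate"; inputs = [ trades; reference ];
      command = "./aggregate.exe" }
  in
  Printf.printf "run %s (%s) in %s\n"
    step.step_name step.command (string_of_location (place step))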
Right.
And even when we do extend beyond a single machine
in the context of ordinary build systems,
you still don't have to think about data locality in the same way. You'll have some kind of distributed storage
system for build artifacts, which you can allow everyone to access uniformly without having to
migrate the build to be closer to the data. You can get away with this because there isn't that
much data, because builds, by and large, derive from files written by programmers, and that just
limits the amount of data that's going to be there in the first place. Yes, one other example of a difference that shows up in scientific computations is that
those computations are typically much more long-running, so they can be running for months,
which makes it very important to catch all possible errors up front. So if you have some
kind of a misconfiguration error, you ideally want this to be detected when the build or computation starts rather than
a few weeks later when you actually need to save that result on a disk that doesn't exist or
something like that. It's funny, it reminds me of the way we used to write programs. My father was a computer scientist, and he would tell me stories of how, when he was first learning to program (he was studying physics at the time and using it for physics simulations), he would get an hour a week to run his program.
So he would spend a long time thinking very hard about getting it exactly right.
And there, there was nothing to help him.
He just had to think hard about getting it right.
And so people got very good at writing programs very carefully and debugging them.
And then, you know, when your time slice arrived, well, then you had your hour and you had better
get it right, or you had to wait another week.
So in some sense, at the two ends of this conversation, we're talking about the kind
of most modern of modern development workflows where we try and get interactivity to be as
good as it can be and give information back to the developer in a fraction of a second
on one end.
And on the other end of the scientific computing end, we're talking about computations that
might take weeks and trying to optimize for those. And I guess it makes sense that the decisions that you make in
optimizing for one case are very different than what you make for the other.
Scientific computation is on one side of the range, where we have the longest-running jobs and the largest items of data; build systems are somewhere in the middle. And then we have these incremental in-memory computation frameworks, like the Incremental library that you mentioned, and they already operate on much faster timescales and on much lighter-weight inputs and outputs. So you have additions and multiplications that you might want to execute. And I feel like in this domain, what's also interesting is that you start to see various interesting incremental data structures, not just algorithms. You start coming across questions like: how do I represent an incremental map from keys to values, so that if I add a key to that map, I get the updated map back efficiently? How do I efficiently sort a list incrementally, so that if only a single input changes, I get the sorted list back efficiently? Questions like this rarely show up in build systems, but I think they show up pretty frequently in the world of incremental computation. That's right. With build
systems, you normally get down to some fairly coarse grained graph representation, and you don't
worry so much about incrementality within the computation of each individual node. But the
things we were talking about before,
in some sense, are pushing in that direction.
Once you want to keep a compute server around
and have it be faster by dint of remembering
what it did last time,
you're starting to operate in very similar spaces.
So you do see a lot of these ideas converge together
as you start trying to do better and better versions
of the problem and deliver more performant and more capable build systems.
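As a toy illustration of the incremental-data-structure questions Andrey raised, here is an OCaml sketch, not drawn from any particular library, of two of them: keeping an aggregate over a map up to date as keys are added, and maintaining a sorted list under single insertions. In both cases the work done is proportional to the change, not to the size of the whole collection.

(* Toy incremental data structures: each update adjusts cached state rather
   than recomputing it from scratch. *)

module StringMap = Map.Make (String)

(* An int map that keeps a running total of its values: adding or replacing a
   key adjusts the cached sum instead of re-summing every entry. *)
type summed_map = { map : int StringMap.t; total : int }

let empty = { map = StringMap.empty; total = 0 }

let add key value t =
  let previous = match StringMap.find_opt key t.map with Some v -> v | None -> 0 in
  { map = StringMap.add key value t.map;
    total = t.total - previous + value }

(* A sorted list maintained under single insertions: each insert walks to the
   right position instead of re-sorting the whole list. *)
let rec insert_sorted x = function
  | [] -> [ x ]
  | y :: rest when x <= y -> x :: y :: rest
  | y :: rest -> y :: insert_sorted x rest

let () =
  let m = empty |> add "a" 1 |> add "b" 2 |> add "a" 5 in
  Printf.printf "total = %d\n" m.total;  (* 7 *)
  let sorted = List.fold_left (fun acc x -> insert_sorted x acc) [] [ 3; 1; 2 ] in
  List.iter (Printf.printf "%d ") sorted;  (* 1 2 3 *)
  print_newline ()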
Anyway, thank you so much, Andrei.
This was a lot of fun, and I feel like I learned a lot.
Yeah, thanks. That was a lot of fun, I agree.
And I'll be tuning in to the other podcasts that you have in the pipeline.
You can find links to a bunch of the things we talked about,
including Andrei's papers on build systems,
as well as a full transcript of the episode at signalsandthreads.com.
Thanks for joining us and see you next week.