TLB Hit 💥 - Episode 4: t-r-a-c-/e̅‾\-o-m-p-i-l-e
Episode Date: May 6, 2022
Monitor. Compile. Bail. Repeat!
Transcript
Hey, Chris.
Hey, JF. How's it going?
Good. We haven't done a TLB hit in a while. What's up with that?
Yeah, it's season two. Let's pretend that that was totally intentional.
Right. It was totally intentional to start season two. And I'm coming to you from the
land of the rising sun. So, ohayou gozaimasu.
Very exciting. Very exciting. Ohayou gozaimasu. I can't say it.
All right. That's okay. Okay. So, standard disclaimer for our podcast. We only know so
much, but we'll try not to say things that are wrong, and we will follow up with errata
corrections or clarifications as we find out about them. We'll put them on the website,
tlbh.it. So we do this whole lifelong learner thing, and we're always trying to push the
boundaries of things we know and have conversations about what's interesting to develop and hone our
skills. So what are we talking about today, Chris? Well, today we're going to be talking about trace compilers and trace compilation.
Very exciting topic.
That sounds cool.
And what's cool, when we were trying to find an episode topic, is a lot of people we talk
to understand static compilation pretty well, but dynamic compilation is
kind of a bit of a mystery.
And so we thought we'd dive into one approach for dynamic compilation that's called trace compilation and try to understand what that means. So can you walk us through a
bit of the history of tracing, Chris? Yeah, there's kind of a really long and amazing
background in trace compilation. So some of the history that I'm familiar with goes back to
Joseph Fisher and VLIW machines, which VLIW stands for very long instruction word. The idea was that they
wanted to take a big control flow graph of operations and figure out how to schedule
stuff within the operations horizontally, like as big parallel words that all executed at the
same time. And the question was, how do you do that? And why do you approach it one particular
way? Like what of this big control flow graph
do you schedule to go happen horizontally
all at the same time?
And the idea was that we want to take ops
that would be sequential
and then present them all to the machine at the same time
so that we could get parallelism
because we thought a lot of the performance
would come from instruction level parallelism,
especially at the time.
So early academic systems started doing VLIW.
There was this thing called the ELI 512 at Yale. And then there's a very famous thesis called the
Bulldog Compiler Thesis, which has a lot of the ideas for VLIW compilation in it. And this led
to some commercial computers as well, the trace computer at a company called Multiflow. And
there's actually a book on that, which is a pretty decent book, which is cool.
Scheduling the control flow graph optimally had some relation to operations-research-style optimality, like job shop scheduling.
And that kind of brings up the classic topic of how do we do optimal scheduling?
How do we do optimal register allocation?
And then what would be the right phase order to do them in?
Or the holy grail: could you do both of them at the same time? And it's been an unbounded source of
PhD theses over time, I think. Yeah, what's cool, you're talking about like job shop scheduling,
and even the term just-in-time is a concept that was created in manufacturing by Toyota,
and then borrowed by the compiler space. So kind of a small world phenomenon here,
since I work at Toyota.
Yeah, what's interesting is that, like,
we talk about all this early history of trace compilation,
but eventually there's a lot of concepts
that were shared between something called
dynamic binary translation and trace compilation.
So DBT, dynamic binary translation,
was, you know, affected by things like Dynamo in 2000
and Dynamo Rio in 2002.
And then trace compilation was more
for dynamic language virtual machines.
So something like the Franz and Gal tracing work
used in TraceMonkey, right, in early Firefox.
And then LuaJIT is another popular one.
And then PyPy's flavor of meta-tracing.
There's a lot of interesting stuff there.
And there's also a lot of kind of static instrumentation
that's existed. So there's things like Pin on x86 and ATOM on Alpha that are kind of really
the original interesting things. ATOM came out in like 1994, from Alan Eustace,
who used to work at Google, kind of an interesting character. He jumped from the stratosphere and everything,
in 2014. So if you work on compilers, you can also, like, jump
from space after 20 years or something.
Yeah. And this also goes back to something that we had talked about, or at least touched on
previously, which is that they cached these micro-operation tracelets inside of the Pentium
4 architecture. So we're kind of going back to some things that we touched on before,
but didn't get to talk about in a lot of detail. Yeah. And one interesting thing about dynamic compilation compared to static one is you kind
of want to try to discover information, right? So how do you find things out about a program?
We've talked about that in previous episodes. So in static compilation, you look at the code,
and you're like, well, this is probably generally true, so I'll favor it, but there's a fallback
pattern, whatever. It's pretty rare in static compilation that you end up doing speculative optimizations
at a very large scale, right?
So you always have code that's correct
and you never have to undo any mistakes that you've made.
So you only really speculate on a few instructions
and then you do something like a C move,
like a conditional move or something like that.
Whereas with more dynamic approaches
like trace monitoring,
it's a way to dynamically figure out what the real code is doing while executing.
And so instead of looking statically and saying, well, we think the code is going to do that, but integers can hold a lot of values.
You look at the actual values going through the integers and whatever else, and you figure out, hey, this is what the real code is doing while executing.
But what's interesting when you do that dynamically, it's really workload dependent.
Dynamically, you only measure what you actually see, and so that can lead to suboptimal
choices if you make them kind of forever. If you record what the program's doing in real life and
then you keep that forever, you take it to the bank, then if your program has phases where it does some
kind of behavior for a while and then shifts to another behavior afterwards, then that
workload that you measured isn't representative anymore, right? So we can do a comparison of this
kind of dynamic tracing recording of information to profile guided optimization that folks use on
static compilation approaches. So if you're, you know, you do PGO, you're supposed to do that on
the entire program, and it's supposed to be representative, whereas when you're tracing,
you're doing that live, right? So when do you know, as you trace dynamically, that what you've
recorded is representative? Yeah, that's a great point. And there's also this inherent trade-off
here that's hiding beneath the surface of you can trace and just-in-time compile things very fast,
and then correct yourself later, or you can wait longer to get more
perfect information or more confidence in the information you've seen by observing the
program for a longer period of time.
So you got to trade these two things off.
And there's an interesting pipeline like nature to tracing and just in time compilation.
So if you can be super low overhead, then you could compile very quickly on things like
mobile devices.
You can make like a no stage compiler. You can make something where as the bytecode operations are executing, they are being kind of just splatted out as native machine code. That's
kind of a cool trade-off point in the space. So the question that pops up here is how hard are we going
to try to specialize for things that we saw? And what do we choose to bet on continuing to be true over time? And what does it cost for us to fall back and try something
different when our speculation is off? Again, we're tuning that speculation knob and how
aggressively we're going to try to speculate and what we try to speculate on. Right, right. So the
concept here is act on information that you saw. So stuff that you can do is things like
specialization and guard, right? So in a linear trace, for example, the number of assumptions that you make increases
as you execute. So for example, if you have like an always taken control flow, right? So
equivalently some virtual function call that's always calling the same virtual function,
then a tracing compiler is going to treat all of these back-to-back branches as just one set of
straight line code, right? Because it's always taken or you're always jumping to the same
function. It's kind of like straight line code when you execute it dynamically. And what it's
going to do dynamically is check on the trace if that was a bad assumption, right? So it'll say,
hey, I'm going to execute all this back-to-back and whenever convenient, I'm going to check if
any of those assumptions were wrong. And on the cold path, when it's
convenient again, it's going to undo everything if it turns out that all those branches weren't
true or something like that. That's usually called a bailout, right? And what's interesting
with the bailout, it's you're basically being surprised by new information that you didn't
expect, something that wasn't represented when you were tracing. So what you do is you trust
that everything you've traced is going to happen,
that it's all going to go well, and you're going to be able to optimize for it.
But then at runtime, when convenient, you're going to check and bail if you were wrong.
You're going to try to do that on the cold path and have the rollback on the cold path.
So the hot path, the one you're sure is going to execute, we're pretty sure, is really, really optimal.
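(To make the specialize-guard-bail idea concrete, here's a minimal Python sketch with made-up names: the "compiled" trace is just straight-line code behind a couple of cheap guards, and a failed guard bails out to the generic interpreter path on the cold side.)

```python
class Bailout(Exception):
    """A speculative assumption baked into the trace turned out to be wrong."""

def interpret_generic(obj, x):
    # Slow, fully general path: dynamic dispatch every time.
    return obj.compute(x) + 1

def trace_compute_plus_one(obj, x, expected_class):
    # "Compiled" trace: we observed that obj was always an expected_class and
    # x was always an int, so the body is straight-line code behind two guards.
    if obj.__class__ is not expected_class:
        raise Bailout("receiver class changed")      # guard 1
    if not isinstance(x, int):
        raise Bailout("operand was not an int")      # guard 2
    return expected_class.compute(obj, x) + 1        # devirtualized, no dispatch

def run(obj, x, expected_class):
    try:
        return trace_compute_plus_one(obj, x, expected_class)   # hot path
    except Bailout:
        return interpret_generic(obj, x)                         # cold path: fall back

class Point:
    def compute(self, x):
        return x * 2

print(run(Point(), 3, Point))   # 7, via the specialized trace
```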
Yeah, and I think that really captures the essence of what trace compilers are usually trying to go for here.
There's interesting implementation angles to it as well.
So one of the cool things about trace compilation
is that you can bolt a tracer into an existing interpreter.
So in a lot of language environments,
you'll have an interpreter that's executing ops at a time.
And then what you can do is you can hook in a tracer,
which observes what the interpreter is doing, and then kind of make some machine code on the side
based on how the interpreter ran. And what's also cool is you can implement just a subset of the
operations. Let's say that you had some very common bytecode operations inside of your environment,
and then some very uncommon ones. You could implement only the common ones, and then you
could just end your trace
whenever you hit one that you hadn't implemented
because it was uncommon.
So you could build up this tracing capability over time
because the system is built with this assumption
that you can bail out of the trace for whatever reason
and go back to interpreter land.
So you can imagine making a JIT
that just covered muls and adds
and could make a fused
multiply-add and specialize for, like, one particular common type.
So let's say you had many types in your language.
You could specialize it just for integers or just for floating point operations for numerical
code and then just bail if you saw any other types show up at runtime, say if you were
in a dynamic programming language.
And so what you need in order to do this
is some way to record what you assumed. And then your trace can say, oh no, I saw something that
I didn't assume and then call out to the runtime for help or go back to the interpreter mode.
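(Here's a toy sketch of that bolt-on pattern — not any real VM's API, every name is invented: an interpreter with an optional recorder that only knows how to log a few common ops, ends the trace when it hits anything else, and whose "compiled" traces guard on integer operands and bail back to interpretation otherwise.)

```python
# Toy stack interpreter with a tracer bolted on; real systems differ in all the details.
class Bailout(Exception):
    pass

ADD, MUL, PUSH, CALL = "ADD", "MUL", "PUSH", "CALL"

def interpret(code, env, recorder=None):
    """Plain interpreter. If a recorder list is passed in, the common ops we
    know how to compile get written into it; the trace ends at anything else."""
    stack = []
    for op, arg in code:
        if recorder is not None:
            if op in (PUSH, ADD, MUL):
                recorder.append((op, arg))
            else:
                recorder = None            # uncommon op: just stop recording here
        if op == PUSH:
            stack.append(env[arg])
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == CALL:
            arg(stack)                     # rare helper we never bothered to compile
    return stack

def compile_trace(trace):
    """Pretend-compile the recorded ops into a closure specialized for ints."""
    def fast_path(env):
        stack = []
        for op, arg in trace:
            if op == PUSH:
                v = env[arg]
                if not isinstance(v, int):
                    raise Bailout("unexpected type")   # guard
                stack.append(v)
            elif op == ADD:
                b, a = stack.pop(), stack.pop(); stack.append(a + b)
            elif op == MUL:
                b, a = stack.pop(), stack.pop(); stack.append(a * b)
        return stack
    return fast_path

code = [(PUSH, "x"), (PUSH, "y"), (MUL, None), (PUSH, "z"), (ADD, None)]
recorded = []
interpret(code, {"x": 2, "y": 3, "z": 4}, recorded)    # warm-up run, recording
fast = compile_trace(recorded)
print(fast({"x": 5, "y": 6, "z": 7}))                  # hot path: [37]
try:
    fast({"x": 5.5, "y": 6, "z": 7})
except Bailout:
    print(interpret(code, {"x": 5.5, "y": 6, "z": 7})) # bail back to the interpreter
```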
Right. And what the trace invariants do is they say when traces can call to other traces. So let's say
I recorded some set of ops and I recorded some
set of other ops, and now I'm jumping from the first one to the second one. Now I need to make
sure that the assumptions between those two things are lining up, right? And so this is called trace
fusion or building bridges between traces. And you can do this if you know the invariants, what must
be true on the exit from the first trace and the entry to the second trace are lining up, are compatible with each other. And that's useful to know because traces can
represent the state of the program very differently. So if you have different optimization
levels, you might be holding things in different register and stack locations where one puts
everything on the stack and the other one is assuming things come in in register form and is
aligning stack slots.
And in that case, transitioning from one to another requires mapping the one view of the program, all register-based, into the other one that's stack-based, for example.
You can also do stuff like burn global properties.
So things that were like static values that were, you know, kind of global state.
You could burn them into the trace. But then you also need
to mark the traces invalid if those global values were changed at some point, right? You need to be
able to say, oh, I'm freezing on this global value being equal to 42. Someone changed it to 43. Let
me go say, okay, that trace is no longer valid. Right? Yeah. And what's interesting when we talk
about these things is there's a lot of terms in dynamic compilation that people might not be familiar with.
So let's go through kind of a big collection of all those concepts in space and try to
give you a picture of the whole elephant.
So one interesting concept is method inlining.
So in a trace compiler, you just kind of execute through methods, right?
You just go through them.
You don't really care that it's a function or not or a method.
And so inlining is kind of a natural path of compilation when you do trace compilation.
It's just linear execution where the jumps disappear.
That's one interesting aspect of trace compilation, that you do method inlining just by construction.
Another interesting concept is called tail duplication.
So as you run through different execution paths, the longer that path is, the more potential avenues that you're forming.
It's like a road and there's all these avenues you're taking down that road as you dynamically execute things.
And so what you end up doing as you do that dynamic execution is you duplicate what's called the tail, right?
The end path of each trace, right?
Because imagine you're tracing through code, different traces will duplicate that tail.
You're going to have multiple copies of it between all these linear traces.
One clear example is if you have a condition and then a common return path, you would form
one trace for one side and another trace for the other side of that condition.
But they shared the return path.
And so it's duplicated in both.
And that's called tail duplication.
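(A tiny illustration of that shape, with made-up opcode spellings: two traces through an if/else, each carrying its own copy of the shared tail.)

```python
# Source-level shape being traced (opcode spellings are invented):
#   if cond: r = a + 1
#   else:    r = a * 2
#   return r + 10          # <- common tail
#
# Tracing linearizes each path it actually observes, so the shared tail
# ("ADD r, r, 10; RETURN r") gets its own copy in both traces:
trace_when_cond_true  = ["GUARD cond is True",  "ADD r, a, 1", "ADD r, r, 10", "RETURN r"]
trace_when_cond_false = ["GUARD cond is False", "MUL r, a, 2", "ADD r, r, 10", "RETURN r"]

# With n independent conditions ahead of a common tail, you can end up with
# up to 2**n linear traces, each duplicating that tail.
n = 4
print(2 ** n, "possible traces for", n, "independent conditions")
```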
Now, what's interesting with that is it can be good because it encodes the provenance as a source of optimization.
So the common path might execute differently
based on where you came from.
But at the same time, what it does is you're duplicating code,
potentially exponentially when you do that,
and it can cause some exponential explosion in trace compilation
because conditions can stack up,
quickly stacking up to a two-to-the-n blow-up,
which is kind of a problem. So you have to find a balance between that. Right. And I think a lot of what we're
discussing here is what happens when you kind of strictly linearly follow where the interpreter
goes, right? So the interpreter executes at a PC, and then it executes at the next PC. When you do
a call, you're kind of changing from one function's program counter to the next function's program counter. This is kind of within a spectrum that we think of as region forming. So a linear region
is one particular region you can form where it just kind of marches through the program order
according to some program counter value. Now there's trace formation, where it's that linear
marching through the program. Then there's region formation, which says, okay, I see a conditional, maybe I keep both sides of that conditional and join them back up instead of picking just one path.
So you can see how it's a spectrum of chopping things out from the original source program in a sense, or maybe inlining plus chopping things out because tracing, like you said, can jump between different functions and, like let's say one trace could jump into another
in its linear sequence, then that would form a region instead of a linear trace, right? Because
a trace is kind of straight line. If they join together at some point, we'd consider that to be
kind of region formation instead. If you jit it only at method boundaries, then you would have
a method jit. That's, like, the term of art for something that takes a method, maybe the
methods it calls to,
maybe inlines them,
but that's like a trivially defined kind of region.
And then kind of for the hybrid approach,
you could imagine forming a trace,
observing what the interpreter did,
forming the little line,
and then trying to merge the lines of execution together,
maybe with join points
and chopping out the paths
we didn't actually take from the method. So, like, a conditional that only ever went one way in the n times we observed
it, and then we'll treat the other side of that conditional as dead code. This is kind of
this way to think about the spectrum of region formation. There's also this notion of packing
as much known hot code together, which can be more challenging when you're discovering things on the
fly. Like you said, with tail duplication, oh, I saw we went down this path. Let me generate new code
for that. We maybe touched on in the past systems like Bolt that are trying to pack together code
into the closest regions for locality in instruction memory and having them be prefetchable
so that you get instruction cache TLB hits, right? But then at the end of the
day, if we have to spread all this code out because we're discovering it on the fly, that can be a
penalty for our instruction cache locality. Right. And what's interesting here is the idea
of tracing through something like a switch, like a C or C++ level switch statement. You can imagine
only taking one out of N of the paths in that switch, right? That can become pretty rough when you have a pretty large N,
when you have a big switch like an interpreter itself.
If you were trying to trace through an interpreter,
the interpreter tends to have a lot of opcodes.
And that can be rough if you just trace one over N.
So an important concept related to trace formation
is when do you trigger compilation, right?
You were talking about kind of fast and JIT quickly
or record for a long time and then JIT. So when to trigger is really, really important.
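(A minimal sketch of one common trigger policy, with an arbitrary threshold: count back-edges per loop header in the interpreter, and kick off trace recording once a loop looks hot. Picking that number well is exactly the workload-dependent part being described here.)

```python
# Count back-edges per loop header; start recording once a loop looks hot.
HOT_LOOP_THRESHOLD = 50   # arbitrary "n back edges before we bother compiling"

backedge_counts = {}      # loop-header pc -> times we've jumped back to it
compiled_loops = {}       # loop-header pc -> compiled trace (a stand-in string here)

def on_backedge(header_pc, record_trace):
    backedge_counts[header_pc] = backedge_counts.get(header_pc, 0) + 1
    if header_pc not in compiled_loops and backedge_counts[header_pc] >= HOT_LOOP_THRESHOLD:
        compiled_loops[header_pc] = record_trace(header_pc)   # start tracing this loop

# Pretend pc 7 is a loop header the interpreter keeps jumping back to.
for _ in range(60):
    on_backedge(7, record_trace=lambda pc: f"<trace for loop @ pc {pc}>")
print(compiled_loops)     # {7: '<trace for loop @ pc 7>'}
```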
And what's interesting is when you look at a variety of papers, they all have different
policies on what they found to make a lot of sense. And honestly, it ends up being very
workload dependent when you should actually compile. So, you know, if you're doing dynamic
binary translation machines, something like Transmeta, then, you know, finding a universal solution is
not obvious, right? But it might be a different solution from when you have a JIT for a very
specific workload, and you can find optimal parameters for your workload, right? So imagine
you have a generic machine that does trace compilation, it's supposed to run everything,
right? So when does it trigger compilation is, you know, really depends on what that everything is, what it is that user actually
does with the machine. Right. And I think we've talked about it in the past, but dynamic binary
translation is a very difficult problem, because you're pretending to be a particular architecture
or kind of machine, and you have to generate new assembly. And you have to do that very quickly,
if you want kind of machine-level responsiveness, right? So that's maybe where it gets the most tricky, versus something like a Java virtual machine or more of a language-level virtual machine. Folks who worked on that kind of dynamic binary translation system told us that the when-to-trigger had to be pretty early for them, because they didn't know, for example, when they were jitting, say, the mouse driver code.
And if they didn't jit fast enough, then the mouse wasn't responsive on the screen.
It was kind of janky.
So they had to jit pretty quickly, and then they didn't know what else was being interpreted in their machine.
And so they kind of jitted everything pretty quickly.
And that meant they couldn't have a lot of tracing, couldn't optimize all that much.
All right. So another thing that's interesting here is how do we gather type and control
information? So in some systems, instead of having the code compiled or jitted directly,
and then rejit over time, you might just start with something like an interpreter,
right? If you're not sure, hey, I want to start executing, but I don't want to jit everything, then just, you know, interpret the values that are coming in,
and then start tracing through that. That's one way to get started.
Yeah, like if your operations are super generic, so for example, getting a property
in Python, for example, can visit like a ton of different objects and do a ton of different things.
And so in order to specialize it, you probably want to see what it does the first time. Once you have confidence, then you might JIT for it. So this is a concept
called monitoring, right? So we look at what happens when we're running an interpreter mode.
So we go into a monitoring mode where we start recording things that happened into a side buffer,
or you can JIT on the fly. But you know, if you want to make sure that when you JIT compile
something into native code, that you did something that probably will be reusable over time, you would wait longer as
we've been discussing. So monitoring mode, it slows you down a lot because you're recording
stuff into a side buffer, but you're recording all this information so that you can make better
choices and then get return on your investment of writing this stuff down. There's also this idea of
type-based stubs. So like in a dynamic language, you don't know necessarily what the type of objects are.
You just know it is an object. So stubs are things that can record the type that was seen
in line inside of the bytecode potentially, or on the side as part of the baseline emitted
JIT compiled code. So there's a bunch of approaches, but this idea that, okay, the first
times that it's running through, I'm just scribbling stuff down so that later I can JIT compile it. That's a pretty common approach. And nothing stops you from having on disk info about what happened in previous executions that you could mix with your new runtime observations. But often in practice, we just expect that the JIT will discover that stuff over time. We just kind of toss it all away, load it back up and have the interpreter kind of re-derive it from scratch.
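(A sketch of the monitoring idea — the table layout and threshold here are invented for illustration: each bytecode site scribbles the types it observed into a side table, and the JIT only commits to a specialization once one type clearly dominates.)

```python
from collections import Counter

site_types = {}     # (function name, bytecode index) -> Counter of observed type names
CONFIDENCE = 10     # how many consistent observations before we believe it

def record_type(site, value):
    site_types.setdefault(site, Counter())[type(value).__name__] += 1

def specialization_for(site):
    seen = site_types.get(site)
    if not seen:
        return None
    ty, count = seen.most_common(1)[0]
    # Only worth specializing if the site looks monomorphic and hot enough.
    return ty if count >= CONFIDENCE and count == sum(seen.values()) else None

for v in [3] * 12:
    record_type(("f", 4), v)          # monitoring mode: just write it down
print(specialization_for(("f", 4)))   # 'int': safe to emit an int-specialized stub
record_type(("f", 4), "oops")
print(specialization_for(("f", 4)))   # None: the site went polymorphic, stay generic
```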
There's also the concept of choosing how aggressively to specialize traces for the
values we observed. You touched on this a little bit before, but what we can do is we could just
make like little blobs of assembly, and then we could spit out little function call pointers that
call to these different blobs of assembly all in a row.
This is called a call threaded interpreter.
You kind of build up these ops and these ops probably have some convention for passing
stuff to each other and registers.
And then you just kind of call them in a big sequence.
And what's neat about that is you might not even need to generate native code.
You could just put a bunch of function call pointers into a big array and then call them
in sequence that they can talk to each other.
But then you'd actually need to form the calling conventions
between these little blobs of code
that you form in this pattern.
And you can form them offline based on, for example,
how you see them being used.
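(Here's the call-threaded shape sketched in Python — real systems do this with native function pointers and a register-level convention between the blobs, but the structure is the same: an array of little op routines called in a row, with no dispatch switch in between.)

```python
def op_push(const):
    def run(stack):
        stack.append(const)
    return run

def op_add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def op_mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

# "Threaded code" for (2 + 3) * 4: a flat sequence of callables, no opcode decoding.
threaded = [op_push(2), op_push(3), op_add, op_push(4), op_mul]

stack = []
for op in threaded:
    op(stack)       # straight-line sequence of calls
print(stack)        # [20]
```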
But then there's fundamental questions here like,
okay, I saw an integer four when I was monitoring.
I saw it maybe three times.
Do I want to assume that that
value is always four or is it just an integer that happened to be the number four these past
three times? It's kind of simple. If it's a parameter, you can think about, do I want to
specialize on the particular value for the function parameter that I saw? Or how many times is it
going to be before I'm convinced? No, I'm pretty sure they're always passing four in that parameter. Yeah. And when we talk about this with folks and you say like,
oh, you know, we can specialize on integer number four showing up. It seems like kind of a far
fetched thing. Like why would you assume this integer, which can hold 32 bits of
value or something, would always be four? Well, imagine you're tracing an interpreter and that's
the running program, like kind of inception.
Think of it here like you have, say, Python or whatever, like some dynamic language that is interpreted, that you're running inside a trace compiler.
So, again, inception here.
And the numbers in that interpreter determine things like control flow and constant propagation, right?
So, you know, tracing an interpreter with an interpreter and things like that, really inception.
But what you might notice is that four is a really important bytecode number.
And that happens pretty often when you look at actual programs where,
yeah, sure, the integers can span any value,
but a lot of them are just used to determine control and things like that,
or determine particular avenues the code's going to take.
So how many times do you want to see the number four before you believe that it's always going to be four?
What do you do there?
So that's an interesting thing to do with trace compilation.
So another interesting terminology you kind of often hear is the term splat jitting.
And that comes to what you were talking about earlier.
It's a bit like what QEMU does.
So QEMU, the virtual machine emulator type of thing.
So it does kind of jitting basic block by basic block, right?
So no jumps in between.
And it generates those blocks dynamically, but it kind of splats them. They're kind of
flat one after the other. And that's why it's usually called a splat jit. And that's opposed
to something that works harder to optimize bigger regions before code generation. So trying to join
multiple basic blocks together. Right. And especially in a jit compiler environment
where you have different tiers
at the splat JIT tier, you might measure how many cycles does it take for me to compile
this many ops, right? Like you really want it to be as fast as possible. So you're just blitting
stuff into a buffer that you can use as your JIT compiled program. So it's avoiding the interpreter
dispatch overhead, right? If you have a big switch before every op you implement,
there's an indirect branch overhead that's very unpredictable, and
that's going to cost you something.
And also managing the protocol for, you know,
passing things between these different bytecode operations.
These are the things that usually cost you
that you want to try to avoid as early as possible
because they'll definitely slow you down.
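(A splat-JIT-flavored sketch, with Python source generation standing in for native code emission: the whole recorded op sequence becomes one straight-line function, so no per-op dispatch or argument-passing protocol survives to run time.)

```python
def splat_compile(ops):
    lines = ["def jitted(env):", "    stack = []"]
    for op, arg in ops:
        if op == "PUSH":
            lines.append(f"    stack.append(env[{arg!r}])")
        elif op == "ADD":
            lines.append("    b = stack.pop(); a = stack.pop(); stack.append(a + b)")
        elif op == "MUL":
            lines.append("    b = stack.pop(); a = stack.pop(); stack.append(a * b)")
    lines.append("    return stack[-1]")
    namespace = {}
    exec("\n".join(lines), namespace)    # "splat" the whole sequence out in one go
    return namespace["jitted"]

ops = [("PUSH", "x"), ("PUSH", "y"), ("ADD", None), ("PUSH", "z"), ("MUL", None)]
fast = splat_compile(ops)
print(fast({"x": 1, "y": 2, "z": 10}))   # (1 + 2) * 10 = 30
```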
Another thing you can do is chop bigger traces.
We kind of talked about tracing through function calls.
So you can chop bigger linear traces up into tracelets.
You can basically think about tracing basic blocks,
which I think QEMU actually does.
And that means that you can revisit
your assumptions made for the basic block
and you pay a smaller cost
because you're compiling a smaller amount of stuff, right? You're saying, okay, I only have this many instructions that I'm doing my compilation for.
It's not that expensive to redo compilation for a handful of instructions. So I'm going to chop
my traces into smaller pieces so that I can revisit my assumptions more quickly and easily.
All right. So I think we also want to throw in some of the scary concepts here
because trace compilers are very cool,
but they also have some difficulties in their implementation. So there was this concept that I
thought was really neat of re-entering the virtual machine when you're in trace. Okay, so you're
executing in your assembly code that's trace compiled, but then you need to call to the runtime
for help. But then the runtime itself needs to call some trace compiled code.
You can imagine a higher order function or something where you say, well, I'm executing
a map.
I want to call this other function.
This function is not compiled yet.
Please help me call it.
And then you have this recursion between code that you compiled, then the runtime that's
helping you, and then code that you compiled.
And you kind of stack these ad infinitum or ad nauseam. And then something goes wrong down way at the
bottom. And TraceMonkey, I think they called these deep bailouts, right? And so what happens is you
have to figure out how to unwind through all those various layers and get back to a safe point where
you know, okay, this is the set of assumptions that's
okay to pass back to the interpreter at this time, even though I went through the runtime and back
into native code and through the runtime. So I think that there's definitely some element to this,
which is complexity that's also present in method JITs. But I think traces are even more
interesting because they can kind of just inline multiple methods calls and burn in more assumptions over time.
And so some of that gets exposed to you more in higher volume when you're building a trace compilation system.
Another scary thing is when you trace compile loops.
So imagine you see a tail call and you wanted to basically optimize the tail call so it ran like a loop.
Now you might have to create
like a thousand stack frames when something goes wrong in the thousand and first iteration.
So reifying all the stack frames can be scary. There's also the fundamental thing of when you
start trace recording, you're assuming that your trace will probably go back to a loop header.
And if it never does, then you just get trace divergence where you ran in monitoring mode for
a really long time, and then you never actually got to reuse the trace recording that you made.
Yeah, cool.
And what about when you record a bunch of decisions and they get invalidated? How do you make sure you don't overfit for those decisions again?
Well, you've learned a mistake.
You're like, hey, I traced this and then it never happened, or it keeps changing, has multiple modes.
You learn something that you don't want to do. But how long do you keep track of these mistakes, these things you don't want to do, right? So how precise do you want to be as well?
That's a really tricky thing, right? Because you don't want to, if you have a bunch of mistakes,
you're like, don't make that mistake, don't make this one, don't make that one, but you don't want
to be using too much memory to keep track of them, right? So how long do you keep track of them and
how much memory you use is pretty tricky.
So when you and I worked on the dynamic binary translator,
we used 64 bits to represent this information, right?
So we said like, well, in this region of code,
like these are the types of problems that occur, right?
Like we see a lot of concurrency happening
or we see a lot of that type of floating point math
that's kind of tricky to optimize.
And we used 64 bits, and each of those bits
represented a mistake not to make or a particular type of speculation not to make in
that region of code generally. And then you would blow away this list of mistakes from time to time,
because, you know, some programs have phases, right? If you're running a whole machine,
maybe the program's doing something completely different. So you just kind of periodically blow
that information away. Right. And then you have to feel confident that you're going to re-derive it quickly enough that the mistakes are not going to be
biting you if you actually throw that information away. Right. Yeah. I think that also relates to
there's some scariness in having a really long tail of workloads, like the corpus of the whole
web for TraceMonkey as an example. There's kind of an assumption in trace compilers that maybe
most of the time is spent inside of inner loops. And then you kind of run into the tyranny of flat
profiles, right? Like if things actually have flat profiles, then you might not actually get to focus
on the key inner loops that are creating most of the time running in the program. So you kind of
have this question of, I start tracing after I see this loop do n back edges,
n returns back to the top of the loop.
What do you set that n counter to be for the general world?
And there was a lot of fun related to TraceMonkey and JaegerMonkey around what to set these
kind of heuristic values to for the broader web and for benchmark sets.
I think it also ties into things like inlining heuristics and things like this, where in
compiler land, we have some basic premises and some benchmarking that we have to do in order to
make things look good on benchmarks. But then fundamentally, it's difficult to find a
truly perfect number for things like inlining heuristics. But it also dovetails with things
like user expectations for performance properties. Like if I'm a user, I want to understand what this compiler system did for me and why. And with tracing, I think
it's been my experience that some people find it's harder to debug. It's harder to have a mental
model for what this underlying system will do and why versus method jitting where maybe methods are
more clearly outlined. This is maybe one of the biggest
downsides relative to a more traditional method jitting approach. But in classic fashion, when
trace compilation works well, it's like magic, right? Because it's specializing perfectly for
your key inner loops and speeding up most of the program with minimal effort and minimal
compilation time. Kind of amazing. All right, so episode zero was on stacks, right? So we got to go back to stacks just a little bit here. So there's the native stack. So that's running my C code.
And then sometimes virtual machines will keep a managed stack. And sometimes you can smash the
two together, right? Like you can keep track of your Java machine or your JavaScript machine or
your Python virtual machine stack values on the native stack, unifying the two.
Yeah. And I guess speaking of stacks, there's another thing that we can talk about when we
talk about traces called on-stack replacement. That would be a whole episode unto itself,
right? But the basic idea is you jump into JIT code at a loop entry point when an existing
frame is already present. So that's a kind of huge topic on its own.
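(A very rough sketch of the on-stack replacement idea, with a completely hypothetical structure: the interpreter is partway through a hot loop with live locals, and it enters compiled code at the loop header by handing that state over instead of starting from scratch.)

```python
def compiled_loop_body(i, total, n):
    # Stand-in for native code specialized for "while i < n: total += i; i += 1".
    while i < n:
        total += i
        i += 1
    return total

def interpret_with_osr(n, osr_threshold=100):
    i, total = 0, 0                  # live interpreter frame state
    iterations = 0
    while i < n:                     # interpreted loop
        if iterations == osr_threshold:
            # The loop got hot mid-execution: package the live frame state and
            # jump into the compiled code at the loop header, mid-loop.
            return compiled_loop_body(i, total, n)
        total += i
        i += 1
        iterations += 1
    return total

print(interpret_with_osr(1000))      # 499500, finished inside the "compiled" code
```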
Right. And tracers are kind of saying,
okay, at this particular bytecode, I'm going to start doing my bytecode execution with this chunk
of native code I made for the loop body over here. So with tracers, on-stack replacement is maybe
slightly less challenging than it is with method JIT compilation, because tracers naturally have
this kind of on-stack replacement entry point.
But then there's also on stack invalidation, which is, okay, I'm down in some lower level frame
and my JIT code just became invalid because some assumption I made became invalid. That means I have
to nuke that JIT code because it said, you know, the static value is always 42 and it became 43,
but two stack frames above, I have some JIT code that's assuming it's 42.
I have to invalidate that. And that remains as difficult in trace compilation mode. Even when
you're tracing into inline calls, you have to say, okay, I'm going to create two stack frames,
multiple stack frames, and then I have to invalidate everything appropriately.
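(Tying the burned-in-global idea from earlier to this invalidation discussion, here's a minimal sketch with invented names: traces that constant-folded a global get marked invalid when that global changes, and callers fall back to the generic path until a fresh trace exists.)

```python
globals_table = {"answer": 42}
dependent_traces = {"answer": []}    # global name -> traces that burned it in

def compile_trace_using(name):
    frozen = globals_table[name]     # constant-fold the current value into the trace
    trace = {"valid": True, "code": lambda x: x + frozen}
    dependent_traces[name].append(trace)
    return trace

def set_global(name, value):
    globals_table[name] = value
    for trace in dependent_traces[name]:
        trace["valid"] = False       # every trace that froze the old value dies
    dependent_traces[name].clear()

def run(trace, x, slow_path):
    if trace["valid"]:
        return trace["code"](x)      # fast path with the burned-in constant
    return slow_path(x)              # invalidated: back to the generic path

t = compile_trace_using("answer")
print(run(t, 1, slow_path=lambda x: x + globals_table["answer"]))   # 43, fast path
set_global("answer", 43)
print(run(t, 1, slow_path=lambda x: x + globals_table["answer"]))   # 44, slow path
```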
And we talked a bit about this notion of bolting the tracer on. So you can grab
information from the guts of the interpreter that's not easily possible to re-derive. So things
related to, say, the heap state at a particular time. There was this funny thing where TraceMonkey
had if-defs to compile the same interpreter in three different ways. And all those versions were
compiled into Firefox in a particular way. Yeah, I remember that code was a hairy pile of code.
That was hard to read through.
Yeah, it was always a little tricky.
All right, so quick rundown of recap of kind of the W questions
that are involved with trace compilation.
We're always wondering what's going to get compiled?
What causes JIT compilation to start, continue, and finish?
When do you trigger the monitoring of bytecode execution that you'll trace, for example?
When do you choose to compile after gathering a bunch of data, but how much, how long?
What do you retain long-term if there's phasic program behaviors?
What code or trace or information about mistakes you made do you retain across, say, a global garbage
collection phase? And how do you decide when compilation is going to pay off or be interesting
or be stable? Like what's the back edge trip count? I went back to this bytecode as part of
this loop. How many times do I need to do that before I make a decision and how that shows up
in benchmarks? Yeah. And so the aspects in those W questions are really two things here. There's
a question of performance versus user expectations, right? And we're always trying to balance those.
And one thing we didn't mention at all in this whole discussion is security and maintainability,
which are huge issues when you talk about these types of programs, right? Like you want them to
be very secure in most cases. And then, you know, maintainability is pretty tricky as you start adding heuristics and
learn new things about how you should optimize.
It gets pretty hard to maintain things if you're not careful.
Yeah.
And I think we've touched on the fact there's this spectrum from doing method jitting with
chopped out portions of the control graph to region formation and then linear traces
that go through multiple methods, right?
So now we've kind of touched on each point in that spectrum.
Yeah.
And so what's the essential difference versus method jitting?
Well, there's a loop focus versus a straight-line code focus, right?
So focus on chopping unused paths to maximize early assumptions.
Right.
And then there's the resulting fact that if things are stable
and you're friendly to the initial assumptions that you made, then you can save a bunch of useless work or recompilation work,
which is maybe one of the classic trade-offs in aggressive speculation.
Yeah. And so there's probably a bunch of other interesting stuff to look into that we don't
have time to talk about today. So, you know, LuaJIT is interesting. So Slava has this really
interesting talk on it, which is kind of great, right? There's the notion of code bases that are super small, such as LuaJIT, and really compact
and elegant in a way.
But when you get into it, it's really hard to ramp up, right?
You kind of have to be a super genius with a lot of hours to ramp on it because it doesn't
have the traditional modularity that you'd expect out of more kind of enterprise-scale
code bases.
Yeah, when I think about like maximum cleverness,
I often think about LuaJIT.
We know cleverness can be a double edged sword,
but if you love to nerd out, it's also hyper impressive.
And the resulting performance is often really nice as well.
I remember some cool things about LuaJIT,
like a custom application binary interface
between the interpreter ops that was created
to minimize the amount of register shuffling around that was required. I think actually interpreter loops are
one of these examples where writing it in handcrafted assembly can make sense because
traditional register allocation can have a harder time ascribing meaning to registers when there's
this giant soup of code and operations inside of this giant interpreter loop switch. So there are
some cases where register allocation, even though it's a modern marvel, sometimes we can still do better
by hand. Yeah. And so another kind of cool thing to look into is PyPy, right? So tracing at the meta
level type of thing.
language, and it can be fairly directly lowered to lower level primitives. But what's cool is
being able to grab the representation of the program to form fragments of a broader trace,
right? So you can annotate an interpreter to say where traces can be formed and so on.
Yeah. And I think this is kind of a holy grail in a sense, because the objects that you use to
build your interpreter and the objects that you're hosting inside of your interpreter can be unified
in their representation, right? So any implementation level objects are sharing a unified GC
infrastructure with the implementation. So there's always this underlying question of how do you get
reuse between these execution modes, between the interpreter, maybe a baseline JIT, maybe a
optimized JIT. Conceptually, they all have similar semantics in like the shared
runtime model. And there's this idea that you'd have like a copy paster that just took your C++
interpreter code and then turned it into like a splatted JIT mechanism. Sometimes we use self
hosted built-ins for native primitives, like Array.indexOf, which we might define in JavaScript,
even though it's supposed to be a native primitive. And this gets towards the idea of like meta circular interpreters. And there's cool systems
that got to play with like Narcissus at Mozilla. And there's cool things I've heard about Maxine
Java virtual machine, which is Java virtual machine implemented in Java. There's other
cool systems out there that we also don't have time to touch on in detail. Like there's the HipHop
virtual machine, HHVM, which is a tracelet-based jitting machine.
And it did some aggressive specializations
for things that they really cared about in their workload.
And that kind of goes back to our discussion
on basic block versus region formation
versus method jitting.
So it's a very rich and cool space out there.
All right, cool.
Well, that was a cool chat.
I'm glad we got to do another podcast again.
And let's try to make sure we don't take too many months to record the next podcast.
Yeah, hopefully everyone likes this discussion of the basics of trace compilation and the
trade-offs involved there. We'd be happy to hear your questions, comments, and errata on Twitter.
Yeah, TLBHIT is our Twitter handle and TLBH.it is the website.
Great talking to you again, JF.
Yeah, same here.
Later, my friend.
See you later.