TLB Hit 💥 - Episode 4: t-r-a-c-/e̅‾\-o-m-p-i-l-e

Episode Date: May 6, 2022

Monitor. Compile. Bail. Repeat!...

Transcript
Starting point is 00:00:00 Hey, Chris. Hey, JF. How's it going? Good. We haven't done a TLB hit in a while. What's up with that? Yeah, it's season two. Let's pretend that that was totally intentional. Right. It was totally intentional to start season two. And I'm coming to you from the land of the rising sun. So, ohayou gozaimasu. Very exciting. Very exciting. Ohayou gozaimasu. I can't say it. All right. That's okay. Okay. So, standard disclaimer for our podcast. We only know so
Starting point is 00:00:25 much, but we'll try not to say things that are wrong, and we will follow up with errata corrections or clarifications as we found out about them. We'll put them on the website, tlbh.it. So we do this whole lifelong learner thing, and we're always trying to push the boundaries of things we know and have conversations about what's interesting to develop and hone our skills. So what are we talking about today, Chris? Well, today we're going to be talking about trace compilers and trace compilation. Very exciting topic. That sounds cool. And what's cool when we were trying to find an episode topic is a lot of people we talk
Starting point is 00:00:56 about, they understand static compilation pretty well, but dynamic compilation, it's kind of a bit of mystery. And so we thought we'd dive into one approach for dynamic compilation that's called trace compilation and try to understand what that means. So can you walk us through a bit of the history of tracing, Chris? Yeah, there's kind of a really long and amazing background in trace compilation. So some of the history that I'm familiar with goes back to Joseph Fisher and VLIW machines, which VLIW stands for very long instruction word. The idea was that they wanted to take a big control flow graph of operations and figure out how to schedule stuff within the operations horizontally, like as big parallel words that all executed at the
Starting point is 00:01:37 same time. And the question was, how do you do that? And why do you approach it one particular way? Like what of this big control flow graph do you schedule to go happen horizontally all at the same time? And the idea was that we want to take ops that would be sequential and then present them all to the machine at the same time so that we could get parallelism
Starting point is 00:01:57 because we thought a lot of the performance would come from instruction level parallelism, especially at the time. So early academic systems started doing VLIW. There was this thing called the ELI 512 at Yale. And then there's a very famous thesis called the Bulldog Compiler Thesis, which has a lot of the ideas for VLIW compilation in it. And this led to some commercial computers as well, the trace computer at a company called Multiflow. And there's actually a book on that, which is a pretty decent book, which is cool.
Starting point is 00:02:31 The optimality of scheduling the control flow graph had some relation to optimality like operations research optimality and job shop scheduling. And that kind of brings up the classic topic of how do we do optimal scheduling? How do we do optimal register allocation? And then what would be the right phase order to do them in? Or the holy grail of could to do them in, or the holy grail of could you do both of them at the same time? And it's been an unbounded source of PhD theses over time, I think. Yeah, what's cool, you're talking about like job shop scheduling, and even the term just-in-time is a concept that was created in manufacturing by Toyota,
Starting point is 00:02:59 and then borrowed by the compiler space. So kind of a small world phenomenon here, since I work at Toyota. Yeah, what's interesting is that, like, we talk about all this early history of trace compilation, but eventually there's a lot of concepts that were shared between something called dynamic binary translation and trace compilation. So DBT, dynamic binary translation,
Starting point is 00:03:20 was, you know, affected by things like Dynamo in 2000 and Dynamo Rio in 2002. And then trace compilation was more for dynamic language virtual machines. So something like the France and Gaul tracing used in TraceMonkey, right, in early Firefox. And then LuaJIT is another popular one. And then some PyPy flavor of metatracing.
Starting point is 00:03:39 There's a lot of interesting stuff there. And there's also a lot of kind of static instrumentation that's existed. So there's things like pin on x86 and atom on alpha that are kind of really the original interesting thing. So in like 1994 was when atom came out by Alan Eustace, who used to work at Google kind of interesting thing. He jumps from space and everything. So he's kind of an interesting character there. So if you work on compilers, you can also like jump from space after 20 years or something. He did that in 2014. Yeah. And this also goes back to something that we had talked about, or at least touched on
Starting point is 00:04:11 previously was that they're, they cashed these micro operation tracelets inside of the Pentium four architecture. So we're kind of going back to some things that we touched on before, but didn't get to talk about in a lot of detail. Yeah. And one interesting thing about dynamic compilation compared to static one is you kind of want to try to discover information, right? So how do you find things out about a program? We've talked about that in previous episodes. So in static compilation, you look at the code, and you're like, well, this is probably generally true. So I'm a favorite, but there's a fallback pattern, whatever. It's pretty rare in static compilation that you end up doing speculative optimizations at a very large scale, right?
Starting point is 00:04:48 So you always have code that's correct and you never have to undo any mistakes that you've done. So you only really speculate on a few instructions and then you do something like a C move, like a conditional move or something like that. Whereas with more dynamic approaches like trace monitoring, it's a way to dynamically figure out what the real code is doing while executing.
Starting point is 00:05:07 And so instead of looking statically and saying, well, we think the code is going to do that, but integers can hold a lot of values. You look at the actual values going through the integers and whatever else, and you figure out, hey, this is what the real code is doing while executing. But what's interesting when you do that dynamically, it's really workload dependent. Dynamically, you only measure what you actually see and so that can lead to suboptimal choices if you make them kind of forever if you record what the program's doing in real life and then you keep that forever you take it to the bank then if your program has phases where it does some kind of behavior for a while and then shifts to another behavior after the boards then that workload that you measured isn't representative anymore, right? So we can do a comparison of this
Starting point is 00:05:48 kind of dynamic tracing recording of information to profile guided optimization that folks use on static compilation approaches. So if you're, you know, you do PGO, you're supposed to do that on the entire program, and it's supposed to be representative, whereas when you're tracing, you're doing that live, right? So when do you know, as you trace dynamically, that what you've recorded is representative? Yeah, that's a great point. And there's also this inherent trade-off here that's hiding beneath the surface of you can trace and just-in-time compile things very fast, and then correct yourself later, or you can wait longer to get more perfect information or more confidence in the information you've seen by observing the
Starting point is 00:06:30 program for a longer period of time. So you got to trade these two things off. And there's an interesting pipeline like nature to tracing and just in time compilation. So if you can be super low overhead, then you could compile very quickly on things like mobile devices. You can make like a no stage compiler. You can make something where as the bytecode operations are executing, they are being kind of just splatted out as native machine code. That's kind of a cool trade-off point in the space. So the pop-up idea here is how hard are we going to try to specialize for things that we saw? And what do we choose to bet on continuing to be true over time? And what does it cost for us to fall back and try something
Starting point is 00:07:09 different when our speculation is off? Again, we're tuning that speculation knob and how aggressively we're going to try to speculate and what we try to speculate on. Right, right. So the concept here is act on information that you saw. So stuff that you can do is things like specialization and guard, right? So you in a linear trace, for example, the number of assumptions that you do increases as you execute. So for example, if you have like an always taken control flow, right? So equivalently some virtual function call that's always calling the same virtual function, then a tracing compiler is going to treat all of these back-to-back branches as just one set of straight line code, right? Because it's always taken or you're always jumping to the same
Starting point is 00:07:48 function. It's kind of like straight line code when you execute it dynamically. And what it's going to do dynamically is check on the trace if that was a bad assumption, right? So it'll say, hey, I'm going to execute all this back-to-back and whenever convenient, I'm going to check if any of those assumptions were wrong. And on the cold path, when it's convenient again, it's going to undo everything if it turns out that all those branches weren't true or something like that. That's usually called a bailout, right? And what's interesting with the bailout, it's you're basically being surprised by new information that you didn't expect, something that wasn't represented when you were tracing. So what you do is you trust
Starting point is 00:08:23 that everything you've traced is going to happen, that it's all going to go well, and you're going to be able to optimize for it. But then at runtime, when convenient, you're going to check and bail if you were wrong. You're going to try to do that on the cold path and have the rollback on the cold path. So the hot path, the one you're sure is going to execute, we're pretty sure, is really, really optimal. Yeah, and I think that really captures the essence of what trace compilers are usually trying to go for here. There's interesting implementation angles to it as well. So one of the cool things about trace compilation
Starting point is 00:08:53 is that you can bolt a tracer into an existing interpreter. So in a lot of language environments, you'll have an interpreter that's executing ops at a time. And then what you can do is you can hook in a tracer, which observes what the interpreter is doing, and then kind of make some machine code on the side based on how the interpreter ran. And what's also cool is you can implement just a subset of the operations. Let's say that you had some very common bytecode operations inside of your environment, and then some very uncommon ones. You could implement only the common ones, and then you
Starting point is 00:09:24 could just end your trace whenever you hit one that you hadn't implemented because it was uncommon. So you could build up this tracing capability over time because the system is built with this assumption that you can bail out of the trace for whatever reason and go back to interpreter land. So you can imagine making a JIT
Starting point is 00:09:42 that just covered malls and ads and could make a fused multiplier adder and specialize for like one particular common type. So let's say you had many types in your language. You could specify it just for integers or just for floating point operations for numerical code and then just bail if you saw any other types show up at runtime, say if you were in a dynamic programming language. And so what you need in order to do this
Starting point is 00:10:05 is some way to record what you assumed. And then your trace can say, oh no, I saw something that I didn't assume and then call out to the runtime for help or go back to the interpreter mode. Right. And what the trace and variance do is they say when traces call to other traces. So let's say I recorded some set of ops and I recorded some set of other ops, and now I'm jumping from the first one to the second one. Now I need to make sure that the assumptions between those two things are lining up, right? And so this is called trace fusion or building bridges between traces. And you can do this if you know the invariance, what must be true on the exit from the first trace and the entry to the second trace are lining up, are compatible with each other. And that's useful to know because traces can
Starting point is 00:10:49 represent the state of the program very differently. So if you have different optimization levels, you might be holding things in different register and stack locations where one puts everything on the stack and the other one is assuming things come in in register form and is aligning stack slots. And in that case, transitioning from one to another required mapping the one view of the program all register based into the other one that's stack based, for example. You can also do stuff like burn global properties. So things that were like static values that were, you know, kind of global state. You could burn them into the trace. But then you also need
Starting point is 00:11:25 to mark the traces invalid if those global values were changed at some point, right? You need to be able to say, oh, I'm freezing on this global value being equal to 42. Someone changed it to 43. Let me go say, okay, that trace is no longer valid. Right? Yeah. And what's interesting when we talk about these things is there's a lot of terms in dynamic compilation that people might not be familiar with. So let's go through kind of a big collection of all those concepts in space and try to give you a picture of the whole elephant. So one interesting concept is method inlining. So in a trace compiler, you just kind of execute through methods, right?
Starting point is 00:12:00 You just go through them. You don't really care that it's a function or not or a method. And so inlining is kind of a natural path of compilation when you do trace compilation. It's just linear execution where the jumps disappear. That's one interesting aspect of trace compilation, that you do method inlining just by construction. Another interesting concept is called tail duplication. So as you run through different execution paths, the longer that path is, the more potential avenues that you're forming. It's like a road and there's all these avenues you're taking down that road as you dynamically execute things.
Starting point is 00:12:32 And so what you end up doing as you do that dynamic execution is you duplicate what's called the tail, right? The end path of each trace, right? Because imagine you're tracing through code, different traces will duplicate that tail. You're going to have multiple copies of it between all these linear traces. One clear example is if you have a condition and then a common return path, you would form one trace for one side and another trace for the other side of that condition. But they shared the return path. And so it's duplicated in both.
Starting point is 00:13:00 And that's called tail duplication. Now, what's interesting with that is it can be good because it encodes the provenance as a source of optimization. So the common path might execute differently based on where you came from. But at the same time, what it does is you're duplicating code, potentially exponentially when you do that, and it can cause some exponential explosion in trace compilation because conditions can stack up,
Starting point is 00:13:22 making a quick two to the n stacking, which is kind of a problem. So you have to find a balance between that. Right. And I think a lot of what we're discussing here is what happens when you kind of strictly linearly follow where the interpreter goes, right? So the interpreter executes at a PC, and then it executes at the next PC. When you do a call, you're kind of changing from one functions program counter to the next functions program counter. This is kind of within a spectrum that we think of as region forming. So a linear region is one particular region you can form where it just kind of marches through the program order according to some program counter value. Now there's trace formation, where it's that linear marching through the program. Then there's region formation, which says, okay, I see a conditional, So you can see how it's a spectrum of chopping things out for each of the different paths. So you can see how it's a spectrum of chopping things out for each of the different paths.
Starting point is 00:14:08 So you can see how it's a spectrum of chopping things out for each of the different paths. So you can see how it's a spectrum of chopping things out for each of the different paths. So you can see how it's a spectrum of chopping things out for each of the different paths. So you can see how it's a spectrum of chopping things out for each of the different paths. So you can see how it's a spectrum of chopping things out from the original source program in a sense, or maybe inlining plus chopping things out because tracing, like you said, can jump between different functions and, like let's say one trace could jump into another in its linear sequence, then that would form a region instead of a linear trace, right? Because a trace is kind of straight line. If they join together at some point, we'd consider that to be kind of region formation instead. If you jit it only at method boundaries, then you would have
Starting point is 00:15:00 a method jit. That's like term of art, what we call something that takes a method, maybe the methods it calls to, maybe inlines them, but that's like a trivially defined kind of region. And then kind of for the hybrid approach, you could imagine forming a trace, observing what the interpreter did, forming the little line,
Starting point is 00:15:18 and then trying to merge the lines of execution together, maybe with join points and chopping out the paths we didn't actually take from the method. So like a conditional that was never executed one way after n times we observed it, and then we'll treat like the other side of that conditional is dead code. This is kind of this way to think about the spectrum of region formation. There's also this notion of packing as much known hot code together, which can be more challenging when you're discovering things on the fly. Like you said, with tail duplication, oh, I saw we went down this path. Let me generate new code
Starting point is 00:15:48 for that. We maybe touched on in the past systems like Bolt that are trying to pack together code into the closest regions for locality in instruction memory and having them be prefetchable so that you get instruction cache TLB hits, right? But then at the end of the day, if we have to spread all this code out because we're discovering it on the fly, that can be a penalty for our instruction cache locality. Right. And what's interesting here is the idea of tracing through something like a switch, like a C or C++ level switch statement. You can imagine only taking one out of N of the paths in that switch, right? That can become pretty rough when you have a pretty large N, when you have a big switch like an interpreter itself.
Starting point is 00:16:29 If you were trying to trace to an interpreter, the interpreter tends to have a lot of opcodes. And that can be rough if you just trace one over N. So an important concept related to trace formation is when do you trigger compilation, right? You were talking about kind of fast and JIT quickly or record for a long time and then JIT. So when to trigger is really, really important. And what's interesting is when you look at a variety of papers, they all have different
Starting point is 00:16:54 policies on what they found to make a lot of sense. And honestly, it ends up being very workload dependent when you should actually compile. So, you know, if you're doing dynamic binary translation machines, something like Transmeta, then, you know, finding a universal solution is not obvious, right? But it might be a different solution from when you have a JIT for a very specific workload, and you can find optimal parameters for your workload, right? So imagine you have a generic machine that does trace compilation, it's supposed to run everything, right? So when does it trigger compilation is, you know, really depends on what that everything is, what it is that user actually does with the machine. Right. And I think we've talked about it in the past, but dynamic binary
Starting point is 00:17:32 translation is a very difficult problem, because you're pretending to be a particular architecture or kind of machine, and you have to generate new assembly. And you have to do that very quickly, if you want kind of machine level responsiveness, right? So that's maybe where it gets the most tricky versus something like Java virtual machine or more of a language level virtual machine where told us that the when trigger had to be pretty early for them because they didn't know, for example, when they were jitting, say, the mouse driver code. And if they didn't jit fast enough, then the mouse wasn't responsive on the screen. It was kind of janky. So they had to jit pretty quickly, and then they didn't know what else was being interpreted in their machine. And so they kind of jitted everything pretty quickly. And that meant they couldn't have a lot of tracing, couldn't optimize all that much.
Starting point is 00:18:28 All right. So another thing that's interesting here is how do we gather type and control information? So in some systems, instead of having the compiled code or the jitted directly, and then rejit over time, you might just start with something like an interpreter, right? If you're not sure, hey, I want to start executing, but I don't want to jit everything, then just, you know, interpret the values that are coming in, and then start tracing through that. That's one way to get started. Yeah, like if your operations are super generic, so for example, getting a property in Python, for example, can visit like a ton of different objects and do a ton of different things. And so in order to specialize it, you probably want to see what it does the first time. Once you have confidence, then you might JIT for it. So this is a concept
Starting point is 00:19:09 called monitoring, right? So we look at what happens when we're running an interpreter mode. So we go into a monitoring mode where we start recording things that happened into a side buffer, or you can JIT on the fly. But you know, if you want to make sure that when you JIT compile something into native code, that you did something that probably will be reusable over time, you would wait longer as we've been discussing. So monitoring mode lets you, it slows you down a lot because recording stuff into a side buffer, but you're recording all this information so that you can make better choices and then get return on your investment of writing this stuff down. There's also this idea of type-based stubs. So like in a dynamic language, you don't know necessarily what the type of objects are.
Starting point is 00:19:49 You just know it is a object. So stubs are things that can record the type that was seen in line inside of the bytecode potentially, or on the side as part of the baseline emitted JIT compiled code. So there's a bunch of approaches, but this idea that, okay, the first times that it's running through, I'm just scribbling stuff down so that later I can JIT compile it. That's a pretty common approach. And nothing stops you from having on disk info about what happened in previous executions that you could mix with your new runtime observations. But often in practice, we just expect that the JIT will discover that stuff over time. We just kind of toss it all away, load it back up and have the interpreter kind of re-derive it from scratch. There's also the concept of choosing how aggressively to specialize traces for the values we observed. You touched on this a little bit before, but what we can do is we could just make like little blobs of assembly, and then we could spit out little function call pointers that call to these different blobs of assembly all in a row.
Starting point is 00:20:45 This is called a call threaded interpreter. You kind of build up these ops and these ops probably have some convention for passing stuff to each other and registers. And then you just kind of call them in a big sequence. And what's neat about that is you might not even need to generate native code. You could just put a bunch of function call pointers into a big array and then call them in sequence that they can talk to each other. But then you'd actually need to form the calling conventions
Starting point is 00:21:08 between these little blobs of code that you form in this pattern. And you can form them offline based on, for example, how you see them being used. But then there's fundamental questions here like, okay, I saw an integer four when I was monitoring. I saw it maybe three times. Do I wanna to assume that that
Starting point is 00:21:25 value is always four or is it just an integer that happened to be the number four these past three times? It's kind of simple. If it's a parameter, you can think about, do I want to specialize on the particular value for the function parameter that I saw? Or how many times is it going to be before I'm convinced? No, I'm pretty sure they're always passing four in that parameter. Yeah. And when we talk about this with folks and you say like, oh, you know, we can specialize on integer number four showing up. It seems like kind of a far fetched thing. Like why would you optimize a disinteger, which can hold three, two bits of value or something would always be four. Well, imagine you're tracing an interpreter and that's the running program, like kind of inception.
Starting point is 00:22:11 Think of it here like you have, say, Python or whatever, like some dynamic language that is interpreted that you're running inside a trace compilation. So, again, inception here. And the numbers in that interpreter determine things like control flow and constant propagation, right? So, you know, tracing an interpreter with an interpreter and things like that, really inception. But what you might notice is that four is a really important bytecode number. And that happens pretty often when you look at actual programs where, yeah, sure, the integers can span any value, but a lot of them are just used to determine control and things like that,
Starting point is 00:22:38 or determine particular avenues the code's going to take. So how many times do you want to see the number four before you believe that it's always going to be four? What do you do there? So that's an interesting thing to do with trace compilation. So another interesting terminology you kind of often hear is the term splat jitting. And that comes to what you were talking about earlier. It's a bit like what QMU does. So the QMU, the virtual machine emulator type of thing.
Starting point is 00:23:00 So it does kind of jitting basic block by basic block, right? So no jumps in between. And it generates those blocks dynamically, but it kind of jitting basic block by basic block, right? So no jumps in between. And it generates those blocks dynamically, but it kind of splats them. They're kind of flat one after the other. And that's why it's usually called a splat jit. And that's opposed to something that works harder to optimize bigger regions before code generation. So trying to join multiple basic blocks together. Right. And especially in a jit compiler environment where you have different tiers in the splat JIT level, you might measure how many cycles does it take for me to compile
Starting point is 00:23:30 this many ops, right? Like you really want it to be as fast as possible. So you're just blitting stuff into a buffer that you can use as your JIT compiled program. So it's avoiding the interpreter dispatch overhead, right? If you have a big switch before every op you implement, an indirect branch overhead that's very unpredictable, that's going to cost you something. And also managing the protocol for, you know, passing things between these different bytecode operations. These are the things that usually cost you
Starting point is 00:23:57 that you want to try to avoid as early as possible because they'll definitely slow you down. Another thing you can do is chop bigger traces. We kind of talked about tracing through function calls. So you can chop bigger linear traces up into trace lits. You can basically think about tracing basic blocks, which I think QMU actually does. And that means that you can revisit
Starting point is 00:24:17 your assumptions made for the basic block and you pay a less large cost because you're compiling a smaller amount of stuff, right? You're saying, okay, I only have this many instructions that I'm doing my compilation for. It's not that expensive to redo compilation for a handful of instructions. So I'm going to chop my traces into smaller pieces so that I can revisit my assumptions more quickly and easily. All right. So I think we also want to throw in some of the scary concepts here because trace compilers are very cool, but they're also have some difficulties in their implementation. So there was this concept that I
Starting point is 00:24:50 thought was really neat of re-entering the virtual machine when you're in trace. Okay, so you're executing in your assembly code that's trace compiled, but then you need to call to the runtime for help. But then the runtime itself needs to call some trace compiled code. You can imagine a higher order function or something where you say, well, I'm executing a map. I want to call this other function. This function is not compiled yet. Please help me call it.
Starting point is 00:25:15 And then you have this recursion between code that you compiled, then the runtime that's helping you, and then code that you compiled. And you kind of stack these ad infinitum or ad nauseum. And then something goes wrong down way at the bottom. And TraceMonkey, I think they called these deep bailouts, right? And so what happens is you have to figure out how to unwind through all those various layers and get back to a safe point where you know, okay, this is the set of assumptions that's okay to pass back to the interpreter at this time, even though I went through the runtime and back into native code and through the runtime. So I think that there's definitely some element to this,
Starting point is 00:25:55 which is complexity that's also present in method JITs. But I think traces are even more interesting because they can kind of just inline multiple methods calls and burn in more assumptions over time. And so some of that gets exposed to you more in higher volume when you're building a trace compilation system. Another scary thing is when you trace compile loops. So imagine you see a tail call and you wanted to basically optimize the tail call so it ran like a loop. Now you might have to create like a thousand stack frames when something goes wrong in the thousand and first iteration. So reifying all the stack frames can be scary. There's also the fundamental thing of when you
Starting point is 00:26:34 start trace recording, you're assuming that your trace will probably go back to a loop header. And if it never does, then you just get trace divergence where you ran in monitoring mode for a really long time. And then you never actually trace divergence where you ran in monitoring mode for a really long time. And then you never actually got to reuse the trace in the recording that you made. Yeah, cool. When you record a bunch of decisions and they're invalidated, so you don't overfit for those decisions again. Well, you've learned a mistake. You're like, hey, I traced this and then it never happened or it keeps changing as multiple modes.
Starting point is 00:27:09 You learn something that you don't want to do. but how long do you keep track of these mistakes, these things you don't want to do, right? So how precise do you want to be as well? That's a really tricky thing, right? Because you don't want to, if you have a bunch of mistakes, you're like, don't make that mistake, don't make this one, don't make that one, but you don't want to be using too much memory to keep track of them, right? So how long do you keep track of them and how much memory you use is pretty tricky. So when you and I work on the dynamic binary translator, we use 64 bits to represent this information, right? So we said like, well, in this region of code,
Starting point is 00:27:33 like these are the types of problems that occur, right? Like we see a lot of concurrency happening or we see a lot of that type of floating point math that's kind of tricky to optimize. And you use 64 bits and each of those bits represented a mistake not to make or a particular type of speculation not to make in that region of code generally. And then you would blow away this list of mistakes from time to time, because, you know, some programs have phases, right? If you're running a whole machine,
Starting point is 00:27:58 maybe the program's doing something completely different. So you just kind of periodically blow that information away. Right. And then you have to feel like you're going to rederive it quickly enough that the mistakes are not going to be biting you if you actually throw that information away. Right. Yeah. I think that also relates to there's some scariness in having a really long tail of workloads, like the corpus of the whole web for TraceMonkey as an example. There's kind of an assumption in trace compilers that maybe most of the time is spent inside of inner loops. And then you kind of run into the tyranny of flat profiles, right? Like if things actually have flat profiles, then you might not actually get to focus on the key inner loops that are creating most of the time running in the program. So you kind of
Starting point is 00:28:40 have this question of, I start tracing after I see this loop do n back edges, n returns back to the top of the loop. What do you set that n counter to be for the general world? And there was a lot of fun related to TraceMonkey and JaegerMonkey around what to set these kind of heuristic values to for the broader web and for benchmark sets. I think it also ties into things like inlining heuristics and things like this, where in compiler land, we have some basic premises and some benchmarking that we have to do in order to make things look on benchmarks. But then fundamentally, it's difficult to find a
Starting point is 00:29:14 truly perfect number for things like inlining heuristics. But it also dovetails with things like user expectations for performance properties. Like if I'm a user, I want to understand what this compiler system did for me and why. And with tracing, I think it's been my experience that some people find it's harder to debug. It's harder to have a mental model for what this underlying system will do and why versus method jitting where maybe methods are more clearly outlined. This is maybe one of the biggest downsides relative to a more traditional method jitting approach. But in classic fashion, when trace compilation works well, it's like magic, right? Because it's specializing perfectly for your key inner loops and speeding up most of the program with minimal effort and minimal
Starting point is 00:29:59 compilation time. Kind of amazing. All right, so episode zero was on stacks, right? So we got to go back to stacks just a little bit here. So there's native stack. So that's running my C code. And then sometimes virtual machines will keep a managed stack. And sometimes you can smash the two together, right? Like you can keep track of your Java machine or your JavaScript machine or your Python virtual machine stack values on the native stack, unifying the two. Yeah. And I guess speaking of stacks, there's another thing that we can talk about when we talk about traces called on-stack replacement. That would be a whole episode unto itself, right? But the basic idea is you jump into JIT code at a loop entry point when an existing frame is already present. So that's a kind of huge topic on its own.
Starting point is 00:30:43 Right. And tracers are kind of saying, okay, at this particular bytecode, I'm going to start doing my bytecode execution with this chunk of native code I made for the loop body over here. So with tracers, on-stack replacement is maybe slightly less challenging than it is with method JIT compilation, because tracers naturally have this kind of on- on stack replacement entry point. But then there's also on stack invalidation, which is, okay, I'm down in some lower level frame and my JIT code just became invalid because I made some assumption invalid. That means I have to nuke that JIT code because it said, you know, the static value is always 42 and it became 43,
Starting point is 00:31:21 but two stack frames above, I have some JIT code that's assuming it's 42. I have to invalidate that. And that remains as difficult in trace compilation mode. Even when you're tracing into inline calls, you have to say, okay, I'm going to create two stack frames, multiple stack frames, and then I have to invalidate everything appropriately. And we talked a bit about this notion of bolting the tracer on. So you can grab information from the guts of the interpreter that's not easily possible to re-derive. So things related to, say, the heap state at a particular time. There was this funny thing where TraceMonkey had if-defs to compile the same interpreter in three different ways. And all those versions were
Starting point is 00:32:01 compiled into Firefox in a particular way. Yeah, I remember that code was a hairy pile of code. That was hard to read through. Yeah, it was always a little tricky. All right, so quick rundown of recap of kind of the W questions that are involved with trace compilation. We're always wondering what's going to get compiled? What causes JIT compilation to start, continue, and finish? When do you trigger the monitoring of bytecode execution that you'll trace, for example?
Starting point is 00:32:30 When do you choose to compile after gathering a bunch of data, but how much, how long? What do you retain long-term if there's phasic program behaviors? What code or trace or information about mistakes you made do Do you retain across, say, a global garbage collection phase? And how do you decide when compilation is going to pay off or be interesting or be stable? Like what's the back edge trip count? I went back to this bytecode as part of this loop. How many times do I need to do that before I make a decision and how that shows up in benchmarks? Yeah. And so the aspects in those W questions are really two things here. There's a question of performance versus user expectations, right? And we're always trying to balance those.
Starting point is 00:33:11 And one thing we didn't mention at all in this whole discussion is security and maintainability, which are huge issues when you talk about these types of programs, right? Like you want them to be very secure in most cases. And then, you know, maintainability is pretty tricky as you start adding heuristics and learn new things about how you should optimize. It gets pretty hard to maintain things if you're not careful. Yeah. And I think we've touched on the fact there's this spectrum from doing method jitting with chopped out portions of the control graph to region formation and then linear traces
Starting point is 00:33:41 that go through multiple methods, right? So now we've kind of touched on each point in that spectrum. Yeah. And so what's the essential difference between method gitting? Well, there's a loop focus versus straight line code focus, right? So focus on chopping unused paths to maximize early assumptions. Right. And then there's the resulting fact that if things are stable
Starting point is 00:34:02 and you're friendly to the initial assumptions that you made, then you can save a bunch of useless work or recompilation work, which is maybe one of the classic trade-offs in aggressive speculation. Yeah. And so there's probably a bunch of other interesting stuff to look into that we don't have time to talk about today. So, you know, LuaJIT is interesting. So Slava has this really interesting talk on it, which is kind of great, right? There's the notion of code bases that are super small, such as LuaJIT, and really compact and elegant in a way. But when you get into it, it's really hard to ramp up, right? You kind of have to be a super genius with a lot of hours to ramp on it because it doesn't
Starting point is 00:34:37 have the traditional modularity that you'd expect out of more kind of enterprise-scale code bases. Yeah, when I think about like maximum cleverness, I often think about LuaJIT. We know cleverness can be a double edged sword, but if you love to nerd out, it's also hyper impressive. And the resulting performance is often really nice as well. I remember some cool things about LuaJIT,
Starting point is 00:34:58 like a custom application binary interface between the interpreter ops that was created to minimize the amount of register shuffling around that was required. I think actually interpreter loops are one of these examples where writing it in handcrafted assembly can make sense because traditional register allocation can have a harder time ascribing meaning to registers when there's this giant soup of code and operations inside of this giant interpreter loop switch. So there are some cases where register allocation, even though it's a modern Marvel, sometimes we can still do better by hand. Yeah. And so another kind of cool thing to look into is PyPy, right? So tracing the metal
Starting point is 00:35:35 level type of thing. And so what's interesting there is that you have an interpreter implementation language, and it can be fairly directly lowered to lower level primitives. But what's cool is being able to grab the representation of the program to form fragments of a broader trace, right? So you can annotate an interpreter where packages can be formed and so on. Yeah. And I think this is kind of a holy grail in a sense, because the objects that you use to build your interpreter and the objects that you're hosting inside of your interpreter can be unified in their representation, right? So any implementation level objects are sharing a unified GC infrastructure with the implementation. So there's always this underlying question of how do you get
Starting point is 00:36:15 reuse between these execution modes, between the interpreter, maybe a baseline JIT, maybe a optimized JIT. Conceptually, they all have similar semantics in like the shared runtime model. And there's this idea that you'd have like a copy paster that just took your C++ interpreter code and then turned it into like a splatted JIT mechanism. Sometimes we use self hosted built-ins like native primitives, like array.indexof we might define in JavaScript, even though it's supposed to be a native primitive. And this gets towards the idea of like meta circular interpreters. And there's cool systems that got to play with like Narcissus at Mozilla. And there's cool things I've heard about Maxine Java virtual machine, which is Java virtual machine implemented in Java. There's other
Starting point is 00:36:58 cool systems out there. We also don't have time to touch on in detail. Like there's the hip hop virtual machine, which is a tracelet based jitting machine. And it did some aggressive specializations for things that they really cared about in their workload. And that kind of goes back to our discussion on basic block versus region formation versus method jitting. So it's a very rich and cool space out there.
Starting point is 00:37:19 All right, cool. Well, that was a cool chat. I'm glad we got to do another podcast again. And let's try to make sure we don't take another too many months to record the next podcast. Yeah, hopefully everyone likes this discussion of the basics of trace compilation and the trade-offs involved there. We'd be happy to hear your questions, comments, and errata on Twitter. Yeah, TLBHIT is our Twitter handle and TLBH.it is the website. Great talking to you again, JF.
Starting point is 00:37:44 Yeah, same here. Later, my friend. See you later.
