TLB Hit 💥 - Episode 0: mov fp, sp
Episode Date: November 2, 2020The stack, and how it relates to TLB Hits. ...
 Transcript
 Discussion  (0)
    
                                         Hey, JF, you want to do a super nerdy podcast?
                                         
                                         Of course, in computer science, the hard problem is naming things.
                                         
                                         So what should we call it?
                                         
                                         Oh, hey, Chris.
                                         
                                         Yeah, that sounds cool.
                                         
                                         Let's go with something like TLB hit.
                                         
                                         And we can get one of those sweet.it domain names, TLBH.it for TLB hit.
                                         
                                         Yeah, that's a good one.
                                         
    
                                         All right.
                                         
                                         I guess if we name it that, we should probably start by talking about memory subsystem stuff. Yeah, sure. Like TLB hits, what better way to
                                         
                                         foster TLB hits than the stack? The stack basically always TLB hits. So let's talk about the stack.
                                         
                                         Yeah. Stacks are awesome. Yeah. We should get like a Twitter handle. I think that's important.
                                         
                                         And I think at TLB hit is available. So it's not
                                         
                                         anymore. And if people want to send us feedback, they can hit us up on Twitter. See what I did
                                         
                                         there? Yeah. Hit. I got it. All right. I guess we should also give a disclaimer that, you know,
                                         
                                         we only know so much and we'll try not to say things that are wrong. And we'll also try to
                                         
    
                                         follow up with errata, you know, like corrections and clarifications as we find out about things that we said wrong.
                                         
                                         And we'll put them up on the website. We try to do that lifelong learners thing. And we're always
                                         
                                         trying to push the boundaries of things we know about and have conversations about what we're
                                         
                                         learning and what's interesting to develop and hone our understanding. So hopefully everybody's
                                         
                                         on board with that.
                                         
                                         Yeah, that's good. That's good. Just like a good CPU will have erratas and side channels,
                                         
                                         but in this case on Twitter. Okay, so let's get started.
                                         
                                         A good CPU always has side channels. Yeah.
                                         
    
                                         I know. I know. Yeah, it's for that speculation. Let's get started. What's the stack? So we named this episode moveFPSP and you usually find that in the prologue of ARM functions. And in the epilogue, you have the reverse, right?
                                         
                                         moveSPFP. And those are special instructions that manipulate the stack. Let's try to figure out
                                         
                                         what that means. The compiler spills values that registers can't hold onto the stack
                                         
                                         because it needs to do work. It has like say, you know, 16 or 32 registers to hold values and it
                                         
                                         just, you know, it has too much work to do. It'll put stuff on the stack and then load it back later.
                                         
                                         Functions do that a lot because they have their own state and sub functions they call have their
                                         
                                         own state. Since subroutines can recurse without bounds, you potentially just would need
                                         
                                         an infinite set of registers if you didn't do that. And there's ALU registers for arithmetic,
                                         
    
                                         and then there are FPU registers for floating point. Those registers contain stuff, right? It
                                         
                                         has like pointers to data, pointers to code, return addresses, stuff like that. And it's
                                         
                                         usually contiguous memory. The stack itself is contiguous.
                                         
                                         It's usually allocated on a per thread basis.
                                         
                                         So that's a high level where the stack is.
                                         
                                         Is that right?
                                         
                                         Yeah.
                                         
                                         So I guess, so ALU and FPU registers.
                                         
    
                                         So sometimes people talk about these GPRs, right?
                                         
                                         General purpose registers that could hold either one as like scalar values.
                                         
                                         But then on some of the processors we tend to work with, you can have these dedicated registers for floating point values as well, right?
                                         
                                         So they can kind of split the data paths.
                                         
                                         Or SIMD.
                                         
                                         Or SIMD, yeah, for really wide ones, right?
                                         
                                         And then I guess later we'll talk about calling conventions and stuff too. So one question I had was, so I think you said the prolog,
                                         
                                         we're moving the current stack pointer
                                         
    
                                         to the base pointer, right?
                                         
                                         And then on the epilogue,
                                         
                                         we're moving what was the base pointer
                                         
                                         back into the stack pointer.
                                         
                                         So we're kind of undoing the thing that we did.
                                         
                                         So we locally kind of manipulate the stack pointer
                                         
                                         and then we kind of undo everything,
                                         
                                         kind of roll it back to where it previously was
                                         
    
                                         when we're done.
                                         
                                         Right, right.
                                         
                                         Okay, cool.
                                         
                                         So then I guess there's a question of like,
                                         
                                         what mechanisms actually do this stack stuff
                                         
                                         in the processor?
                                         
                                         And so, like you said, there's this frame pointer,
                                         
                                         BP or base pointer, and then there's the stack pointer SP.
                                         
    
                                         So on x86, there's RBP and RSP.
                                         
                                         And x86 has either the R prefix for the 64-bit mode or the E prefix for the 32-bit mode.
                                         
                                         And then on ARM, you have FP and SP registers. Right, right, right. The usual convention is
                                         
                                         the frame pointer doesn't change during a function's execution. The frame pointer points
                                         
                                         at the function's frame,
                                         
                                         its scratch pad for stuff, like the place where it puts data
                                         
                                         when it needs to spill it.
                                         
                                         And the generated code addresses slots on the stack
                                         
    
                                         based on offsets from the frame pointer.
                                         
                                         So from the frame pointer, you say like plus 4, plus 8,
                                         
                                         or something like that.
                                         
                                         Those are slots off of the frame pointer for that function.
                                         
                                         And when that function calls another function,
                                         
                                         there's a code sequence which sets the frame pointer
                                         
                                         to the previous stack pointer.
                                         
                                         And when a function returns, the inverse is done.
                                         
    
                                         So in that sense, the stack is kind of like a linked list.
                                         
                                         Right, because it's putting this pointer on the stack
                                         
                                         that basically says this is where the frame pointer used to be
                                         
                                         before I came into
                                         
                                         my routine. Exactly. Cool. Yeah, I guess you can also compare this kind of operation to this notion
                                         
                                         of a stack machine. So if you, you know, in CS class or whatever, learn about machines where you
                                         
                                         can kind of push two things onto a stack and then do an add operation that'll consume the top two
                                         
                                         things on the stack, you could compare
                                         
    
                                         that to what's happening in a traditional processor that we use today, where it's kind of
                                         
                                         expanding the stack as a single operation. It's making a bunch of slots by making the stack
                                         
                                         pointer move down. And then it's kind of playing with those slots relative to the frame pointer
                                         
                                         or base pointer, whatever you want to call it, right? playing with those slots relative to the frame point or base point
                                         
                                         or whatever you want to call it. Right. So, so it doesn't have to strictly do everything in stack
                                         
                                         order for pushing and popping, consuming these values. Right. And so then there's kind of this
                                         
                                         distinction of a stack machine versus kind of scratch pad area style that happens in a stack
                                         
                                         like fashion for subroutine calls. Right? And there's things that matter also like
                                         
    
                                         whether instruction cache pressure plays a role
                                         
                                         and these kinds of things where push instructions
                                         
                                         could be really short on x86,
                                         
                                         like a single byte is the push instruction.
                                         
                                         So if you're frequently doing these operations,
                                         
                                         manipulating the stack,
                                         
                                         if it's very common for programs to do that,
                                         
                                         we might even optimize our instruction sets to try to make these operations extremely fast. Or maybe on ARM, you'll have like a push multiple
                                         
    
                                         values instruction, even though that's a little bit sisky. You know, that's something that you
                                         
                                         do so commonly, it might actually make some sense, right? Yeah, in ARM v7, it had an instruction
                                         
                                         that allowed you to push like 16 registers. So all the general purpose registers, you could just
                                         
                                         push all of them if you wanted to, and increment the stack pointer it was kind of magical push all in just like poker
                                         
                                         yeah exactly it was like a bit mask and you say here's the ones i want to push and it just pushed
                                         
                                         whatever but it's risk right it's kind of cool right exactly it's reduced yeah it's a reduced
                                         
                                         set of registers compared to other sets of registers you could use. Cool. Okay. So I guess there's
                                         
                                         also a bunch of compiler optimizations that are related to stackiness, right? So by moving things
                                         
    
                                         onto the stack, the stack is often a very hot region because the code is constantly manipulating
                                         
                                         within your given subroutine what it's working with in the stack. So moving things into the
                                         
                                         stack can be
                                         
                                         more efficient than heap allocating, both because of locality, but also because then you can avoid
                                         
                                         potentially costly memory allocation subroutines like calling into malloc, which could be,
                                         
                                         you know, hundreds or thousands of cycles, depending. And so when they're in that scratchpad
                                         
                                         area, you know, that has this hot locality, the values are also tracked precisely in a data
                                         
                                         flow sort of style. So if you kind of say, oh, these are random heap pointers and they might
                                         
    
                                         alias as you bring them kind of closer in to be more known to the computation in the compiler,
                                         
                                         the values might become more trackable kind of as, you know, if you're familiar with SSA at all,
                                         
                                         SSA values instead of just being arbitrary memory references.
                                         
                                         So then when structs end up making it onto the stack,
                                         
                                         then the component fields inside of the struct
                                         
                                         can actually be broken apart
                                         
                                         and tracked as individual values that are manipulated
                                         
                                         and stored into the stack.
                                         
    
                                         So this is called a scalar replacement of aggregates,
                                         
                                         which is a super cool optimization pass
                                         
                                         where you break the struct apart
                                         
                                         and track the individual fields inside of it. And so when we home those on the stack, we can reason about those fields
                                         
                                         like values that we can, you know, common sub-expression eliminate or dead code eliminate
                                         
                                         and things like this. Whereas if they were on the heap, it might be a lot harder to do that.
                                         
                                         And so then the moral equivalent in, you know, managed languages. Like I sometimes like to talk about JavaScript or
                                         
                                         languages like Java. You do like escape analysis to make sure that the object doesn't actually
                                         
    
                                         leave from the heap into some other subroutine. And then once it's put on the stack, then you can
                                         
                                         actually eliminate whole objects and just track the subfields inside of it. So there's a whole
                                         
                                         area of compiler optimizations that's also related to the stack. So that's pretty fun.
                                         
                                         Yeah, that's really cool
                                         
                                         because it allows you to just explode the object itself
                                         
                                         and think about its component fields individually
                                         
                                         and get rid of whatever doesn't matter in there.
                                         
                                         So that's really sweet.
                                         
    
                                         And it turns out that some compilers
                                         
                                         can also sometimes optimize heap allocations.
                                         
                                         If it sees like it's a really small one
                                         
                                         and you allocated the startup function
                                         
                                         and then you delete at the end,
                                         
                                         then it can like just stack allocated sometimes. And C++ explicitly allows you to do that as of somewhat recently, a few years ago. And so client does that.
                                         
                                         So if you knew an object, it's not guaranteed that you're actually going to put that object
                                         
                                         on the heap that it's going to call the underlying allocator, right?
                                         
    
                                         Right, right, right. And it's surprising to some people, like they see their allocation disappear and they're like, wait, I called my allocator. What happened? Well, it went
                                         
                                         on the stack instead. So that's kind of cool. And then it does SROA and other stuff and might get
                                         
                                         rid of the entire computation. So that's kind of neat unless that's not what you're trying to do.
                                         
                                         Yeah, it seems like a really key optimization to do. You know, if the heap is really part of
                                         
                                         your programming language environment, like if you're thinking about things as just objects instead of as like raw bytes and exactly where
                                         
                                         they go, having that higher level understanding that you can optimize based off of is pretty key,
                                         
                                         I think. Right. Right. And I remember like when I started programming, there's this thing called
                                         
                                         frame pointer emission that was really cool at the time because optimizers weren't
                                         
    
                                         as good as they were now. And in some cases, like in leaf functions or when you enable frame
                                         
                                         pointer emission in the compiler, so you don't need the frame pointer anymore and you can just
                                         
                                         use the stack pointer, right? So it frees up an extra register for the compiler to use. So that's
                                         
                                         kind of a neat optimization too. Yeah. Back when you only had eight registers, I guess, for x86 before x86-64, that extra
                                         
                                         register could go a long way potentially because the stack, even though it's hot in cache,
                                         
                                         it's doing stores and loads to memory locations, right?
                                         
                                         Right, exactly.
                                         
                                         Yeah, I think I used to know of FPO as that flag that makes the debugger way worse, right?
                                         
    
                                         Because, you know, without the ability to
                                         
                                         know how things refer to the base location, then the debug information has to be a lot more
                                         
                                         prescriptive about how you go and grab the value at any given point inside of a function, right?
                                         
                                         Right. Debugging used to be way worse too. And I think nowadays people don't really eliminate
                                         
                                         front pointers. And so debuggers don't really have to deal with it, but they've gotten better
                                         
                                         from what I can see anyways. Yeah, I guess because modern CPUs are
                                         
                                         doing registry renaming and cool stuff under the hood, they can actually have like a micro
                                         
                                         architecturally larger register set. And you're not worried so much about saving that one register
                                         
    
                                         a lot more of the time, although in hot code, you still might. Yeah. Yeah. I guess another
                                         
                                         interesting thing is
                                         
                                         like, there's this idea of leaf functions, right? Like you talked about. And so when you inline
                                         
                                         things, which inlining is kind of the key compiler optimization, as we know, right, it makes these
                                         
                                         big regions for compilers to do their kind of local flow analysis and optimizations on. And so
                                         
                                         we make these big fat leaf functions. And I
                                         
                                         actually wonder, I don't know the answer to this, but how much of program time is generally spent
                                         
                                         in leaf functions in the course of, you know, some set of applications? It would be interesting to
                                         
    
                                         know. Yeah, I don't actually know. I guess we never defined what a leaf function is, right?
                                         
                                         So it's like the function at the end of your call tree. Yeah. So if your subroutine doesn't call any other subroutines, that's kind of a nice property,
                                         
                                         right? Because now you know that basically everything at the end of this stack belongs
                                         
                                         to you and you're just going to pop back up to whoever called you.
                                         
                                         Exactly. And inlining really unlocks that, right? Because you start inlining into,
                                         
                                         you know, not leaf functions and they become the leaf now.
                                         
                                         That ends up unlocking a lot of stuff, right? As long as your working set doesn't become too big,
                                         
                                         right? So that function doesn't end up doing everything. Then it's kind of nicely segmented.
                                         
    
                                         The compiler can know everything it does and it has a good amount of work to do. So that's kind
                                         
                                         of neat. It's kind of a small region in which you can analyze everything. It's like tiny little
                                         
                                         whole program analysis. Right. But then like, why do we even have a stack in that case? Right. So
                                         
                                         like, imagine you inlined everything, couldn't you just do that? Right. And I,
                                         
                                         I used to think that and I, I've realized there's really two main issues to that. One of them is
                                         
                                         you don't know what the call graph is for the entire program and recursion,
                                         
                                         right? And so the first one is like, if you knew where all the calls go, right? So all the virtual
                                         
                                         calls, indirect calls and other stuff in your translation unit, as well as the other ones in
                                         
    
                                         your program. And if you had no recursion, then it kind of seems like you wouldn't need a stack, right?
                                         
                                         Because you know a perfect call graph.
                                         
                                         And even with some of these, you could still avoid having a stack, right?
                                         
                                         Like if you have virtual functions, but there's like, you know, only ever one or two implementations of it,
                                         
                                         then you could reify those virtual calls as two like test and branch to the indirect thing.
                                         
                                         So you could just have no stack at all ever in a program
                                         
                                         if you had all that information, right?
                                         
                                         Do you know if anyone does that?
                                         
    
                                         Yeah, so I guess,
                                         
                                         so first let me just ask about the switch thing.
                                         
                                         So, or sorry, with the virtual.
                                         
                                         So if you have a fully analyzable virtual dispatch,
                                         
                                         effectively it just becomes a switch, right?
                                         
                                         Because you're like, which one of the targets is it? And then you can basically inline what the potential targets are.
                                         
                                         That's, that's cool. That control flow analysis that says like, oh, this is an indirect branch.
                                         
                                         So hypothetically, it could go anywhere. But you know, with things like within a
                                         
    
                                         de-virtualization within inside a translation unit, you could actually enumerate like the
                                         
                                         full set of possibilities, right?
                                         
                                         That's always pretty cool.
                                         
                                         Yeah, so this fully analyzable call graph thing is an interesting computer history property, I'd say.
                                         
                                         So Fortran 77, I think, was classically able to do this
                                         
                                         or the programs were restricted enough
                                         
                                         that you could analyze it.
                                         
                                         And also the compiler that I worked on,
                                         
    
                                         the XLA machine learning array programming compiler
                                         
                                         has the same property
                                         
                                         where the whole call graph is analyzable.
                                         
                                         So you can actually just create a slab
                                         
                                         that's basically the stack frame
                                         
                                         for this whole program you're optimizing.
                                         
                                         And all of the allocations are known fixed size.
                                         
                                         So yeah, there's some of that lives on even today
                                         
    
                                         in some more niche use cases of this
                                         
                                         whole program analyzability, not needing to have strict call frames. So JavaScript engines too,
                                         
                                         I think they would sometimes recur from their call frames through the virtual machine,
                                         
                                         and then it would have to create like another stack or sub stack or things like this these are kind of interesting multi-stack
                                         
                                         problems as well beyond just being able to analyze a single stack yeah definitely that's really cool
                                         
                                         and i i remember programming in fortran and that was kind of cool because like you as a programmer
                                         
                                         you're trying to solve a very specific you know physics problem or whatever and you don't usually
                                         
                                         need any of those tools like recursion or virtual
                                         
    
                                         functions or anything like that when you're like deep in your code, right? And I remember it's kind
                                         
                                         of a neat programming model. Like I actually non-ironically like Fortran for that. It's kind
                                         
                                         of neat to program in. Yeah, when everything is kind of monomorphized, right? Like when you have
                                         
                                         big arrays of fixed value types, then you kind of can know everything about the world and really optimize everything based off of it.
                                         
                                         That was, that's always kind of a fun mode to be in for scientific computing code.
                                         
                                         Yeah, exactly.
                                         
                                         And so like, are those the only issues like recursion and direct calls, right?
                                         
                                         Because there's some languages that use the stack for say fast thread switching, right?
                                         
    
                                         So as well as like something like full stackful coroutines
                                         
                                         or something.
                                         
                                         So if you look at stacks in Go, they're not contiguous.
                                         
                                         They're more like a C++ deck or whatever,
                                         
                                         like a linked list of lists, instead of just being
                                         
                                         one contiguous stack.
                                         
                                         And they're linked to each other.
                                         
                                         It has this really cute, clever x86 code sequence,
                                         
    
                                         which makes it fast to find the previous and next frame.
                                         
                                         So it allows the ghost stacks
                                         
                                         to be just kind of distinct allocations, right?
                                         
                                         Like each page-wise or something is one frame,
                                         
                                         and then the next function has another frame.
                                         
                                         You can put multiple functions in one frame,
                                         
                                         so that's kind of neat.
                                         
                                         Like I remember it can,
                                         
    
                                         or at least it used to
                                         
                                         be able to have really bad perf if you were in a hot loop and you happen to straddle that boundary
                                         
                                         but otherwise it was kind of neat right it did some cool stuff and i think coroutines in some
                                         
                                         language ends up having some stackless stuff like this right when their closure is heap allocated
                                         
                                         instead so i know like the c++ um the c++ coroutines they try to do away with all of the heap allocation,
                                         
                                         but it really depends on your optimization level whether they can do that or not.
                                         
                                         And it's kind of similar for Objective-C blocks.
                                         
                                         I think by default until pretty recently, Objective-C blocks were just always heap allocated,
                                         
    
                                         and they started being stack allocated in the last few years when they could.
                                         
                                         So that's kind of neat, too, where the language doesn't really say
                                         
                                         if stuff lives on the stack or the heap,
                                         
                                         it just kind of lives somewhere.
                                         
                                         And it's kind of neat.
                                         
                                         And because the stack is less constrained,
                                         
                                         it can kind of live in different places,
                                         
                                         especially in Go, right?
                                         
    
                                         And then in some cases, like, you know,
                                         
                                         you remove the allocation entirely, right?
                                         
                                         So it's kind of neat.
                                         
                                         And conceptually, if you want to be able to scale
                                         
                                         your assumptions about concurrency to millions of threads or something like that, you can see how
                                         
                                         you wouldn't want to have huge stacks, right? Because each thread has a stack. And if you have
                                         
                                         millions of threads, then you don't want to be allocating all those stacks, right? So you want
                                         
                                         to be able to have really small stacks and switch between those threads pretty quickly. And so, you
                                         
    
                                         know, that kind of asks the question, question, how do you usually size those stacks
                                         
                                         in the per-thread context that you have?
                                         
                                         Yeah, because if you're doing just tiny little operations,
                                         
                                         like if every single operation in your program
                                         
                                         was conceptually a thread,
                                         
                                         you wouldn't want to allocate 512 KB
                                         
                                         every single time you did a tiny little atomic operation.
                                         
                                         Yeah, it's a really interesting
                                         
    
                                         point. Yeah. So actually the, the term stackless is also funny because I think Python's greenlets.
                                         
                                         So greenlets is like lightweight thread terminology, right? I think it's actually
                                         
                                         called stackless Python. One of the attempts that was made to make greenlit oriented Python
                                         
                                         and in managed languages like Python, all the frames are actually allocated on the heap,
                                         
                                         like even in CPython.
                                         
                                         The frames are allocated on the heap,
                                         
                                         though you still have your native stack
                                         
                                         for your native program
                                         
    
                                         that's doing like the managed VM execution.
                                         
                                         So something that JITs do
                                         
                                         is instead of the interpreter
                                         
                                         kind of running off the native stack
                                         
                                         and then keeping a different stack for the interpreter,
                                         
                                         like that the interpreter manages,
                                         
                                         they can unify together the native stack and the virtual machine's stack space
                                         
                                         so that it puts it all back into like a native thread worth of stack space.
                                         
    
                                         So a frame is really just some space and spaces get linked together,
                                         
                                         you know, in this abstract concept. And both of those things can also happen on the heap. There's no reason that,
                                         
                                         you know, this one virtual memory region that we call the stack is the place that all that
                                         
                                         operation has to happen. Right, right. And like, you know, we're talking about stack size,
                                         
                                         I imagine for GPUs, which have a bunch of warps, like you end up not wanting to have huge stacks.
                                         
                                         But typically on a CPU, I guess the stack is usually
                                         
                                         between, say, like 512 KB and 2 megs on common systems,
                                         
                                         like Mac OS, Linux, and Windows.
                                         
    
                                         And usually they allocate at thread startup.
                                         
                                         So it's part of, say, POSIX or Win32,
                                         
                                         what the default size is.
                                         
                                         And then most of the platforms have an API
                                         
                                         that allow you to change the size.
                                         
                                         And it's kind of weird, because some of them
                                         
                                         can only do that at runtime, and some of them
                                         
                                         can only do it at startup.
                                         
    
                                         So being able to change the stack size
                                         
                                         is kind of this uneven thing.
                                         
                                         And then that's like in user space,
                                         
                                         whereas if you compare the kernel or embedded systems,
                                         
                                         usually the stack size is way smaller, right?
                                         
                                         I remember for XNU it was 16 KB, which is super tiny.
                                         
                                         And blowing the stack is kind of easy in some cases
                                         
                                         when you're handling interrupts and other stuff,
                                         
    
                                         and then some stuff happens.
                                         
                                         You end up blowing up the stack pretty easily,
                                         
                                         so you've got to be careful there.
                                         
                                         The reverse is interesting.
                                         
                                         I wonder what's the biggest stack programs that exist? Like, yeah, that's a good try to think
                                         
                                         like who, who does that, right? There's probably people who ginormous stacks, but I couldn't figure
                                         
                                         out what the biggest one would be. Yeah, that's a good question. I know I've worked in some
                                         
                                         environments that set the stack limit quite low. And then if you compile
                                         
    
                                         in debug mode, debug mode is often spilling a bunch of stuff to the stack or homing it on the
                                         
                                         stack. It's not as aggressive. And then sometimes you'll get a random segmentation fault and it's
                                         
                                         because the frame got bigger than the initial stack allocation was assuming, or maybe you hopped
                                         
                                         over the guard page, which maybe we can talk about in a little bit. Yeah, I guess, but you know,
                                         
                                         when you're allocating this default start size for the stack,
                                         
                                         it brings up this point that I guess you have this default start size,
                                         
                                         but it doesn't all have to be mapped in at the start.
                                         
                                         You can potentially grow this region that you're calling the stack on the fly
                                         
    
                                         as you see that it needs more space.
                                         
                                         Couldn't you do that?
                                         
                                         Yeah, totally.
                                         
                                         Yeah, so I guess there's interesting mechanisms
                                         
                                         to potentially do that,
                                         
                                         both at the operating system level
                                         
                                         and for managed languages,
                                         
                                         you could potentially even like relocate
                                         
    
                                         where your stack segment is.
                                         
                                         Right, like just allocate a page
                                         
                                         and then just virtually map the other ones,
                                         
                                         fault when they get access
                                         
                                         and then map them in as as needed instead right
                                         
                                         yeah that's a potential way or i guess because you know where all the pointers are you could also
                                         
                                         relocate all the pointers that were anything to do with the stack locations yeah right right only in
                                         
                                         managed languages yeah so i guess something that we we were referring to earlier that maybe we can
                                         
    
                                         provide some more context on is the the stack works nicely with this least recently used cache that modern processors have. So younger frames are recently touched,
                                         
                                         right? So there's spatial and temporal locality are two of the things that caches exploit, right?
                                         
                                         When I touch something, I might touch something right next to it. When I touch something,
                                         
                                         I might actually come back and touch it again very quickly, you know, very quickly,
                                         
                                         close in number of cycles or time.
                                         
                                         And so the last in first out nature of a stack and the predictable access pattern, right, where
                                         
                                         we kind of like go down in addresses and then we pop back up if the stack is growing down,
                                         
                                         that lines up with CPU features like LRU caches and address predictors, right, that are trying
                                         
    
                                         to prefetch things that you're going to be pulling in. Right, right. I guess that means that a function's frame is usually in the cache in
                                         
                                         memory as recently used. And then it kind of brings up this broader question of how does
                                         
                                         memory on a computer work at all, right? There's physical addresses. So like when you're talking
                                         
                                         to a stick of DRAM, there's like some values that are going over the wires
                                         
                                         that are talking to the stick of DRAM.
                                         
                                         And those are the physical addresses, right?
                                         
                                         And then what the computer sees
                                         
                                         is this notion of virtual addresses.
                                         
    
                                         And virtual memory was introduced originally
                                         
                                         so that we could share the physical memory
                                         
                                         that was on a system more easily
                                         
                                         in shared environments and things like this.
                                         
                                         But it also offers facilities like protection and things like this, but it also offers facilities like protection
                                         
                                         and things like this.
                                         
                                         But virtual memory really means,
                                         
                                         okay, so I have some address.
                                         
    
                                         And let's say most modern systems,
                                         
                                         I think have like effectively 47 bit addresses, right?
                                         
                                         And that's a value that's really an alias.
                                         
                                         It turns into a real physical location
                                         
                                         and the alias is potentially
                                         
                                         unique to each program. So two different programs hypothetically could have these 47-bit values that
                                         
                                         if the mapping said, oh, they turn into the same physical location, that's actually possible on a
                                         
                                         modern system, right? Right. And so each one of these memory access instructions, like a load or a store,
                                         
    
                                         needs to translate that 47-bit virtual address into a corresponding physical address. And maybe
                                         
                                         you have, or you very likely have less than 47 bits worth of memory in your system, right?
                                         
                                         So it actually might be-
                                         
                                         I find each access too, right?
                                         
                                         Yeah. Like I might have only four gigs of memory, only 32 bits of physical space. But, you know, I can pretend that the space is larger and like page stuff out to disk.
                                         
                                         I can take things and that would be backed by physical memory and store them on disk.
                                         
                                         This is kind of the magic of a virtual memory subsystem, right?
                                         
                                         And so each one of these memory accesses where I take my virtual address and turn it into a physical address has to translate that virtual thing into the physical thing. And that can be expensive because the
                                         
    
                                         operating system has to maintain a big table of how that happens. But there's a translation cache
                                         
                                         in the hardware called the TLB. And because the stack, yeah, yeah, yeah. You had to figure we
                                         
                                         were going to talk about this at some point. But because the stack is accessed so frequently, it is small, the stack, in terms of number of pages.
                                         
                                         And the pages are very often used.
                                         
                                         So each stack access with this virtual location for where the stack lives will usually hit inside of this cache that translates our virtual address into our physical address. And that would be a TLB hit as opposed to a miss.
                                         
                                         Oh, wow.
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
    
                                         That's genius.
                                         
                                         Genius.
                                         
                                         Yeah.
                                         
                                         It's smart that they do that.
                                         
                                         Those caches, that cache stuff.
                                         
                                         Yeah.
                                         
                                         So yeah, we should do an episode that talks about this more.
                                         
                                         The stack is basically this LRU cache-y amenable kind of structure.
                                         
    
                                         And yeah, there's other benefits to it as well.
                                         
                                         I guess that was kind of a long-winded way of describing how we ultimately arrive at
                                         
                                         TLB hits.
                                         
                                         But the stack is very amenable to them, like you said at the beginning.
                                         
                                         Yeah, cool.
                                         
                                         And there's other advantages to the stack too, right?
                                         
                                         Like registers are really fast to access for the CPU
                                         
                                         and then caches are pretty fast too.
                                         
    
                                         And then you get to memory after that.
                                         
                                         But if you look like the CPU itself,
                                         
                                         more modern CPUs usually have store-to-load forwarding, right?
                                         
                                         So there's another cache in between the registers
                                         
                                         and the caches
                                         
                                         that forward values.
                                         
                                         When you do a store and then you do a load after,
                                         
                                         the load will look up recently stored stuff to say, hey,
                                         
    
                                         did you store that recently?
                                         
                                         Let me just get it from you instead of asking the cache.
                                         
                                         I don't need to bother with the cache.
                                         
                                         And so in a way, stored load forwarding
                                         
                                         is effectively an L0 cache between registers and the L1
                                         
                                         cache. And the stack pretty
                                         
                                         much always sits in there, right? Like most stack XSCs effectively act as an L0 cache. That's kind
                                         
                                         of neat. Yeah, that's really neat. But then, so let's say I'm a CPU in a multi-core die, right?
                                         
    
                                         So I'm one processor and I do a store and I do store to load forwarding, but then some other processor
                                         
                                         stored into that same location. Shouldn't I have seen the store that that other processor did?
                                         
                                         Right, exactly. Right. And it's stuck in the L0 cache, right? It hasn't been written to the cache
                                         
                                         yet. And the way coherency works between those multiple CPUs is at the cache level, not at the store to load,
                                         
                                         the store buffer, basically, right? Because the store buffer just buffers stuff before it goes
                                         
                                         to the cache. I think we should do an episode on that. Yeah, I think we have to do a whole episode
                                         
                                         on how this kind of thing works and how you can think about it, because it seems like once you
                                         
                                         get into this multiple processors world, stuff like store-to-load forwarding can be even more interesting than just like a common optimization.
                                         
    
                                         Yeah, it's great until it's not, right?
                                         
                                         So we'll have to figure that out for the next episode, I guess.
                                         
                                         But the other interesting thing about stack usage is, you know, we said the stack is contiguous and each function has a frame.
                                         
                                         But how a function uses the stack is actually super sparse.
                                         
                                         So the compiler allocates slots for values.
                                         
                                         So it's like, effectively, if you
                                         
                                         have variables in your original code,
                                         
                                         it allocates slots for them.
                                         
    
                                         Without optimizations, that's what it does.
                                         
                                         Every named value in your program effectively has a slot.
                                         
                                         And then the compiler gets in
                                         
                                         and it starts optimizing later if you turn that on.
                                         
                                         And it does what's called coloring, right?
                                         
                                         So it's like a graph coloring problem
                                         
                                         where it says like, you know,
                                         
                                         this part of the function doesn't use those values
                                         
    
                                         at the same time as this other one.
                                         
                                         So they're not live anymore.
                                         
                                         And so it colors them, right?
                                         
                                         It literally like solves the graph column problems.
                                         
                                         And it figures out-
                                         
                                         Probably pretty similar to register allocation,
                                         
                                         I guess in that way, right?
                                         
                                         They're just kind of like another-
                                         
    
                                         Yeah, exactly.
                                         
                                         Another set of resources that's on the stack memory.
                                         
                                         Right.
                                         
                                         And it figures out the ones that are mutually exclusive,
                                         
                                         exactly like register allocation.
                                         
                                         And say it's a diamond, right?
                                         
                                         Two sides of a branch,
                                         
                                         the if branch and else branch, then variables
                                         
    
                                         declared in those blocks are mutually exclusive.
                                         
                                         And so the compiler can reuse those slots.
                                         
                                         But in reality, even if you reuse the slots that way,
                                         
                                         you end up with part of the stack that's often unused,
                                         
                                         depending on what the dynamic behavior of the program is.
                                         
                                         Like say you always go down one branch
                                         
                                         that happens to have more live values than the other one,
                                         
                                         then a bunch of it ends up being dead just for reasons, right?
                                         
    
                                         Just because the compiler didn't know
                                         
                                         which way you're going to go and something like that.
                                         
                                         So it's kind of interesting because it's sparse.
                                         
                                         I guess you could split up.
                                         
                                         You could say, if I go down this path,
                                         
                                         then the stack frame will end up being this large.
                                         
                                         And if I go down that path, the stack frame will end up being this large and if i go down that path the stack frame will end up being that large but then that's probably a lot
                                         
                                         more uh overhead for the compiler to track what all these different paths are and what the stack
                                         
    
                                         implications are and know how to roll them back uh it would be interesting if any compiler is like
                                         
                                         splitting based on stack size behavior i'm not sure that anyone would do that.
                                         
                                         Yeah, I've never really heard of that.
                                         
                                         Like I know compilers do partial outlining, right?
                                         
                                         Where I'll look at cold codes
                                         
                                         and like put it somewhere else, right?
                                         
                                         And force a bunch of spilling from the hot function
                                         
                                         so that if it needs to call the cold code,
                                         
    
                                         it needs to go through the stack and whatever.
                                         
                                         But that, you know, I don't know
                                         
                                         if it ends up reducing the the used
                                         
                                         stack space or not yeah i think it probably will it probably will try to always unify it so that
                                         
                                         at a given program point at a given point in the program the stack depth is always the same right
                                         
                                         that's like pretty nice sanity check or at least if you have control flow when the control flow
                                         
                                         converges that the stack depth is always the same It shouldn't be two different values at any given time.
                                         
                                         Cool.
                                         
    
                                         Yeah.
                                         
                                         Yeah, and I guess the next question is, where does the stack go?
                                         
                                         And not in a philosophical way, but more like,
                                         
                                         do the addresses go up or grow down?
                                         
                                         If function A calls function B, is function B's frame addresses
                                         
                                         bigger than function A's frame addresses. And that just
                                         
                                         really depends on the ABI, right? The application binary interface. So these days it's usually down,
                                         
                                         right? Because what that means is the function earlier in the call stack has a higher address
                                         
    
                                         for its frames than the callees of that function, right? Those have lower addresses.
                                         
                                         And what's interesting is if you look at, say,
                                         
                                         an architecture like ARM, it's traditionally
                                         
                                         both bi-endian and bi-directional in its stack
                                         
                                         growth.
                                         
                                         And that's kind of interesting.
                                         
                                         And direction is interesting in CPUs in general.
                                         
                                         If you look at x86, it has a direction register.
                                         
    
                                         It literally has a bit that changes the direction of certain instruction,
                                         
                                         the rep instructions, the repeat instructions.
                                         
                                         And it has nothing to do with push or pop,
                                         
                                         but it's kind of interesting, right?
                                         
                                         Like direction in a CPU is just this convention
                                         
                                         that we adhere to based on ABI.
                                         
                                         And hardware just kind of deals with it.
                                         
                                         And different implementations do different stuff
                                         
    
                                         when they implement the ABI.
                                         
                                         So that's kind of cool.
                                         
                                         These seemingly arbitrary choices are always fun in computer architecture and computer programming.
                                         
                                         Of like little Indian versus big Indian.
                                         
                                         You know, does the address map have zero at the top or zero at the bottom?
                                         
                                         Yeah, these are always fun.
                                         
                                         And who was saying that like Indian-ness comes from Gulliver's Travel or something?
                                         
                                         Oh, yeah. are always fun and who was saying that like indianess comes from gulliver's travel or something oh yeah so i think if you reread gulliver's travels you'll find that they crack an egg on one side or the other and it led to this giant holy war among the lilliputians and so like gulliver
                                         
    
                                         i guess gets there and he's like why are you fighting this giant war over how you crack an egg
                                         
                                         and i guess we have this giant holy war over
                                         
                                         you know whether the low bytes of a multi-byte word go at the low memory or the high memory so
                                         
                                         we're maybe similarly uh similarly i'm not sure what the charitable word is for it but we fight
                                         
                                         these holy wars over very interesting small issues right right? And actually another thing that I found interesting is
                                         
                                         for endianness,
                                         
                                         if you only had the ability to access a word,
                                         
                                         if you couldn't access a byte within a word
                                         
    
                                         as like a memory access,
                                         
                                         you actually wouldn't know
                                         
                                         what the endianness of the machine is, right?
                                         
                                         Because, oh, when I load it out,
                                         
                                         I always get the same bits out.
                                         
                                         And when I store something in, I'm storing the bits in,
                                         
                                         but I don't know how it's stored actually in the byte storage of the underlying memory.
                                         
                                         I always thought that was kind of funny of like, if you get rid of byte accesses,
                                         
    
                                         then you don't even know what the endianness is. It would fix all religious battles ever
                                         
                                         for computers anyways, right? That's right. Well, that in alignment,
                                         
                                         you would need to fix alignment as well. Everybody would just walk out into the streets and hug each other. Yeah. You
                                         
                                         would need to fix alignment as well. That's right. Cause then you could do, if you could access a
                                         
                                         unaligned look, or if you only had word addresses, that's another possibility, right? If you could
                                         
                                         only load and store words and you only had word addresses. Yeah. Yeah. Word. Word. All right. So,
                                         
                                         okay. We're talking about the stack stack we got to talk about unwinding
                                         
                                         so let's say that you want to you want to grab a backtrace from your program because it crashed
                                         
    
                                         or there's a bug or you just want to know where it is for diagnostic purposes or let's say uh in
                                         
                                         c++ you throw an exception and you need to unwind it then you've got to figure out like where are
                                         
                                         all the frames in c++ you have to to do RAII, destructors,
                                         
                                         things like this.
                                         
                                         You need to figure out if there's exception guards to trigger, exceptions are their own
                                         
                                         whole thing.
                                         
                                         But even just if you're programming normal C, you might often want to capture a backtrace
                                         
                                         to emit a good message on segfault or something, or to send crash diagnostics back to home
                                         
    
                                         base like in the case of browsers or something.
                                         
                                         So for backtrace, conceptually, you just want to walk that linked list-like structure that you described before. So you use the frame pointer to find the next frame previous to you in the stack,
                                         
                                         the older frame. And then you look next to it to find what the code's return address is, right?
                                         
                                         This is what the linking is between the frames and also the correspondence find what the code's return address is, right? This is what the linking is between the
                                         
                                         frames and also the correspondence to what the code is that caused the frame to be on the stack.
                                         
                                         And there's lots of cool libraries people have developed to do this. There's various lib
                                         
                                         unwinds, like there's libunwind, LLVM libunwind, GCC libunwind, a whole bunch of them. And there's
                                         
                                         crash reporting libraries that wrap up unwinding facilities. Iwind, a whole bunch of them. And there's crash reporting libraries that wrap
                                         
    
                                         up unwinding facilities. I think Breakpad is one of them. And that tells you what code was being
                                         
                                         executed when something bad like a crash happened. But this ability to walk the stack structure and
                                         
                                         ask questions about it is super key. Right. And what's interesting is if you think of it as a
                                         
                                         basic unwinder, all you have to do is walk the stack itself.
                                         
                                         You look at the current frame in the stack pointer,
                                         
                                         and that tells you how to tiptoe through the linked list.
                                         
                                         But imagine you alighted frame pointers or did stuff like that.
                                         
                                         Then you really need a side data structure
                                         
    
                                         which tells you what the frame size for each function is.
                                         
                                         But the compiler knows what that is,
                                         
                                         right? So it can obviously emit that data somehow in the program. And then the unwinder needs to
                                         
                                         look at the current function pointer, right? Or the program counter. And that also gets spilled
                                         
                                         to the stack, right? Because when you want to return, you spill the location to return to the
                                         
                                         stack. And then when you pop, you then jump to that location to go back to where you were.
                                         
                                         And so it means that that PC, the program counter,
                                         
                                         can be used to do that lookup in the linked list,
                                         
    
                                         even though you don't know the frame pointer in all cases.
                                         
                                         So that's kind of interesting.
                                         
                                         And it's like a side data structure
                                         
                                         that you have on side of the stack
                                         
                                         that's, you know, read only.
                                         
                                         And it's something that compiler generated
                                         
                                         and says, well, if you're at this address,
                                         
                                         then the size of the function is this.
                                         
    
                                         So the frame side is this
                                         
                                         and this is where you are, right?
                                         
                                         So it's kind of interesting
                                         
                                         because it's less stuff to carry on the stack
                                         
                                         and you only need to look up
                                         
                                         those side data structures
                                         
                                         if you're trying to do unwinding.
                                         
                                         It's the only case you use it.
                                         
    
                                         I guess that goes into like a binary section,
                                         
                                         like the dwarf info, if you're using elf binaries
                                         
                                         on Linux or Linux.
                                         
                                         Yeah, or Peacoff or whatever on Windows.
                                         
                                         And so if you're doing unwinding for exceptions,
                                         
                                         say in C++ or something, then as you were saying, you need to call the destructors or even like the finally blocks if you're in a language like Java.
                                         
                                         So it's not just exceptions.
                                         
                                         It's also like you have finalizers and stuff like that or AI, as you were saying.
                                         
    
                                         And then there's really a few ways to implement that.
                                         
                                         The main one is to have that metadata.
                                         
                                         But in other cases, you might want to do something like set jump long jump and have
                                         
                                         landing pads. And I think Win32 did that in x86 to implement exception handling, which is kind of an
                                         
                                         interesting life's choice there. But yeah, as you unwind the stack, you somehow have to figure out
                                         
                                         where your landing pads are because those are the ones that have the code that do the cleanup right so you don't just you know pop the stack pop the stack pop the stack but for
                                         
                                         exceptions you have to run code right and so that's a list of exceptions uh destructors that
                                         
                                         you need to run or something like that right whatever ends up being in the catch block or
                                         
    
                                         something and then you do that then you unwind the next frame right so you like unwind one
                                         
                                         look up the landing pad do some work then then unwind another one, do that.
                                         
                                         And obviously, if you have an exception that happens during that time, you have to figure it out.
                                         
                                         So that's kind of interesting there.
                                         
                                         And then on Windows, it also mixes SCH, structured exception handling, with C++ exceptions.
                                         
                                         And structured exception handling can do more than just regular C++ exceptions. Like if you have
                                         
                                         faults or something like that, like, or divide by zero for a flowing point or something that you can,
                                         
                                         I think, handle in SEH. And so if I remember properly, if you do like catch dot, dot, dot,
                                         
    
                                         you can catch kind of unrelated stuff in Windows, which is kind of weird.
                                         
                                         Yeah. So like in a C program, so I'm just trying to understand the distinction of seh i don't know too much about seh but i guess you could make
                                         
                                         like a trap handler even in a c program uh that did interesting things using these seh facilities
                                         
                                         that they have yeah okay cool yeah and it's what you'd use signals for in other environments oh i
                                         
                                         see but it's kind of an odd mix. I see. So instead of
                                         
                                         going the signal route then this provides kind of a signal alternative. Very cool. Yeah. Yeah. And
                                         
                                         so maybe could you describe briefly what set jump and long jump are? Right. Like set jump and long
                                         
                                         jump right. Set jump is basically a way to save the current state of your registers and where you
                                         
    
                                         are in the stack and whatever,
                                         
                                         right? So you have a buffer and it's like a set jump buffer and you do set jump and saves it.
                                         
                                         And then when you long jump, you give it that buffer and it puts the state back as where you
                                         
                                         were before, right? So in a way, set jump is a function that can return twice, right? When you
                                         
                                         call set jump, it saves the state. And when eventually you end up
                                         
                                         calling long jump, it resets the states where you were. So set jump comes back twice, right?
                                         
                                         I think the only function that can ever return twice. So it's like the local state inside of a
                                         
                                         CPU, right? So I've sometimes heard it called like a micro context, which has like the registers in
                                         
    
                                         it and this kind of stuff that you could restore to make the local processor state kind of look like it was. Of course, the heap doesn't go back
                                         
                                         to the state that it was in, but like the local processor state, I guess, can go back to where it
                                         
                                         was. And that's maybe kind of similar to what we do when we enter kernel space in a way. We're kind
                                         
                                         of stashing a bunch of the state away so that the kernel can start doing its thing and then you can
                                         
                                         resume. i guess
                                         
                                         it kind of plays into this notion of continuations which we might have to talk about uh that's at
                                         
                                         some future point right right right and yeah i guess what's interesting there also about the
                                         
                                         stack is is you know we're talking about uh frames being of a set size, but there's these functions in C and C++, well, kind of in C++
                                         
    
                                         called allocate, and you also have variable length arrays, right? So an array where you say like n
                                         
                                         is the size that you declare on the stack. And it's kind of interesting, right? Like how are
                                         
                                         they optimized? What's their behavior when it's dynamic and stuff like that? And if you look at
                                         
                                         C++, you don't actually
                                         
                                         have variable length arrays at all.
                                         
                                         It's a common extension for compilers
                                         
                                         to support CVLAs in C++, but they're not officially
                                         
                                         a thing that exists.
                                         
    
                                         So that's kind of neat, being able to change the size
                                         
                                         of the stack at runtime.
                                         
                                         So your function comes in, and you're like, oh, actually I want to have X things
                                         
                                         and you put them there.
                                         
                                         And I've seen a lot of programs
                                         
                                         do like small stack buffer optimizations doing that, right?
                                         
                                         So let's say like, well,
                                         
                                         I want to have a string or a vector,
                                         
    
                                         but only have like say 256 bytes of stuff
                                         
                                         and I'll put it on the stack if that happens.
                                         
                                         And otherwise I'll just heap allocate it, right? is kind of an extension to the language in a way. And so it'll either do allocate
                                         
                                         or malloc, and then it'll do nothing or free when you exit the scope of the stack. So it's kind of
                                         
                                         interesting. And then there's interesting weird corner cases like allocate of zero is, I think,
                                         
                                         technically undefined here.
                                         
                                         And so are VLAs of zero.
                                         
                                         So it's kind of weird, right?
                                         
    
                                         So I've seen compiler optimizations break code
                                         
                                         that had those, or I've written some,
                                         
                                         and they had to fix them because I realized like,
                                         
                                         oh, zero is probably a valid value,
                                         
                                         even though people didn't intend it.
                                         
                                         But like, what if you do allocate of a negative value, right? Like that,
                                         
                                         that's clearly undefined behavior. You can't ungrow the stack. So I, you know, I wonder how
                                         
                                         that's, that's handled in different programs. Right. And what's interesting is like allocate
                                         
    
                                         and VLA's just don't fail. There's not a return code or whatever. It's just like, it happens.
                                         
                                         So yeah, that's, that's kind of neat. Yeah, that's interesting.
                                         
                                         I guess, so is it allocate or aloca?
                                         
                                         Because I think I've been mispronouncing it.
                                         
                                         If you can't, I mean, maybe it's both,
                                         
                                         but I've been calling it aloca.
                                         
                                         I'm French Canadian.
                                         
                                         I call it allocate.
                                         
    
                                         Yeah, allocate sounds good.
                                         
                                         So if I do this allocate and it returns me a pointer to the stack location
                                         
                                         that it was able to acquire.
                                         
                                         So could that be null?
                                         
                                         Well, I don't think it can, right?
                                         
                                         I think it always returns whatever like stack pointer
                                         
                                         or frame pointer plus X that you asked to.
                                         
                                         I don't think it can fail.
                                         
    
                                         Oh, I see.
                                         
                                         So it would have to be slower probably
                                         
                                         if it nominally could fail
                                         
                                         because you'd have to say like check some conditions.
                                         
                                         It's intended to never fail.
                                         
                                         Is that part of the language
                                         
                                         or part of like the open group POSIX standard?
                                         
                                         What level does Allocay live at?
                                         
    
                                         Does C programming language include a notion of Allocay?
                                         
                                         I don't remember, honestly.
                                         
                                         I think Allocay is part of POSIX, but it's also how, at least within LLVM, like everything is implemented as quote unquote allocate.
                                         
                                         And then it uses constant values for a lot of stuff and, you know,
                                         
                                         creates the stack out of multiple allocates and stuff like that.
                                         
                                         So like, if you look at the internal IR for LLVM,
                                         
                                         everything's an allocate that goes on the stack.
                                         
                                         Yeah, it makes sense.
                                         
    
                                         And then it tries to take everything off of the stack and put it into registers
                                         
                                         that it can,
                                         
                                         or combine these different stack allocations into a
                                         
                                         global frame that it can index slots out of and things like this exactly yeah cool well i mean
                                         
                                         it's kind of funny we we mix all this stuff on the stack and sometimes things are variable size
                                         
                                         sometimes you even try to put a fixed size thing on the stack and you think that you're writing
                                         
                                         into it the right way but then things can sometimes go wrong as well.
                                         
                                         So like the OG buffer overrun exploit was to overrun an array that someone
                                         
    
                                         put on the stack.
                                         
                                         They were like,
                                         
                                         Oh,
                                         
                                         this will only be,
                                         
                                         you know,
                                         
                                         eight characters long,
                                         
                                         this,
                                         
                                         this little string here.
                                         
    
                                         And then someone gets you to clobber the return location,
                                         
                                         which also lives on the stack by,
                                         
                                         you know,
                                         
                                         saying,
                                         
                                         Oh,
                                         
                                         you know,
                                         
                                         if I index past character
                                         
                                         seven and I do some writes, and if you didn't check the bounds on that, I could potentially
                                         
    
                                         go find where the return address lives and write some memory location into there.
                                         
                                         And if I can write enough data onto the stack, and if the stack was actually executable,
                                         
                                         if I could jump to it, I could actually jump to the data that I had written.
                                         
                                         So it's always funny to think about how these exploits used to work back in the day. Now the
                                         
                                         stack is never marked as executable. That's like a big security no-no, right? So pages and memory,
                                         
                                         like we talked about virtual memory, they have these execute bits on them that says whether
                                         
                                         you can actually pull instructions into your CPU out of this memory region. And those are marked. That's fairly new, right? Like I think CPUs
                                         
                                         20 years ago couldn't do that. I think it's a 20 year old technology, like the NX bit or whatever.
                                         
    
                                         Like it's pretty new. Yeah. Yeah. I mean, it's a pretty fascinating, all the mitigations and
                                         
                                         things that have been added pretty recently in computer history, I guess. Right.
                                         
                                         Yeah.
                                         
                                         I mean, I remember I originally learned about that buffer overrun exploit in school, and
                                         
                                         I think it was nominally not really possible to happen anymore, especially because compilers
                                         
                                         could automatically insert guards and things when they were in protected mode.
                                         
                                         Yeah.
                                         
                                         So I think that kind of shows like all the stuff mixes together in the stack both the instruction
                                         
    
                                         pointers and the and the data that you potentially want to home there and then if you're working in
                                         
                                         a memory unsafe kind of environment then potentially you can take advantage of the
                                         
                                         fact that these things all mix together things like virtual table pointers and stuff may also
                                         
                                         be in there right yeah yeah it's pretty cool yeah and i i
                                         
                                         guess like there's a lot of extra technologies that came in later not just on the cpu side but
                                         
                                         also just from the compiler right trying to protect stuff like there's this thing called
                                         
                                         stack protector that makes sure that when you create a large frame right so so say you have a
                                         
                                         large fixed size allocation on the stack say like you like 8K or something, then either statically or dynamically,
                                         
    
                                         it'll touch all the pages that you might access.
                                         
                                         So that you don't allocate, say, like an 8K buffer
                                         
                                         and then just access the end, but that's part of guard page.
                                         
                                         So say you have a 4K guard page, if you
                                         
                                         were to access past the stack, so not use any
                                         
                                         of that buffer except the end, then you'd have a stack overrun effectively, like not technically
                                         
                                         your programs accessing valid stack, but the implementation of the abstract machine doesn't
                                         
                                         have that much stack. And so it's actually outside of the stack that the program really has, but not
                                         
    
                                         outside of the stack that your source code contains.
                                         
                                         So that's kind of interesting.
                                         
                                         Yeah, can you tell us more about guard pages?
                                         
                                         I think we might have mentioned them earlier,
                                         
                                         but I don't think we really covered how those guard pages work.
                                         
                                         Right, yeah.
                                         
                                         So I guess when the OS creates a thread,
                                         
                                         it'll map a certain amount of stack, right?
                                         
    
                                         And as we were saying, it could do that lazily
                                         
                                         or it could do that eagerly.
                                         
                                         And so if it, imagine, does that eagerly, right?
                                         
                                         So it'll give you like 16 KB of stack.
                                         
                                         And at the end, it'll put a 4K guard page, right?
                                         
                                         So it's a page that's marked not readable, not writable.
                                         
                                         And the intent is, well, if you go out of stack,
                                         
                                         it'll just fall to there, right?
                                         
    
                                         But you don't map like, you know, four gigabytes of guard pages at the end of your stack.
                                         
                                         You just mark 4K or 8K or something like that.
                                         
                                         And so there have been exploits in the past where programs just jump over the guard page
                                         
                                         and the exploit goes and uses the edge of that code to jump over it and do malicious stuff because, you know,
                                         
                                         the stack is just memory. And so that happened to be, say, another stack or heap memory or
                                         
                                         something like that. And so it ends up clobbering some other valid thing way after the stack over
                                         
                                         the guard page. It brings up this fast, I guess, two fascinating things. One is a struct offset
                                         
                                         technically could be any value, right? It could be like, I have a struct with an array
                                         
    
                                         with four gigabytes of data in the inline array there, right?
                                         
                                         So accessing that offset could take you arbitrarily far
                                         
                                         and you could be like, yeah, I'm placing that thing
                                         
                                         on the stack in an abstract sense.
                                         
                                         So that's kind of a funny idea.
                                         
                                         But then there's also the notion it brings up that,
                                         
                                         you know, nominally the stack kind of
                                         
                                         grows down, maybe if that's what you're used to, and the heat maybe grows up from the bottom of
                                         
    
                                         the address map. And at some point they might meet, and then you might be able to like jump
                                         
                                         from one into the other, you know, because these are just kind of locations in memory. And maybe
                                         
                                         if you have multiple stacks, like you were saying, maybe your stack at some point will collide with
                                         
                                         another stack. So there's always these funny ideas of at the end of the day, we just have this big
                                         
                                         flat address space in some abstract sense on modern machines. And so jumping around inside,
                                         
                                         you have to make sure you know the extents and stuff, which also brings up segment registers
                                         
                                         and things, which I don't know if we should talk about today. But maybe, you know, if we talk about
                                         
                                         knackle or something fun like this in the future, we could talk about today. But maybe, you know, if we talk about Knackle or something
                                         
    
                                         fun like this in the future, we could talk about how segmentation has evolved on architectures
                                         
                                         like x86. Yeah, lots of interesting stuff there. Speaking of x86 evolution, maybe you want to talk
                                         
                                         a little bit about the red zone? The red zone? Yeah, yeah, the red zone sounds cool, right? So it's a cool idea that you can guarantee some space underneath your frame, right?
                                         
                                         So like 120 bytes on x86-64.
                                         
                                         And so when you're a leaf function, you know there's like this extra space that's just there and it's guaranteed to be there.
                                         
                                         So you can avoid doing accounting and just kind of clobber stuff that's related to your frame pointer.
                                         
                                         That's kind of neat, right?
                                         
                                         So it's just part of your ABI, your application binary interface, that extra space just kind
                                         
    
                                         of exists there.
                                         
                                         Yeah, that is super cool.
                                         
                                         Yeah.
                                         
                                         And there's other interesting stuff about the stack related to that.
                                         
                                         Like, for example, alignment is interesting too, right?
                                         
                                         Like, usually you always try to hide in the line stack right
                                         
                                         because having a byte aligned stack is weird when the whole point is to spill registers and your
                                         
                                         registers are like say four bytes or eight bytes or something like that and what's interesting is
                                         
    
                                         i remember when when armed v8 was still being designed they they had this interesting thing
                                         
                                         they come up with at some point where all the stack-specific instructions, right?
                                         
                                         So instructions that were really manipulating the stack, aligning it or something like that, were allowed to fault or had to fault if the stack itself wasn't aligned properly, right?
                                         
                                         So an ARMv8, it just does that.
                                         
                                         And it's expected that any stack manipulation spills at least like two 64-bit registers, right?
                                         
                                         So it's expected that your stack is aligned to that 16-byte boundary all the time.
                                         
                                         And so it's kind of interesting there, right, where instructions aren't just specialized to the stack.
                                         
                                         They have these extra things that try to find incorrect usage or things like that.
                                         
    
                                         It's kind of neat.
                                         
                                         Yeah, I guess the more...
                                         
                                         And confusing. Yeah, the more things you know
                                         
                                         about how your system should be working,
                                         
                                         the easier it would be to detect
                                         
                                         if something is trying to be exploited
                                         
                                         or going off the rails potentially.
                                         
                                         Your invariance about how the system
                                         
    
                                         should be operating gets stronger.
                                         
                                         But then it also prevents people
                                         
                                         from doing super creative,
                                         
                                         if somewhat dangerous in the general case stuff. I guess self-modifying code
                                         
                                         always comes to mind, right? There's this idea of futurist programming where everything is
                                         
                                         self-modifying code, tight inner loops to get all the efficiency of a modern machine.
                                         
                                         So some of those things can become more difficult as we start trying to create this like general process around how things are supposed to work.
                                         
                                         Yeah, I think this idea of like processes interacting with performance, like to make implications for how a machine is used, is pretty generally interesting as like a theme.
                                         
    
                                         So there's also, you know, the classic considerations that you learn in computer organization class around what registers
                                         
                                         are saved by the caller and which ones are saved by the callee. And caller versus callee saved
                                         
                                         conventions can have a big impact on performance because if you save too much in the caller that
                                         
                                         the callee doesn't end up using most of the time, then you're just kind of wasting performance.
                                         
                                         And if the callee has to save everything, but pretty much all the callees want to spill at
                                         
                                         least a couple of different things,, that's also wasteful.
                                         
                                         And also on architectures that used to have by convention purposes for different registers, like ECX is used in very special ways for multiplications or, you know, EDI is used for special things for repeated instructions and things like this. These also played into the constraints
                                         
                                         that would lead you to interesting caller
                                         
    
                                         versus callee saved conventions.
                                         
                                         There's also, what do we save away conventionally
                                         
                                         as part of that micro context
                                         
                                         when we switch from user space to kernel space?
                                         
                                         The kernel, on an interrupt or assist call or what have you
                                         
                                         will probably want to work with some set of resources
                                         
                                         that's not too small,
                                         
                                         but also you don't want it to be too large
                                         
    
                                         because then you're saving a bunch of stuff
                                         
                                         for no reason that the kernel routine
                                         
                                         ends up not needing to use.
                                         
                                         So, you know, for syscalls,
                                         
                                         you have to have like kind of this alternative stack
                                         
                                         in kernel space.
                                         
                                         And in general, the kernel threads
                                         
                                         have like a full separate stack
                                         
    
                                         once they leave interrupts
                                         
                                         to go into like a full kernel thread. So there's like a whole space of conventions around performance in the stack that
                                         
                                         you know create this whole ecosystem that we're working in which is always fascinating right and
                                         
                                         what about managed virtual machines i guess we didn't talk too much about that is there
                                         
                                         cool stuff there like maybe in web assembly assembly land around how stacks work? Right, yeah. In WebAssembly, it's interesting because the return address
                                         
                                         and most of, well, all the code is completely separate from the data, right?
                                         
                                         So WebAssembly is meant to be like a secure VM
                                         
                                         that can run inside a browser.
                                         
    
                                         And so, you know, computers nowadays, most like C and C++ programs,
                                         
                                         they just operate on what's called a von Neumann machine where data and code are just kind of mixed together.
                                         
                                         And WebAssembly really follows what's traditionally called the Harvard architecture, where those are really, really separate.
                                         
                                         And if you're writing a C or C++ program or Rust or whatever, in WebAssembly, you don't know where any of the code is for real.
                                         
                                         And so this stack isn't really a thing in WebAssembly,
                                         
                                         or rather WebAssembly, when you compile C++ to it, really has two stacks, right? It has the call
                                         
                                         stack with some internal state and whatever that might have function pointers and other things like
                                         
                                         that. And then it has whatever state that needs to be reified into your WebAssembly program.
                                         
    
                                         So say if you take the address or something of a variable, say you call printf or whatever else, then it needs to see the values.
                                         
                                         And so there's a separate stack that's in the WebAssembly memory, effectively your heap.
                                         
                                         And that's separate from the call stack. And if WebAssembly is running inside a web browser, it's interleaved with JavaScript.
                                         
                                         There's a JavaScript stack interleaved with a WebAssembly stack. And if WebAssembly is running inside a web browser, it's interleaved with JavaScript. Like, you know, there's a JavaScript stack interleaved with a WebAssembly stack. And
                                         
                                         within the WebAssembly memory, there's a completely separate stack as well for your program that's
                                         
                                         being reified there. And that's kind of part of the ABI there again. And what's extra neat with
                                         
                                         WebAssembly is, like, if you look at the VMm spec itself right the the virtual instruction set architecture
                                         
                                         it doesn't have a finite set of registers it just has infinite registers so functions can just have
                                         
    
                                         infinite registers functions can return an infinite set of values as well right so so
                                         
                                         functions don't just return one thing you can just return you a dozen so it's virtual register and variadic return value arity that's exactly exactly and variadic input in a way
                                         
                                         or or you know functions have to have a set number of inputs but it's not limited to you know 12 or
                                         
                                         something you can give a function 100 values or a values if you want to. How does it do printf? How do you do valists?
                                         
                                         Yeah, so that's why printf and valists, the way you have to, well, not you the programmer,
                                         
                                         but the way you the compiler has to make that actually happen is you have to create a stack
                                         
                                         in user accessible memory. So that's why you have a stack in the WebAssembly memory, so that you can spill to that
                                         
                                         location that's accessible. And then printf, which is itself implemented in user code within the
                                         
    
                                         WebAssembly VM, is able to access the percent i, percent f, or whatever you're trying to print.
                                         
                                         And so yeah, VLists are implemented as stack spills in the WebAssembly memory.
                                         
                                         Cool, cool.
                                         
                                         So it's kind of neat. Yeah.
                                         
                                         I mean, I always find this fascinating.
                                         
                                         I guess in the full web browser,
                                         
                                         now we talked about there's the native stack, there's the JavaScript stack,
                                         
                                         which the JIT can unify with the native stack, right?
                                         
    
                                         And then there's also the WebAssembly stack.
                                         
                                         And then WebAssembly is also being JIT compiled,
                                         
                                         potentially pulling stuff onto the native stack again.
                                         
                                         So, you know, it's funny
                                         
                                         how this whole ecosystem of things, you know, they all have interesting and potentially unifying in
                                         
                                         certain ways treatment of this general notion of stacks. That's like, it's like the glue that binds
                                         
                                         program execution together in a way. I don't know, maybe it's over romanticizing a little bit, but
                                         
                                         stacks are super integral to how we think
                                         
    
                                         about how machines are going to work today. Right. In managed VMs, there's also like cool features
                                         
                                         that certain, you know, when you're managed, you potentially can put more invariance on the program
                                         
                                         because you are managing what the language is doing. That's the idea of a managed virtual
                                         
                                         machine. So in places like the CLR, which is the.NET JIT execution managed environment,
                                         
                                         you actually have a type stack. And so there have been projects like, I think it was called Pidgin
                                         
                                         that was mapping Python onto this type stack. And Python has both a data and a control stack
                                         
                                         per frame. And so, you know, figuring out how to mux the Python execution
                                         
                                         onto the CLR notion of stackiness in this managed world,
                                         
    
                                         it creates all these interesting ideas
                                         
                                         and interactions around what a stack should be,
                                         
                                         how it should operate,
                                         
                                         what invariance we know about it,
                                         
                                         like whether the type of a given slot is always the same,
                                         
                                         or is it just a big bag of bytes?
                                         
                                         Or if you're thinking about a managed language environment,
                                         
                                         like with typed arrays, are the typed arrays words or are they bytes?
                                         
    
                                         You can have all these different variations on how things think about them.
                                         
                                         Yeah, that's cool.
                                         
                                         And there's other interesting stuff.
                                         
                                         Again, in JavaScript, what's interesting, like in C++,
                                         
                                         when you stack overflow, it's not really
                                         
                                         defined when the stack end is and what happens when you run out of stack. And realistically,
                                         
                                         what happens is you'll trap nowadays, right? Because there's often a guard page in most
                                         
                                         implementations. But in JavaScript, it's also not defined when you stack overflow, but it is defined
                                         
    
                                         that if you stack overflow, there's is defined that if you stack overflow,
                                         
                                         there's an exception that's thrown,
                                         
                                         like a JavaScript exception,
                                         
                                         and you can catch it, right?
                                         
                                         And that's kind of neat as a programmer. Now, it's kind of hard to handle.
                                         
                                         And as a VM implementer, it's kind of painful, right?
                                         
                                         Because imagine, like, when you JIT code,
                                         
                                         you're usually pretty careful about the
                                         
    
                                         JIT code not running out of stack. So you'll do a check or something like that. And you'll throw
                                         
                                         an exception if it runs out of stack, kind of like what Stack Protector does. But if you're writing
                                         
                                         runtime functions that are supposed to support the JIT, right? So, you know, you as a VM
                                         
                                         implementer write C++ code that JITs code, but then at some point you're like,
                                         
                                         eh, this math function or whatever,
                                         
                                         or this helper function,
                                         
                                         it's kind of hard to write in jitted assembly,
                                         
                                         so I'm going to just write a C++ function that does that.
                                         
    
                                         I'm going to call it from the jitted code.
                                         
                                         That C++ function itself has a stack.
                                         
                                         So imagine the virtual machine is, you know, calling JavaScript code,
                                         
                                         and eventually it calls this runtime function, and you run out of stack right then and there,
                                         
                                         right? So JavaScript had stack, it did all the checks it needed to, but then the helper doesn't,
                                         
                                         right? How do you check that you're not out of stack, right? That's kind of an interesting thing.
                                         
                                         And it's been the source of, at a minimum,
                                         
                                         bugs within JavaScript virtual machines, if not exploits. And so I've seen fuzzers for JavaScript
                                         
    
                                         that have this cute, like, recurse this function a bunch of times, then call a helper, then
                                         
                                         unrecurse once, call the helper again, unrecurse another one, call the helper again. And it tries
                                         
                                         to kind of mess with the VM to fuzz the VSM itself. So it's kind of neat. I see. So it's actually saying,
                                         
                                         oh, I saw I got a stack overflow here. Let me back up one. Call some native stuff and see what
                                         
                                         happens. That's very, very tricky. Yeah. It's kind of a neat pattern. Yeah. I mean, working on the
                                         
                                         JS engine in Firefox a bit, fuzzers are the best.
                                         
                                         You turn them on when you're ready and you get the deluge of all the interesting things
                                         
                                         that are corner cases in the universe.
                                         
    
                                         They're amazing tools and they're great.
                                         
                                         Yeah, there's really good fuzzers out there for JavaScript.
                                         
                                         They do interesting modifications and sometimes you look at the code and you're like, wow,
                                         
                                         this is evil.
                                         
                                         But yeah, you find good stuff.
                                         
                                         Yeah, the reducers are really the key.
                                         
                                         They're less talked about, but they're kind of the stars of the show of,
                                         
                                         all right, you found some horrible thing.
                                         
    
                                         Now please reduce it for me.
                                         
                                         And then getting like a crisp, you know, this program is only this long
                                         
                                         and yet it shows some key problem is always very amazing to see.
                                         
                                         Yeah, absolutely.
                                         
                                         That was a lot of stuff, my friend.
                                         
                                         Yeah, I've kind of reached my stack limit.
                                         
                                         I feel like there's more stuff we haven't talked about, like rotating register windows
                                         
                                         or return predictors or something like that.
                                         
    
                                         But I think we're near the red zone right now.
                                         
                                         Yeah, yeah.
                                         
                                         I think we should probably do another podcast later because there's basically infinite stuff
                                         
                                         to talk about.
                                         
                                         Right, unlike the stack, this has so much stuff to talk about.
                                         
                                         We should do more TLB hits, everything about TLB hits.
                                         
                                         That's right, everything TLB hit.
                                         
                                         All right, on that note, well, thanks for doing this with me,
                                         
    
                                         and I guess we'll talk to you folks next time.
                                         
                                         If people have feedback for us, write to us on Twitter,
                                         
                                         at TLB hit.
                                         
                                         That's it, yeah, thanks.
                                         
                                         Appreciate it.
                                         
