TLB Hit 💥 - Episode 0: mov fp, sp

Episode Date: November 2, 2020

The stack, and how it relates to TLB Hits. ...

Transcript
Starting point is 00:00:00 Hey, JF, you want to do a super nerdy podcast? Of course, in computer science, the hard problem is naming things. So what should we call it? Oh, hey, Chris. Yeah, that sounds cool. Let's go with something like TLB hit. And we can get one of those sweet.it domain names, TLBH.it for TLB hit. Yeah, that's a good one.
Starting point is 00:00:22 All right. I guess if we name it that, we should probably start by talking about memory subsystem stuff. Yeah, sure. Like TLB hits, what better way to foster TLB hits than the stack? The stack basically always TLB hits. So let's talk about the stack. Yeah. Stacks are awesome. Yeah. We should get like a Twitter handle. I think that's important. And I think @tlbhit is available. So it's not anymore. And if people want to send us feedback, they can hit us up on Twitter. See what I did there? Yeah. Hit. I got it. All right. I guess we should also give a disclaimer that, you know, we only know so much and we'll try not to say things that are wrong. And we'll also try to
Starting point is 00:01:02 follow up with errata, you know, like corrections and clarifications as we find out about things that we said wrong. And we'll put them up on the website. We try to do that lifelong learners thing. And we're always trying to push the boundaries of things we know about and have conversations about what we're learning and what's interesting to develop and hone our understanding. So hopefully everybody's on board with that. Yeah, that's good. That's good. Just like a good CPU will have erratas and side channels, but in this case on Twitter. Okay, so let's get started. A good CPU always has side channels. Yeah.
Starting point is 00:01:48 I know. I know. Yeah, it's for that speculation. Let's get started. What's the stack? So we named this episode mov fp, sp, and you usually find that in the prologue of ARM functions. And in the epilogue, you have the reverse, right? mov sp, fp. And those are special instructions that manipulate the stack. Let's try to figure out what that means. The compiler spills values that registers can't hold onto the stack because it needs to do work. It has like say, you know, 16 or 32 registers to hold values and it just, you know, it has too much work to do. It'll put stuff on the stack and then load it back later. Functions do that a lot because they have their own state and sub functions they call have their own state. Since subroutines can recurse without bounds, you potentially just would need an infinite set of registers if you didn't do that. And there's ALU registers for arithmetic,
Starting point is 00:02:30 and then there are FPU registers for floating point. Those registers contain stuff, right? It has like pointers to data, pointers to code, return addresses, stuff like that. And it's usually contiguous memory. The stack itself is contiguous. It's usually allocated on a per thread basis. So that's a high level where the stack is. Is that right? Yeah. So I guess, so ALU and FPU registers.
Starting point is 00:02:55 So sometimes people talk about these GPRs, right? General purpose registers that could hold either one as like scalar values. But then on some of the processors we tend to work with, you can have these dedicated registers for floating point values as well, right? So they can kind of split the data paths. Or SIMD. Or SIMD, yeah, for really wide ones, right? And then I guess later we'll talk about calling conventions and stuff too. So one question I had was, so I think you said the prolog, we're moving the current stack pointer
Starting point is 00:03:29 to the base pointer, right? And then on the epilogue, we're moving what was the base pointer back into the stack pointer. So we're kind of undoing the thing that we did. So we locally kind of manipulate the stack pointer and then we kind of undo everything, kind of roll it back to where it previously was
Starting point is 00:03:46 when we're done. Right, right. Okay, cool. So then I guess there's a question of like, what mechanisms actually do this stack stuff in the processor? And so, like you said, there's this frame pointer, BP or base pointer, and then there's the stack pointer SP.
Starting point is 00:04:03 So on x86, there's RBP and RSP. And x86 has either the R prefix for the 64-bit mode or the E prefix for the 32-bit mode. And then on ARM, you have FP and SP registers. Right, right, right. The usual convention is the frame pointer doesn't change during a function's execution. The frame pointer points at the function's frame, its scratch pad for stuff, like the place where it puts data when it needs to spill it. And the generated code addresses slots on the stack
Starting point is 00:04:34 based on offsets from the frame pointer. So from the frame pointer, you say like plus 4, plus 8, or something like that. Those are slots off of the frame pointer for that function. And when that function calls another function, there's a code sequence which sets the frame pointer to the previous stack pointer. And when a function returns, the inverse is done.
Starting point is 00:04:54 So in that sense, the stack is kind of like a linked list. Right, because it's putting this pointer on the stack that basically says this is where the frame pointer used to be before I came into my routine. Exactly. Cool. Yeah, I guess you can also compare this kind of operation to this notion of a stack machine. So if you, you know, in CS class or whatever, learn about machines where you can kind of push two things onto a stack and then do an add operation that'll consume the top two things on the stack, you could compare
Starting point is 00:05:25 that to what's happening in a traditional processor that we use today, where it's kind of expanding the stack as a single operation. It's making a bunch of slots by making the stack pointer move down. And then it's kind of playing with those slots relative to the frame pointer or base pointer, whatever you want to call it, right? playing with those slots relative to the frame point or base point or whatever you want to call it. Right. So, so it doesn't have to strictly do everything in stack order for pushing and popping, consuming these values. Right. And so then there's kind of this distinction of a stack machine versus kind of scratch pad area style that happens in a stack like fashion for subroutine calls. Right? And there's things that matter also like
Starting point is 00:06:07 whether instruction cache pressure plays a role and these kinds of things where push instructions could be really short on x86, like a single byte is the push instruction. So if you're frequently doing these operations, manipulating the stack, if it's very common for programs to do that, we might even optimize our instruction sets to try to make these operations extremely fast. Or maybe on ARM, you'll have like a push multiple
Starting point is 00:06:30 values instruction, even though that's a little bit sisky. You know, that's something that you do so commonly, it might actually make some sense, right? Yeah, in ARM v7, it had an instruction that allowed you to push like 16 registers. So all the general purpose registers, you could just push all of them if you wanted to, and increment the stack pointer it was kind of magical push all in just like poker yeah exactly it was like a bit mask and you say here's the ones i want to push and it just pushed whatever but it's risk right it's kind of cool right exactly it's reduced yeah it's a reduced set of registers compared to other sets of registers you could use. Cool. Okay. So I guess there's also a bunch of compiler optimizations that are related to stackiness, right? So by moving things
Starting point is 00:07:13 onto the stack, the stack is often a very hot region because the code is constantly manipulating within your given subroutine what it's working with in the stack. So moving things into the stack can be more efficient than heap allocating, both because of locality, but also because then you can avoid potentially costly memory allocation subroutines like calling into malloc, which could be, you know, hundreds or thousands of cycles, depending. And so when they're in that scratchpad area, you know, that has this hot locality, the values are also tracked precisely in a data flow sort of style. So if you kind of say, oh, these are random heap pointers and they might
Starting point is 00:07:49 alias as you bring them kind of closer in to be more known to the computation in the compiler, the values might become more trackable kind of as, you know, if you're familiar with SSA at all, SSA values instead of just being arbitrary memory references. So then when structs end up making it onto the stack, then the component fields inside of the struct can actually be broken apart and tracked as individual values that are manipulated and stored into the stack.
Starting point is 00:08:17 So this is called a scalar replacement of aggregates, which is a super cool optimization pass where you break the struct apart and track the individual fields inside of it. And so when we home those on the stack, we can reason about those fields like values that we can, you know, common sub-expression eliminate or dead code eliminate and things like this. Whereas if they were on the heap, it might be a lot harder to do that. And so then the moral equivalent in, you know, managed languages. Like I sometimes like to talk about JavaScript or languages like Java. You do like escape analysis to make sure that the object doesn't actually
Starting point is 00:08:51 leave from the heap into some other subroutine. And then once it's put on the stack, then you can actually eliminate whole objects and just track the subfields inside of it. So there's a whole area of compiler optimizations that's also related to the stack. So that's pretty fun. Yeah, that's really cool because it allows you to just explode the object itself and think about its component fields individually and get rid of whatever doesn't matter in there. So that's really sweet.
Starting point is 00:09:16 And it turns out that some compilers can also sometimes optimize heap allocations. If it sees like it's a really small one and you allocated the startup function and then you delete at the end, then it can like just stack allocated sometimes. And C++ explicitly allows you to do that as of somewhat recently, a few years ago. And so client does that. So if you knew an object, it's not guaranteed that you're actually going to put that object on the heap that it's going to call the underlying allocator, right?
Starting point is 00:09:41 Right, right, right. And it's surprising to some people, like they see their allocation disappear and they're like, wait, I called my allocator. What happened? Well, it went on the stack instead. So that's kind of cool. And then it does SROA and other stuff and might get rid of the entire computation. So that's kind of neat unless that's not what you're trying to do. Yeah, it seems like a really key optimization to do. You know, if the heap is really part of your programming language environment, like if you're thinking about things as just objects instead of as like raw bytes and exactly where they go, having that higher level understanding that you can optimize based off of is pretty key, I think. Right. Right. And I remember like when I started programming, there's this thing called frame pointer emission that was really cool at the time because optimizers weren't
Starting point is 00:10:25 as good as they were now. And in some cases, like in leaf functions or when you enable frame pointer emission in the compiler, so you don't need the frame pointer anymore and you can just use the stack pointer, right? So it frees up an extra register for the compiler to use. So that's kind of a neat optimization too. Yeah. Back when you only had eight registers, I guess, for x86 before x86-64, that extra register could go a long way potentially because the stack, even though it's hot in cache, it's doing stores and loads to memory locations, right? Right, exactly. Yeah, I think I used to know of FPO as that flag that makes the debugger way worse, right?
Starting point is 00:11:03 Because, you know, without the ability to know how things refer to the base location, then the debug information has to be a lot more prescriptive about how you go and grab the value at any given point inside of a function, right? Right. Debugging used to be way worse too. And I think nowadays people don't really eliminate front pointers. And so debuggers don't really have to deal with it, but they've gotten better from what I can see anyways. Yeah, I guess because modern CPUs are doing registry renaming and cool stuff under the hood, they can actually have like a micro architecturally larger register set. And you're not worried so much about saving that one register
Starting point is 00:11:38 a lot more of the time, although in hot code, you still might. Yeah. Yeah. I guess another interesting thing is like, there's this idea of leaf functions, right? Like you talked about. And so when you inline things, which inlining is kind of the key compiler optimization, as we know, right, it makes these big regions for compilers to do their kind of local flow analysis and optimizations on. And so we make these big fat leaf functions. And I actually wonder, I don't know the answer to this, but how much of program time is generally spent in leaf functions in the course of, you know, some set of applications? It would be interesting to
Starting point is 00:12:16 know. Yeah, I don't actually know. I guess we never defined what a leaf function is, right? So it's like the function at the end of your call tree. Yeah. So if your subroutine doesn't call any other subroutines, that's kind of a nice property, right? Because now you know that basically everything at the end of this stack belongs to you and you're just going to pop back up to whoever called you. Exactly. And inlining really unlocks that, right? Because you start inlining into, you know, not leaf functions and they become the leaf now. That ends up unlocking a lot of stuff, right? As long as your working set doesn't become too big, right? So that function doesn't end up doing everything. Then it's kind of nicely segmented.
Starting point is 00:12:55 The compiler can know everything it does and it has a good amount of work to do. So that's kind of neat. It's kind of a small region in which you can analyze everything. It's like tiny little whole program analysis. Right. But then like, why do we even have a stack in that case? Right. So like, imagine you inlined everything, couldn't you just do that? Right. And I, I used to think that and I, I've realized there's really two main issues to that. One of them is you don't know what the call graph is for the entire program and recursion, right? And so the first one is like, if you knew where all the calls go, right? So all the virtual calls, indirect calls and other stuff in your translation unit, as well as the other ones in
Starting point is 00:13:39 your program. And if you had no recursion, then it kind of seems like you wouldn't need a stack, right? Because you know a perfect call graph. And even with some of these, you could still avoid having a stack, right? Like if you have virtual functions, but there's like, you know, only ever one or two implementations of it, then you could reify those virtual calls as two like test and branch to the indirect thing. So you could just have no stack at all ever in a program if you had all that information, right? Do you know if anyone does that?
Starting point is 00:14:11 Yeah, so I guess, so first let me just ask about the switch thing. So, or sorry, with the virtual. So if you have a fully analyzable virtual dispatch, effectively it just becomes a switch, right? Because you're like, which one of the targets is it? And then you can basically inline what the potential targets are. That's, that's cool. That control flow analysis that says like, oh, this is an indirect branch. So hypothetically, it could go anywhere. But you know, with things like within a
Starting point is 00:14:38 de-virtualization within inside a translation unit, you could actually enumerate like the full set of possibilities, right? That's always pretty cool. Yeah, so this fully analyzable call graph thing is an interesting computer history property, I'd say. So Fortran 77, I think, was classically able to do this or the programs were restricted enough that you could analyze it. And also the compiler that I worked on,
Starting point is 00:15:04 the XLA machine learning array programming compiler has the same property where the whole call graph is analyzable. So you can actually just create a slab that's basically the stack frame for this whole program you're optimizing. And all of the allocations are known fixed size. So yeah, there's some of that lives on even today
Starting point is 00:15:23 in some more niche use cases of this whole program analyzability, not needing to have strict call frames. So JavaScript engines too, I think they would sometimes recur from their call frames through the virtual machine, and then it would have to create like another stack or sub stack or things like this these are kind of interesting multi-stack problems as well beyond just being able to analyze a single stack yeah definitely that's really cool and i i remember programming in fortran and that was kind of cool because like you as a programmer you're trying to solve a very specific you know physics problem or whatever and you don't usually need any of those tools like recursion or virtual
Starting point is 00:16:07 functions or anything like that when you're like deep in your code, right? And I remember it's kind of a neat programming model. Like I actually non-ironically like Fortran for that. It's kind of neat to program in. Yeah, when everything is kind of monomorphized, right? Like when you have big arrays of fixed value types, then you kind of can know everything about the world and really optimize everything based off of it. That was, that's always kind of a fun mode to be in for scientific computing code. Yeah, exactly. And so like, are those the only issues like recursion and direct calls, right? Because there's some languages that use the stack for say fast thread switching, right?
Starting point is 00:16:44 So as well as like something like full stackful coroutines or something. So if you look at stacks in Go, they're not contiguous. They're more like a C++ deck or whatever, like a linked list of lists, instead of just being one contiguous stack. And they're linked to each other. It has this really cute, clever x86 code sequence,
Starting point is 00:17:04 which makes it fast to find the previous and next frame. So it allows the ghost stacks to be just kind of distinct allocations, right? Like each page-wise or something is one frame, and then the next function has another frame. You can put multiple functions in one frame, so that's kind of neat. Like I remember it can,
Starting point is 00:17:24 or at least it used to be able to have really bad perf if you were in a hot loop and you happen to straddle that boundary but otherwise it was kind of neat right it did some cool stuff and i think coroutines in some language ends up having some stackless stuff like this right when their closure is heap allocated instead so i know like the c++ um the c++ coroutines they try to do away with all of the heap allocation, but it really depends on your optimization level whether they can do that or not. And it's kind of similar for Objective-C blocks. I think by default until pretty recently, Objective-C blocks were just always heap allocated,
Starting point is 00:17:57 and they started being stack allocated in the last few years when they could. So that's kind of neat, too, where the language doesn't really say if stuff lives on the stack or the heap, it just kind of lives somewhere. And it's kind of neat. And because the stack is less constrained, it can kind of live in different places, especially in Go, right?
Starting point is 00:18:16 And then in some cases, like, you know, you remove the allocation entirely, right? So it's kind of neat. And conceptually, if you want to be able to scale your assumptions about concurrency to millions of threads or something like that, you can see how you wouldn't want to have huge stacks, right? Because each thread has a stack. And if you have millions of threads, then you don't want to be allocating all those stacks, right? So you want to be able to have really small stacks and switch between those threads pretty quickly. And so, you
Starting point is 00:18:43 know, that kind of asks the question, question, how do you usually size those stacks in the per-thread context that you have? Yeah, because if you're doing just tiny little operations, like if every single operation in your program was conceptually a thread, you wouldn't want to allocate 512 KB every single time you did a tiny little atomic operation. Yeah, it's a really interesting
Starting point is 00:19:05 point. Yeah. So actually the, the term stackless is also funny because I think Python's greenlets. So greenlets is like lightweight thread terminology, right? I think it's actually called stackless Python. One of the attempts that was made to make greenlit oriented Python and in managed languages like Python, all the frames are actually allocated on the heap, like even in CPython. The frames are allocated on the heap, though you still have your native stack for your native program
Starting point is 00:19:34 that's doing like the managed VM execution. So something that JITs do is instead of the interpreter kind of running off the native stack and then keeping a different stack for the interpreter, like that the interpreter manages, they can unify together the native stack and the virtual machine's stack space so that it puts it all back into like a native thread worth of stack space.
Starting point is 00:20:00 So a frame is really just some space and spaces get linked together, you know, in this abstract concept. And both of those things can also happen on the heap. There's no reason that, you know, this one virtual memory region that we call the stack is the place that all that operation has to happen. Right, right. And like, you know, we're talking about stack size, I imagine for GPUs, which have a bunch of warps, like you end up not wanting to have huge stacks. But typically on a CPU, I guess the stack is usually between, say, like 512 KB and 2 megs on common systems, like Mac OS, Linux, and Windows.
Starting point is 00:20:36 And usually they allocate at thread startup. So it's part of, say, POSIX or Win32, what the default size is. And then most of the platforms have an API that allow you to change the size. And it's kind of weird, because some of them can only do that at runtime, and some of them can only do it at startup.
Starting point is 00:20:55 So being able to change the stack size is kind of this uneven thing. And then that's like in user space, whereas if you compare the kernel or embedded systems, usually the stack size is way smaller, right? I remember for XNU it was 16 KB, which is super tiny. And blowing the stack is kind of easy in some cases when you're handling interrupts and other stuff,
Starting point is 00:21:17 and then some stuff happens. You end up blowing up the stack pretty easily, so you've got to be careful there. The reverse is interesting. I wonder what's the biggest stack programs that exist? Like, yeah, that's a good try to think like who, who does that, right? There's probably people who ginormous stacks, but I couldn't figure out what the biggest one would be. Yeah, that's a good question. I know I've worked in some environments that set the stack limit quite low. And then if you compile
Starting point is 00:21:45 in debug mode, debug mode is often spilling a bunch of stuff to the stack or homing it on the stack. It's not as aggressive. And then sometimes you'll get a random segmentation fault and it's because the frame got bigger than the initial stack allocation was assuming, or maybe you hopped over the guard page, which maybe we can talk about in a little bit. Yeah, I guess, but you know, when you're allocating this default start size for the stack, it brings up this point that I guess you have this default start size, but it doesn't all have to be mapped in at the start. You can potentially grow this region that you're calling the stack on the fly
Starting point is 00:22:19 as you see that it needs more space. Couldn't you do that? Yeah, totally. Yeah, so I guess there's interesting mechanisms to potentially do that, both at the operating system level and for managed languages, you could potentially even like relocate
Starting point is 00:22:33 where your stack segment is. Right, like just allocate a page and then just virtually map the other ones, fault when they get access and then map them in as as needed instead right yeah that's a potential way or i guess because you know where all the pointers are you could also relocate all the pointers that were anything to do with the stack locations yeah right right only in managed languages yeah so i guess something that we we were referring to earlier that maybe we can
Starting point is 00:23:01 provide some more context on is the the stack works nicely with this least recently used cache that modern processors have. So younger frames are recently touched, right? So there's spatial and temporal locality are two of the things that caches exploit, right? When I touch something, I might touch something right next to it. When I touch something, I might actually come back and touch it again very quickly, you know, very quickly, close in number of cycles or time. And so the last in first out nature of a stack and the predictable access pattern, right, where we kind of like go down in addresses and then we pop back up if the stack is growing down, that lines up with CPU features like LRU caches and address predictors, right, that are trying
Starting point is 00:23:42 to prefetch things that you're going to be pulling in. Right, right. I guess that means that a function's frame is usually in the cache in memory as recently used. And then it kind of brings up this broader question of how does memory on a computer work at all, right? There's physical addresses. So like when you're talking to a stick of DRAM, there's like some values that are going over the wires that are talking to the stick of DRAM. And those are the physical addresses, right? And then what the computer sees is this notion of virtual addresses.
Starting point is 00:24:14 And virtual memory was introduced originally so that we could share the physical memory that was on a system more easily in shared environments and things like this. But it also offers facilities like protection and things like this, but it also offers facilities like protection and things like this. But virtual memory really means, okay, so I have some address.
Starting point is 00:24:30 And let's say most modern systems, I think have like effectively 47 bit addresses, right? And that's a value that's really an alias. It turns into a real physical location and the alias is potentially unique to each program. So two different programs hypothetically could have these 47-bit values that if the mapping said, oh, they turn into the same physical location, that's actually possible on a modern system, right? Right. And so each one of these memory access instructions, like a load or a store,
Starting point is 00:25:05 needs to translate that 47-bit virtual address into a corresponding physical address. And maybe you have, or you very likely have less than 47 bits worth of memory in your system, right? So it actually might be- I find each access too, right? Yeah. Like I might have only four gigs of memory, only 32 bits of physical space. But, you know, I can pretend that the space is larger and like page stuff out to disk. I can take things and that would be backed by physical memory and store them on disk. This is kind of the magic of a virtual memory subsystem, right? And so each one of these memory accesses where I take my virtual address and turn it into a physical address has to translate that virtual thing into the physical thing. And that can be expensive because the
Starting point is 00:25:49 operating system has to maintain a big table of how that happens. But there's a translation cache in the hardware called the TLB. And because the stack, yeah, yeah, yeah. You had to figure we were going to talk about this at some point. But because the stack is accessed so frequently, it is small, the stack, in terms of number of pages. And the pages are very often used. So each stack access with this virtual location for where the stack lives will usually hit inside of this cache that translates our virtual address into our physical address. And that would be a TLB hit as opposed to a miss. Oh, wow. Yeah. Yeah.
Starting point is 00:26:33 That's genius. Genius. Yeah. It's smart that they do that. Those caches, that cache stuff. Yeah. So yeah, we should do an episode that talks about this more. The stack is basically this LRU cache-y amenable kind of structure.
Starting point is 00:26:48 And yeah, there's other benefits to it as well. I guess that was kind of a long-winded way of describing how we ultimately arrive at TLB hits. But the stack is very amenable to them, like you said at the beginning. Yeah, cool. And there's other advantages to the stack too, right? Like registers are really fast to access for the CPU and then caches are pretty fast too.
Starting point is 00:27:09 And then you get to memory after that. But if you look like the CPU itself, more modern CPUs usually have store-to-load forwarding, right? So there's another cache in between the registers and the caches that forward values. When you do a store and then you do a load after, the load will look up recently stored stuff to say, hey,
Starting point is 00:27:33 did you store that recently? Let me just get it from you instead of asking the cache. I don't need to bother with the cache. And so in a way, stored load forwarding is effectively an L0 cache between registers and the L1 cache. And the stack pretty much always sits in there, right? Like most stack XSCs effectively act as an L0 cache. That's kind of neat. Yeah, that's really neat. But then, so let's say I'm a CPU in a multi-core die, right?
Starting point is 00:27:59 So I'm one processor and I do a store and I do store to load forwarding, but then some other processor stored into that same location. Shouldn't I have seen the store that that other processor did? Right, exactly. Right. And it's stuck in the L0 cache, right? It hasn't been written to the cache yet. And the way coherency works between those multiple CPUs is at the cache level, not at the store to load, the store buffer, basically, right? Because the store buffer just buffers stuff before it goes to the cache. I think we should do an episode on that. Yeah, I think we have to do a whole episode on how this kind of thing works and how you can think about it, because it seems like once you get into this multiple processors world, stuff like store-to-load forwarding can be even more interesting than just like a common optimization.
Starting point is 00:28:50 Yeah, it's great until it's not, right? So we'll have to figure that out for the next episode, I guess. But the other interesting thing about stack usage is, you know, we said the stack is contiguous and each function has a frame. But how a function uses the stack is actually super sparse. So the compiler allocates slots for values. So it's like, effectively, if you have variables in your original code, it allocates slots for them.
Starting point is 00:29:20 Without optimizations, that's what it does. Every named value in your program effectively has a slot. And then the compiler gets in and it starts optimizing later if you turn that on. And it does what's called coloring, right? So it's like a graph coloring problem where it says like, you know, this part of the function doesn't use those values
Starting point is 00:29:38 at the same time as this other one. So they're not live anymore. And so it colors them, right? It literally like solves the graph column problems. And it figures out- Probably pretty similar to register allocation, I guess in that way, right? They're just kind of like another-
Starting point is 00:29:52 Yeah, exactly. Another set of resources that's on the stack memory. Right. And it figures out the ones that are mutually exclusive, exactly like register allocation. And say it's a diamond, right? Two sides of a branch, the if branch and else branch, then variables
Starting point is 00:30:07 declared in those blocks are mutually exclusive. And so the compiler can reuse those slots. But in reality, even if you reuse the slots that way, you end up with part of the stack that's often unused, depending on what the dynamic behavior of the program is. Like say you always go down one branch that happens to have more live values than the other one, then a bunch of it ends up being dead just for reasons, right?
Starting point is 00:30:33 Just because the compiler didn't know which way you're going to go and something like that. So it's kind of interesting because it's sparse. I guess you could split up. You could say, if I go down this path, then the stack frame will end up being this large. And if I go down that path, the stack frame will end up being this large and if i go down that path the stack frame will end up being that large but then that's probably a lot more uh overhead for the compiler to track what all these different paths are and what the stack
Starting point is 00:30:55 implications are and know how to roll them back uh it would be interesting if any compiler is like splitting based on stack size behavior i'm not sure that anyone would do that. Yeah, I've never really heard of that. Like I know compilers do partial outlining, right? Where I'll look at cold codes and like put it somewhere else, right? And force a bunch of spilling from the hot function so that if it needs to call the cold code,
Starting point is 00:31:19 it needs to go through the stack and whatever. But that, you know, I don't know if it ends up reducing the the used stack space or not yeah i think it probably will it probably will try to always unify it so that at a given program point at a given point in the program the stack depth is always the same right that's like pretty nice sanity check or at least if you have control flow when the control flow converges that the stack depth is always the same It shouldn't be two different values at any given time. Cool.
Starting point is 00:31:46 Yeah. Yeah, and I guess the next question is, where does the stack go? And not in a philosophical way, but more like, do the addresses go up or grow down? If function A calls function B, is function B's frame addresses bigger than function A's frame addresses. And that just really depends on the ABI, right? The application binary interface. So these days it's usually down, right? Because what that means is the function earlier in the call stack has a higher address
Starting point is 00:32:19 for its frames than the callees of that function, right? Those have lower addresses. And what's interesting is if you look at, say, an architecture like ARM, it's traditionally both bi-endian and bi-directional in its stack growth. And that's kind of interesting. And direction is interesting in CPUs in general. If you look at x86, it has a direction register.
Starting point is 00:32:42 It literally has a bit that changes the direction of certain instruction, the rep instructions, the repeat instructions. And it has nothing to do with push or pop, but it's kind of interesting, right? Like direction in a CPU is just this convention that we adhere to based on ABI. And hardware just kind of deals with it. And different implementations do different stuff
Starting point is 00:33:03 when they implement the ABI. So that's kind of cool. These seemingly arbitrary choices are always fun in computer architecture and computer programming. Of like little Indian versus big Indian. You know, does the address map have zero at the top or zero at the bottom? Yeah, these are always fun. And who was saying that like Indian-ness comes from Gulliver's Travel or something? Oh, yeah. are always fun and who was saying that like indianess comes from gulliver's travel or something oh yeah so i think if you reread gulliver's travels you'll find that they crack an egg on one side or the other and it led to this giant holy war among the lilliputians and so like gulliver
Starting point is 00:33:37 i guess gets there and he's like why are you fighting this giant war over how you crack an egg and i guess we have this giant holy war over you know whether the low bytes of a multi-byte word go at the low memory or the high memory so we're maybe similarly uh similarly i'm not sure what the charitable word is for it but we fight these holy wars over very interesting small issues right right? And actually another thing that I found interesting is for endianness, if you only had the ability to access a word, if you couldn't access a byte within a word
Starting point is 00:34:13 as like a memory access, you actually wouldn't know what the endianness of the machine is, right? Because, oh, when I load it out, I always get the same bits out. And when I store something in, I'm storing the bits in, but I don't know how it's stored actually in the byte storage of the underlying memory. I always thought that was kind of funny of like, if you get rid of byte accesses,
Starting point is 00:34:33 then you don't even know what the endianness is. It would fix all religious battles ever for computers anyways, right? That's right. Well, that in alignment, you would need to fix alignment as well. Everybody would just walk out into the streets and hug each other. Yeah. You would need to fix alignment as well. That's right. Cause then you could do, if you could access a unaligned look, or if you only had word addresses, that's another possibility, right? If you could only load and store words and you only had word addresses. Yeah. Yeah. Word. Word. All right. So, okay. We're talking about the stack stack we got to talk about unwinding so let's say that you want to you want to grab a backtrace from your program because it crashed
Starting point is 00:35:10 or there's a bug or you just want to know where it is for diagnostic purposes or let's say uh in c++ you throw an exception and you need to unwind it then you've got to figure out like where are all the frames in c++ you have to to do RAII, destructors, things like this. You need to figure out if there's exception guards to trigger, exceptions are their own whole thing. But even just if you're programming normal C, you might often want to capture a backtrace to emit a good message on segfault or something, or to send crash diagnostics back to home
Starting point is 00:35:42 base like in the case of browsers or something. So for backtrace, conceptually, you just want to walk that linked list-like structure that you described before. So you use the frame pointer to find the next frame previous to you in the stack, the older frame. And then you look next to it to find what the code's return address is, right? This is what the linking is between the frames and also the correspondence find what the code's return address is, right? This is what the linking is between the frames and also the correspondence to what the code is that caused the frame to be on the stack. And there's lots of cool libraries people have developed to do this. There's various lib unwinds, like there's libunwind, LLVM libunwind, GCC libunwind, a whole bunch of them. And there's crash reporting libraries that wrap up unwinding facilities. Iwind, a whole bunch of them. And there's crash reporting libraries that wrap
Starting point is 00:36:25 up unwinding facilities. I think Breakpad is one of them. And that tells you what code was being executed when something bad like a crash happened. But this ability to walk the stack structure and ask questions about it is super key. Right. And what's interesting is if you think of it as a basic unwinder, all you have to do is walk the stack itself. You look at the current frame in the stack pointer, and that tells you how to tiptoe through the linked list. But imagine you alighted frame pointers or did stuff like that. Then you really need a side data structure
Starting point is 00:36:59 which tells you what the frame size for each function is. But the compiler knows what that is, right? So it can obviously emit that data somehow in the program. And then the unwinder needs to look at the current function pointer, right? Or the program counter. And that also gets spilled to the stack, right? Because when you want to return, you spill the location to return to the stack. And then when you pop, you then jump to that location to go back to where you were. And so it means that that PC, the program counter, can be used to do that lookup in the linked list,
Starting point is 00:37:38 even though you don't know the frame pointer in all cases. So that's kind of interesting. And it's like a side data structure that you have on side of the stack that's, you know, read only. And it's something that compiler generated and says, well, if you're at this address, then the size of the function is this.
Starting point is 00:37:55 So the frame side is this and this is where you are, right? So it's kind of interesting because it's less stuff to carry on the stack and you only need to look up those side data structures if you're trying to do unwinding. It's the only case you use it.
Starting point is 00:38:07 I guess that goes into like a binary section, like the dwarf info, if you're using elf binaries on Linux or Linux. Yeah, or Peacoff or whatever on Windows. And so if you're doing unwinding for exceptions, say in C++ or something, then as you were saying, you need to call the destructors or even like the finally blocks if you're in a language like Java. So it's not just exceptions. It's also like you have finalizers and stuff like that or AI, as you were saying.
Starting point is 00:38:35 And then there's really a few ways to implement that. The main one is to have that metadata. But in other cases, you might want to do something like set jump long jump and have landing pads. And I think Win32 did that in x86 to implement exception handling, which is kind of an interesting life's choice there. But yeah, as you unwind the stack, you somehow have to figure out where your landing pads are because those are the ones that have the code that do the cleanup right so you don't just you know pop the stack pop the stack pop the stack but for exceptions you have to run code right and so that's a list of exceptions uh destructors that you need to run or something like that right whatever ends up being in the catch block or
Starting point is 00:39:16 something and then you do that then you unwind the next frame right so you like unwind one look up the landing pad do some work then then unwind another one, do that. And obviously, if you have an exception that happens during that time, you have to figure it out. So that's kind of interesting there. And then on Windows, it also mixes SCH, structured exception handling, with C++ exceptions. And structured exception handling can do more than just regular C++ exceptions. Like if you have faults or something like that, like, or divide by zero for a flowing point or something that you can, I think, handle in SEH. And so if I remember properly, if you do like catch dot, dot, dot,
Starting point is 00:39:56 you can catch kind of unrelated stuff in Windows, which is kind of weird. Yeah. So like in a C program, so I'm just trying to understand the distinction of seh i don't know too much about seh but i guess you could make like a trap handler even in a c program uh that did interesting things using these seh facilities that they have yeah okay cool yeah and it's what you'd use signals for in other environments oh i see but it's kind of an odd mix. I see. So instead of going the signal route then this provides kind of a signal alternative. Very cool. Yeah. Yeah. And so maybe could you describe briefly what set jump and long jump are? Right. Like set jump and long jump right. Set jump is basically a way to save the current state of your registers and where you
Starting point is 00:40:44 are in the stack and whatever, right? So you have a buffer and it's like a set jump buffer and you do set jump and saves it. And then when you long jump, you give it that buffer and it puts the state back as where you were before, right? So in a way, set jump is a function that can return twice, right? When you call set jump, it saves the state. And when eventually you end up calling long jump, it resets the states where you were. So set jump comes back twice, right? I think the only function that can ever return twice. So it's like the local state inside of a CPU, right? So I've sometimes heard it called like a micro context, which has like the registers in
Starting point is 00:41:22 it and this kind of stuff that you could restore to make the local processor state kind of look like it was. Of course, the heap doesn't go back to the state that it was in, but like the local processor state, I guess, can go back to where it was. And that's maybe kind of similar to what we do when we enter kernel space in a way. We're kind of stashing a bunch of the state away so that the kernel can start doing its thing and then you can resume. i guess it kind of plays into this notion of continuations which we might have to talk about uh that's at some future point right right right and yeah i guess what's interesting there also about the stack is is you know we're talking about uh frames being of a set size, but there's these functions in C and C++, well, kind of in C++
Starting point is 00:42:07 called allocate, and you also have variable length arrays, right? So an array where you say like n is the size that you declare on the stack. And it's kind of interesting, right? Like how are they optimized? What's their behavior when it's dynamic and stuff like that? And if you look at C++, you don't actually have variable length arrays at all. It's a common extension for compilers to support CVLAs in C++, but they're not officially a thing that exists.
Starting point is 00:42:36 So that's kind of neat, being able to change the size of the stack at runtime. So your function comes in, and you're like, oh, actually I want to have X things and you put them there. And I've seen a lot of programs do like small stack buffer optimizations doing that, right? So let's say like, well, I want to have a string or a vector,
Starting point is 00:42:56 but only have like say 256 bytes of stuff and I'll put it on the stack if that happens. And otherwise I'll just heap allocate it, right? is kind of an extension to the language in a way. And so it'll either do allocate or malloc, and then it'll do nothing or free when you exit the scope of the stack. So it's kind of interesting. And then there's interesting weird corner cases like allocate of zero is, I think, technically undefined here. And so are VLAs of zero. So it's kind of weird, right?
Starting point is 00:43:30 So I've seen compiler optimizations break code that had those, or I've written some, and they had to fix them because I realized like, oh, zero is probably a valid value, even though people didn't intend it. But like, what if you do allocate of a negative value, right? Like that, that's clearly undefined behavior. You can't ungrow the stack. So I, you know, I wonder how that's, that's handled in different programs. Right. And what's interesting is like allocate
Starting point is 00:43:57 and VLA's just don't fail. There's not a return code or whatever. It's just like, it happens. So yeah, that's, that's kind of neat. Yeah, that's interesting. I guess, so is it allocate or aloca? Because I think I've been mispronouncing it. If you can't, I mean, maybe it's both, but I've been calling it aloca. I'm French Canadian. I call it allocate.
Starting point is 00:44:16 Yeah, allocate sounds good. So if I do this allocate and it returns me a pointer to the stack location that it was able to acquire. So could that be null? Well, I don't think it can, right? I think it always returns whatever like stack pointer or frame pointer plus X that you asked to. I don't think it can fail.
Starting point is 00:44:33 Oh, I see. So it would have to be slower probably if it nominally could fail because you'd have to say like check some conditions. It's intended to never fail. Is that part of the language or part of like the open group POSIX standard? What level does Allocay live at?
Starting point is 00:44:50 Does C programming language include a notion of Allocay? I don't remember, honestly. I think Allocay is part of POSIX, but it's also how, at least within LLVM, like everything is implemented as quote unquote allocate. And then it uses constant values for a lot of stuff and, you know, creates the stack out of multiple allocates and stuff like that. So like, if you look at the internal IR for LLVM, everything's an allocate that goes on the stack. Yeah, it makes sense.
Starting point is 00:45:17 And then it tries to take everything off of the stack and put it into registers that it can, or combine these different stack allocations into a global frame that it can index slots out of and things like this exactly yeah cool well i mean it's kind of funny we we mix all this stuff on the stack and sometimes things are variable size sometimes you even try to put a fixed size thing on the stack and you think that you're writing into it the right way but then things can sometimes go wrong as well. So like the OG buffer overrun exploit was to overrun an array that someone
Starting point is 00:45:51 put on the stack. They were like, Oh, this will only be, you know, eight characters long, this, this little string here.
Starting point is 00:45:56 And then someone gets you to clobber the return location, which also lives on the stack by, you know, saying, Oh, you know, if I index past character seven and I do some writes, and if you didn't check the bounds on that, I could potentially
Starting point is 00:46:11 go find where the return address lives and write some memory location into there. And if I can write enough data onto the stack, and if the stack was actually executable, if I could jump to it, I could actually jump to the data that I had written. So it's always funny to think about how these exploits used to work back in the day. Now the stack is never marked as executable. That's like a big security no-no, right? So pages and memory, like we talked about virtual memory, they have these execute bits on them that says whether you can actually pull instructions into your CPU out of this memory region. And those are marked. That's fairly new, right? Like I think CPUs 20 years ago couldn't do that. I think it's a 20 year old technology, like the NX bit or whatever.
Starting point is 00:46:56 Like it's pretty new. Yeah. Yeah. I mean, it's a pretty fascinating, all the mitigations and things that have been added pretty recently in computer history, I guess. Right. Yeah. I mean, I remember I originally learned about that buffer overrun exploit in school, and I think it was nominally not really possible to happen anymore, especially because compilers could automatically insert guards and things when they were in protected mode. Yeah. So I think that kind of shows like all the stuff mixes together in the stack both the instruction
Starting point is 00:47:27 pointers and the and the data that you potentially want to home there and then if you're working in a memory unsafe kind of environment then potentially you can take advantage of the fact that these things all mix together things like virtual table pointers and stuff may also be in there right yeah yeah it's pretty cool yeah and i i guess like there's a lot of extra technologies that came in later not just on the cpu side but also just from the compiler right trying to protect stuff like there's this thing called stack protector that makes sure that when you create a large frame right so so say you have a large fixed size allocation on the stack say like you like 8K or something, then either statically or dynamically,
Starting point is 00:48:09 it'll touch all the pages that you might access. So that you don't allocate, say, like an 8K buffer and then just access the end, but that's part of guard page. So say you have a 4K guard page, if you were to access past the stack, so not use any of that buffer except the end, then you'd have a stack overrun effectively, like not technically your programs accessing valid stack, but the implementation of the abstract machine doesn't have that much stack. And so it's actually outside of the stack that the program really has, but not
Starting point is 00:48:43 outside of the stack that your source code contains. So that's kind of interesting. Yeah, can you tell us more about guard pages? I think we might have mentioned them earlier, but I don't think we really covered how those guard pages work. Right, yeah. So I guess when the OS creates a thread, it'll map a certain amount of stack, right?
Starting point is 00:49:03 And as we were saying, it could do that lazily or it could do that eagerly. And so if it, imagine, does that eagerly, right? So it'll give you like 16 KB of stack. And at the end, it'll put a 4K guard page, right? So it's a page that's marked not readable, not writable. And the intent is, well, if you go out of stack, it'll just fall to there, right?
Starting point is 00:49:25 But you don't map like, you know, four gigabytes of guard pages at the end of your stack. You just mark 4K or 8K or something like that. And so there have been exploits in the past where programs just jump over the guard page and the exploit goes and uses the edge of that code to jump over it and do malicious stuff because, you know, the stack is just memory. And so that happened to be, say, another stack or heap memory or something like that. And so it ends up clobbering some other valid thing way after the stack over the guard page. It brings up this fast, I guess, two fascinating things. One is a struct offset technically could be any value, right? It could be like, I have a struct with an array
Starting point is 00:50:07 with four gigabytes of data in the inline array there, right? So accessing that offset could take you arbitrarily far and you could be like, yeah, I'm placing that thing on the stack in an abstract sense. So that's kind of a funny idea. But then there's also the notion it brings up that, you know, nominally the stack kind of grows down, maybe if that's what you're used to, and the heat maybe grows up from the bottom of
Starting point is 00:50:30 the address map. And at some point they might meet, and then you might be able to like jump from one into the other, you know, because these are just kind of locations in memory. And maybe if you have multiple stacks, like you were saying, maybe your stack at some point will collide with another stack. So there's always these funny ideas of at the end of the day, we just have this big flat address space in some abstract sense on modern machines. And so jumping around inside, you have to make sure you know the extents and stuff, which also brings up segment registers and things, which I don't know if we should talk about today. But maybe, you know, if we talk about knackle or something fun like this in the future, we could talk about today. But maybe, you know, if we talk about Knackle or something
Starting point is 00:51:05 fun like this in the future, we could talk about how segmentation has evolved on architectures like x86. Yeah, lots of interesting stuff there. Speaking of x86 evolution, maybe you want to talk a little bit about the red zone? The red zone? Yeah, yeah, the red zone sounds cool, right? So it's a cool idea that you can guarantee some space underneath your frame, right? So like 120 bytes on x86-64. And so when you're a leaf function, you know there's like this extra space that's just there and it's guaranteed to be there. So you can avoid doing accounting and just kind of clobber stuff that's related to your frame pointer. That's kind of neat, right? So it's just part of your ABI, your application binary interface, that extra space just kind
Starting point is 00:51:51 of exists there. Yeah, that is super cool. Yeah. And there's other interesting stuff about the stack related to that. Like, for example, alignment is interesting too, right? Like, usually you always try to keep the stack aligned, right? Because having a byte-aligned stack is weird when the whole point is to spill registers, and your registers are, like, say, four bytes or eight bytes or something like that. And what's interesting is,
Starting point is 00:52:16 I remember when ARMv8 was still being designed, they had this interesting thing they came up with at some point where all the stack-specific instructions, right? So instructions that were really manipulating the stack, aligning it or something like that, were allowed to fault, or had to fault, if the stack itself wasn't aligned properly, right? So on ARMv8, it just does that. And it's expected that any stack manipulation spills at least, like, two 64-bit registers, right? So it's expected that your stack is aligned to that 16-byte boundary all the time. And so it's kind of interesting there, right, where instructions aren't just specialized to the stack. They have these extra things that try to find incorrect usage or things like that.
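To pin down the red zone idea from a moment ago, here's a hedged sketch. The C++ is ordinary; the ABI claims live in the comments and apply to the x86-64 System V ABI, and the attribute assumes GCC or Clang.

```cpp
// The red zone, roughly: under the x86-64 System V ABI there are 128 bytes
// below rsp that signal delivery is required to leave alone, so a leaf
// function (one that calls nothing else) can park small spills there
// without ever adjusting rsp. Kernel code builds with -mno-red-zone since
// interrupts in kernel mode land on the same stack and would clobber it.
#include <cstdio>

__attribute__((noinline)) static int leaf_sum(const int* v, int n) {
  // Small local state like `acc` and `i` is exactly the kind of thing a
  // compiler may keep in registers or in the red zone, with no frame setup.
  int acc = 0;
  for (int i = 0; i < n; ++i) acc += v[i];
  return acc;
}

int main() {
  int vals[] = {1, 2, 3, 4};
  printf("%d\n", leaf_sum(vals, 4));  // prints 10
  return 0;
}
```

If you're curious, compiling a small leaf like this with and without -mno-red-zone and eyeballing the assembly is one way to watch the explicit stack-pointer adjustment come and go, though exactly what the compiler emits varies by optimization level.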
Starting point is 00:53:00 It's kind of neat. Yeah, I guess the more... And confusing. Yeah, the more things you know about how your system should be working, the easier it would be to detect if something is trying to be exploited or going off the rails potentially. Your invariants about how the system
Starting point is 00:53:16 should be operating get stronger. But then it also prevents people from doing super creative, if somewhat dangerous in the general case, stuff. I guess self-modifying code always comes to mind, right? There's this idea of futurist programming where everything is self-modifying code, tight inner loops to get all the efficiency of a modern machine. So some of those things can become more difficult as we start trying to create this, like, general process around how things are supposed to work. Yeah, I think this idea of, like, processes interacting with performance, like having implications for how a machine is used, is pretty generally interesting as a theme.
Starting point is 00:53:58 So there's also, you know, the classic considerations that you learn in computer organization class around which registers are saved by the caller and which ones are saved by the callee. And caller versus callee saved conventions can have a big impact on performance, because if you save too much in the caller that the callee doesn't end up using most of the time, then you're just kind of wasting performance. And if the callee has to save everything, but pretty much all the callees want to spill at least a couple of different things, that's also wasteful. And also on architectures that, by convention, assigned special purposes to different registers, like EAX and EDX being used in very special ways for multiplications, or, you know, ECX and EDI being used for special things for repeated string instructions and things like this. These also played into the constraints that would lead you to interesting caller
Starting point is 00:54:48 versus callee saved conventions. There's also: what do we save away conventionally as part of that context when we switch from user space to kernel space? The kernel, on an interrupt or a syscall or what have you, will probably want to work with some set of resources that's not too small, but also you don't want it to be too large
Starting point is 00:55:08 because then you're saving a bunch of stuff that the kernel routine ends up not needing to use. So, you know, for syscalls, you have to have kind of this alternative stack in kernel space. And in general, the kernel threads have like a full separate stack
Starting point is 00:55:22 once they leave interrupts to go into like a full kernel thread. So there's like a whole space of conventions around performance in the stack that, you know, create this whole ecosystem that we're working in, which is always fascinating, right? And what about managed virtual machines? I guess we didn't talk too much about that. Is there cool stuff there, like maybe in WebAssembly land, around how stacks work? Right, yeah. In WebAssembly, it's interesting because the return address and most of, well, all the code is completely separate from the data, right? So WebAssembly is meant to be like a secure VM that can run inside a browser.
Starting point is 00:55:58 And so, you know, computers nowadays, most like C and C++ programs, they just operate on what's called a von Neumann machine, where data and code are just kind of mixed together. And WebAssembly really follows what's traditionally called the Harvard architecture, where those are really, really separate. And if you're writing a C or C++ program or Rust or whatever, in WebAssembly, you don't know where any of the code is for real. And so this stack isn't really a thing in WebAssembly. Or rather, WebAssembly, when you compile C++ to it, really has two stacks, right? It has the call stack, with its internal state, whatever that might have, function pointers and other things like that. And then it has whatever state needs to be reified into your WebAssembly program.
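A hedged sketch of that split, in plain C++. Nothing here is a WebAssembly API; the comments describe what a typical C/C++-to-wasm toolchain (clang-based, Emscripten-style) tends to do, and the exact lowering depends on the compiler and optimization level.

```cpp
// The "two stacks" you get when compiling C/C++ to WebAssembly, sketched.
#include <cstdio>

__attribute__((noinline)) static int helper(const int* p) { return *p + 1; }

int main() {
  // `a` never has its address taken, so it can live purely in wasm locals
  // on the engine's call stack, which program code cannot address at all.
  int a = 41;

  // `b` has its address passed to another function, so it needs a real
  // address: it gets a slot on the "shadow stack" that the toolchain
  // maintains inside the module's linear memory, next to the heap.
  int b = 41;
  int r = helper(&b);

  printf("%d %d\n", a + 1, r);  // both 42, but only b ever lived in linear memory
  return 0;
}
```

The usual rule of thumb is that anything whose address escapes has to live somewhere addressable, and in wasm that means the shadow stack in linear memory.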
Starting point is 00:56:52 So say if you take the address or something of a variable, say you call printf or whatever else, then it needs to see the values. And so there's a separate stack that's in the WebAssembly memory, effectively your heap. And that's separate from the call stack. And if WebAssembly is running inside a web browser, it's interleaved with JavaScript. Like, you know, there's a JavaScript stack interleaved with a WebAssembly stack. And within the WebAssembly memory, there's a completely separate stack as well for your program that's being reified there. And that's kind of part of the ABI there again. And what's extra neat with WebAssembly is, like, if you look at the VM spec itself, right, the virtual instruction set architecture, it doesn't have a finite set of registers. It just has infinite registers. So functions can just have
Starting point is 00:57:34 infinite registers. Functions can return an infinite set of values as well, right? So functions don't just return one thing, you can just return a dozen. So it's virtual registers and variadic return value arity. That's exactly it. And variadic input in a way, or, you know, functions have to have a set number of inputs, but it's not limited to, you know, 12 or something. You can give a function 100 values or more if you want to. How does it do printf? How do you do va_lists? Yeah, so that's why printf and va_lists, the way you have to, well, not you the programmer, but the way you the compiler has to make that actually happen, is you have to create a stack in user-accessible memory. So that's why you have a stack in the WebAssembly memory, so that you can spill to that location that's accessible. And then printf, which is itself implemented in user code within the
Starting point is 00:58:31 WebAssembly VM, is able to access the %i, %f, or whatever you're trying to print. And so yeah, va_lists are implemented as stack spills in the WebAssembly memory. Cool, cool. So it's kind of neat. Yeah. I mean, I always find this fascinating. I guess in the full web browser, now we talked about there's the native stack, there's the JavaScript stack, which the JIT can unify with the native stack, right?
Starting point is 00:58:56 And then there's also the WebAssembly stack. And then WebAssembly is also being JIT compiled, potentially pulling stuff onto the native stack again. So, you know, it's funny how this whole ecosystem of things, they all have interesting, and in certain ways unifying, treatments of this general notion of stacks. It's like the glue that binds program execution together in a way. I don't know, maybe it's over-romanticizing a little bit, but stacks are super integral to how we think
Starting point is 00:59:25 about how machines are going to work today. Right. In managed VMs, there are also cool features, because when you're managed, you potentially can put more invariants on the program, because you are managing what the language is doing. That's the idea of a managed virtual machine. So in places like the CLR, which is the .NET JIT execution managed environment, you actually have a typed stack. And so there have been projects, I think it was called Pyjion, that was mapping Python onto this typed stack. And Python has both a data and a control stack per frame. And so, you know, figuring out how to mux the Python execution onto the CLR notion of stackiness in this managed world,
Starting point is 01:00:11 it creates all these interesting ideas and interactions around what a stack should be, how it should operate, what invariants we know about it, like whether the type of a given slot is always the same, or is it just a big bag of bytes? Or if you're thinking about a managed language environment, like with typed arrays, are the typed arrays words or are they bytes?
Starting point is 01:00:32 You can have all these different variations on how things think about them. Yeah, that's cool. And there's other interesting stuff. Again, in JavaScript, what's interesting is, like in C++, it's not really defined where the stack ends or what happens when you run out of stack. And realistically, what happens is you'll trap nowadays, right? Because there's often a guard page in most implementations. But in JavaScript, it's also not defined when you stack overflow, but it is defined
Starting point is 01:01:03 that if you stack overflow, there's an exception that's thrown, like a JavaScript exception, and you can catch it, right? And that's kind of neat as a programmer. Now, it's kind of hard to handle. And as a VM implementer, it's kind of painful, right? Because imagine, like, when you JIT code, you're usually pretty careful about the
Starting point is 01:01:25 JIT code not running out of stack. So you'll do a check or something like that. And you'll throw an exception if it runs out of stack, kind of like what Stack Protector does. But if you're writing runtime functions that are supposed to support the JIT, right? So, you know, you as a VM implementer write C++ code that JITs code, but then at some point you're like, eh, this math function or whatever, or this helper function, it's kind of hard to write in jitted assembly, so I'm going to just write a C++ function that does that.
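For flavor, here's a rough sketch of the kind of stack-limit check a VM's C++ runtime might carry around. The names (InitStackLimit, CheckRecursion, HelperEval) and the 64 KB budget are made up for illustration, not any particular engine's API; real engines derive the limit from the actual thread stack bounds.

```cpp
// Low-water-mark stack check, sketched: record how deep we're allowed to go,
// then compare the current stack position against it before recursing.
#include <cstddef>
#include <cstdint>
#include <cstdio>

static uintptr_t g_stack_limit = 0;

// Call once near the top of the thread with the stack budget we allow
// ourselves. The address of a local is a good-enough proxy for "where the
// stack is right now"; stacks grow down on the platforms assumed here.
void InitStackLimit(size_t budget_bytes) {
  int probe;
  g_stack_limit = reinterpret_cast<uintptr_t>(&probe) - budget_bytes;
}

// True while we still have headroom, false when we're too close to the edge.
bool CheckRecursion() {
  int probe;
  return reinterpret_cast<uintptr_t>(&probe) > g_stack_limit;
}

// Stand-in for a runtime helper that can recurse arbitrarily deep.
bool HelperEval(int depth) {
  if (!CheckRecursion()) return false;  // a real VM would throw "too much recursion"
  if (depth == 0) return true;
  volatile char pad[64] = {0};          // make each frame take real stack space
  return HelperEval(depth - 1) && pad[0] == 0;
}

int main() {
  InitStackLimit(64 * 1024);  // pretend we only get 64 KB of native stack
  // With a million requested levels, we bail out safely instead of crashing.
  printf("finished all levels? %d\n", HelperEval(1000000));
  return 0;
}
```

The awkward part, as JF gets into next, is making sure checks like this actually cover the C++ helpers that jitted code calls into, not just the jitted code itself.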
Starting point is 01:01:55 I'm going to call it from the jitted code. That C++ function itself has a stack. So imagine the virtual machine is, you know, calling JavaScript code, and eventually it calls this runtime function, and you run out of stack right then and there, right? So JavaScript had stack, it did all the checks it needed to, but then the helper doesn't, right? How do you check that you're not out of stack, right? That's kind of an interesting thing. And it's been the source of, at a minimum, bugs within JavaScript virtual machines, if not exploits. And so I've seen fuzzers for JavaScript
Starting point is 01:02:32 that have this cute pattern: like, recurse this function a bunch of times, then call a helper, then unrecurse once, call the helper again, unrecurse another one, call the helper again. And it tries to kind of mess with the VM to fuzz the VM itself. So it's kind of neat. I see. So it's actually saying, oh, I saw I got a stack overflow here. Let me back up one. Call some native stuff and see what happens. That's very, very tricky. Yeah. It's kind of a neat pattern. Yeah. I mean, working on the JS engine in Firefox a bit, fuzzers are the best. You turn them on when you're ready and you get the deluge of all the interesting things that are corner cases in the universe.
Starting point is 01:03:13 They're amazing tools and they're great. Yeah, there's really good fuzzers out there for JavaScript. They do interesting modifications and sometimes you look at the code and you're like, wow, this is evil. But yeah, you find good stuff. Yeah, the reducers are really the key. They're less talked about, but they're kind of the stars of the show of, all right, you found some horrible thing.
Starting point is 01:03:33 Now please reduce it for me. And then getting like a crisp, you know, this program is only this long and yet it shows some key problem is always very amazing to see. Yeah, absolutely. That was a lot of stuff, my friend. Yeah, I've kind of reached my stack limit. I feel like there's more stuff we haven't talked about, like rotating register windows or return predictors or something like that.
Starting point is 01:03:56 But I think we're near the red zone right now. Yeah, yeah. I think we should probably do another podcast later because there's basically infinite stuff to talk about. Right, unlike the stack, this has so much stuff to talk about. We should do more TLB hits, everything about TLB hits. That's right, everything TLB hit. All right, on that note, well, thanks for doing this with me,
Starting point is 01:04:18 and I guess we'll talk to you folks next time. If people have feedback for us, write to us on Twitter, at TLB hit. That's it, yeah, thanks. Appreciate it.
