TLB Hit 💥 - Episode 1: *(char*)0 = 0
Episode Date: November 23, 2020
The adventure of storing NUL to NULL!...
Transcript
Oh, hey, Chris. So we talked about TLBs, but we didn't really define what the acronym means
last podcast. So let's rectify that by talking about it way too much.
So I suggest that today's episode be titled Star Char Star Zero Equals Zero.
Hey, JF. Yeah, we should also continue to note that we only know so much,
but we'll try not to say things that are wrong. And we'll follow up with errata
corrections and clarifications as we find out about them
and put them on the website.
We do the lifelong learners thing
and we're always trying to push the boundaries
of things that we know
and have conversations about what's interesting
to develop and hone our understanding.
So feel free to give us feedback.
All right.
So we got some feedback from last episode.
So I'll do the first mea culpa here from the previous episode. I said that Go uses segmented stacks, but it doesn't
anymore. It used to, doesn't anymore. They do stack copying now. So sorry about that. We were
slightly inaccurate, but mistakes are how you learn. So let's continue. Something else I want to point
out. So we've got extensive show notes on the website. So if you're not sure about a term
that we use or anything like that, it should be on tlbh.it. That's TLB hit, but with a .it domain
name. So we'll have show notes there. All right. So today, how are we going to talk about TLB hits?
Well, yeah. So how about we figure out how a program gets to a TLB hit to help us define
what a TLB is and what hitting it means.
So what I propose is we look at a C program.
So it'll start like this.
Int main, open paren, close paren, open curly brace, star, open paren, char star, close paren,
zero equals zero, semicolon, close curly brace.
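For reference, written out, that's:

    int main() {
      *(char *)0 = 0;  /* store the value zero to address zero */
    }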
And I guess for people who don't have a teletype writer inside of their
head: basically that's an int main where we're taking null, or taking zero, turning it into a
char star, dereferencing it, and assigning it null. So storing zero to address zero as a char star.
Exactly. How could you not love this program? What about its much more explicit C++-y cousin
where we use the null pointer value
and we reinterpret cast it?
What about that one?
Right.
So if you did reinterpret cast of char star,
take null pointer,
shove it into reinterpret cast
and then dereference that,
that just wouldn't compile, right?
Because you can't reinterpret cast null pointer.
It's just a C++ thing.
One thing you could do as of, I guess, C++20
is you could use a std bitcast, right?
So bitcast just takes the value and moves it,
moves the bits,
but that's not guaranteed to have a value
because null pointer itself in C++,
it's kind of unintuitive, but it's just a type.
It doesn't have a value.
It's like a monostate type, right?
The fact that you have a null pointer is all you need to know. There's no value inside of
it. So the value is not relevant. And so it's not really guaranteed what you get when you bitcast
or when you copy a null pointer. So this is maybe one of those examples where the old ways are
actually a little bit more clear: just char star, C-style cast, and dereferencing it. So what, in C++,
what's the difference between a bit cast and a reinterpret cast again? There must be some reason
that we have both of those, right? Yeah. Yeah. So a while ago, like the C++ committee had talked
about having reinterpret cast of references, to allow you to reinterpret something as something
else. But what was decided for C++20 is to have bit cast instead. And what it
does is it takes the object representation of one object and takes its literal bits, right,
as if by memcpy, and then interprets those as the object representation of an other object,
right? So what's interesting here is it requires that the types, both the source and destination
types have the same size.
So you can't bit cast 64 bits to 128 or something like that.
And it requires that both be trivially copyable, which is basically a fancy way of saying you can memcopy it.
It's valid to memcopy it.
So there's no stuff you do when you copy it.
No special stuff besides memcopy.
And then what's interesting about bit cast compared to reinterpret cast is it
creates a completely new object, right? So it returns by value instead of returning a reference,
right? Whereas reinterpret cast returns a reference or a pointer, as you specify it.
So there's a lot of tricky stuff around there about, you know, whether value representations
exist or not in the to type or the from type and whatever else. But the key here is basically like
null pointer t just doesn't have a value anyway. So it's kind of silly to bit cast it.
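A quick show-notes sketch of the difference, with a float-to-bits example that's ours rather than from the episode:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    int main() {
      float f = 1.0f;

      // C++20 std::bit_cast: source and destination must have the same size and be
      // trivially copyable; it produces a brand-new object, as if by memcpy.
      std::uint32_t bits = std::bit_cast<std::uint32_t>(f);
      std::printf("%08x\n", bits);  // prints 3f800000

      // reinterpret_cast, by contrast, hands back a pointer or reference to the
      // *same* storage; no new object is created. (Viewing the bytes through an
      // unsigned char* is one of the few aliasing-safe ways to do that.)
      unsigned char *bytes = reinterpret_cast<unsigned char *>(&f);
      std::printf("%02x\n", bytes[0]);
    }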
Do any of us really exist? Can we really know? I don't know.
I don't know.
Yeah. It's always fun to learn about the language-lawyery aspects of C++ because you kind of just
type in reinterpret cast or bit cast, but you don't necessarily think like,
what's the meaningful distinction between the two of these things?
Right.
So from your description, I can't reinterpret cast a char to an int.
All right.
So it sounds good.
We should probably literally get back to the program.
So we're dereferencing null and assigning it zero.
And it's a char star.
Yep.
Yep.
And so what I like about this program,
what I really love about it is how really open it is, right?
You write that on a whiteboard and you're like,
so let's talk about this.
And I've done that pretty often.
And some folks will talk about, you know, the language.
Some folks will talk about how a compiler implements this.
Others will talk about how an OS or hypervisor works, the instructions you see.
Some folks, you know, will want to talk about how the hardware does things with it, right?
So it's pretty open and there's a lot to talk about, right?
So let's dig into the basics.
Conceptually, what we're trying to do in this C function,
and spoiler, we're not really succeeding at it,
but what we're trying to do is we're trying to reinterpret the value zero as a pointer,
dereference that pointer,
and then store another value, zero, inside that memory location.
That's what we're trying to do at a high level.
Right, so if you're going to translate that abstractly from C to assembly,
right, it's always good to do an exercise in your head where you look at some C code and figure out
how you'd straight-line translate it into the corresponding assembly code. So we'd kind of
take this constant zero and, with a store instruction, we would move into that zero address location
the zero value.
Exactly.
Yeah, so pseudocode is always a great way to look at that.
And even with people who don't necessarily
know assembly very well, right?
Pseudocode is something that most people
should be able to get, like the basic,
what does the CPU generally do, right?
And so creating an immediate, moving it into register,
and then storing to that register,
that's basically what it's trying to do, right?
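Roughly, in ARM64-flavored pseudo-assembly (a hand-wavy sketch; the real instructions depend on compiler and target):

    mov  x8, #0        ; materialize the address zero in a register
    mov  w9, #0        ; materialize the value zero
    strb w9, [x8]      ; byte store of zero to address zero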
Makes sense.
And so I think a question we can ask before we even get to the assembly
is does the compiler even let you compile this program?
So you're writing something into null,
and so does it let you turn it into assembly,
or does it give you an error at
compile time? And if it does give you an error at compile time, why? Some people will run a static
analyzer on their code to catch definite null-dereferencing errors like this, that you can
determine statically, but that probably won't come out of the box as part of invoking your normal C
compiler. If I just fed this to GCC or Clang, they would happily turn that into a couple of assembly instructions for me.
Right. You know, if you look at, say, Clang static analyzer, it'll flag it.
It has a thing that tries to detect any type of null pointer dereference.
And those are possible null dereferences, right?
Like nobody writes code literally like this, except us, I guess.
But if it can try to figure out like,
hey, you know, you initialize some object to null or a pointer to null.
And then if you take exactly this path,
like a static analyzer could figure out,
you know, this object is null.
It'll tell you, right?
And this is a legit source of bugs in real code.
So a lot of people will run some static analyzer
as part of their pre-submit scripts
or, you know scripts or nightlies
or whatever.
And other static analyses like Coverity also do stuff like that.
All right.
So at least the compiler will let us turn this into some binary, some set of assembly
instructions.
Right, right.
But now we should note that this is undefined behavior, both in C and C++.
That code is just technically undefined behavior.
Right.
And undefined behavior, that generally means it can be exploited by the compiler that this
thing is not the case, right?
So when you say X is undefined behavior, the compiler can assume that X is something that
doesn't happen.
Because if you did happen, then it would be undefined behavior that occurs.
So X in this case is I will not write to address zero dynamically because you're only allowed to write to actually existing objects.
And then the compiler can use the fact that it must be an existing object to do optimizations or what have you.
So if you actually do this at runtime,
then all bets are off for what the program behavior actually is.
Right, right. And the important thing here is this is all a dynamic property, right? Like if
you have undefined behavior in part of your code that never runs, then it's not undefined behavior
because it never executes. So in this case, if you actually execute that code, it is undefined behavior. And it's really that, at least in C++, C++ likes to think about objects.
Everything's an object. And if there's no object there, then you're not allowed to write to it.
There's no lifetime that's begun for that location. And at a high level, undefined behavior is not
the compiler trying to be a jerk to the programmer, right? It allows the compiler to assume that something isn't possible.
And if it were to happen,
then, you know, that's a bug in your program.
And it ends up being kind of a ripple effect
in code optimization, right?
Where each individual step as an optimization makes sense,
but the final result is often pretty surprising to people.
So in this case, we could make it well-defined
by using volatile and using a non-zero value like 1, for example.
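A show-notes sketch of that well-defined variant, plus the memory-mapped-register pattern it resembles; the 0x40001000 address and the UART register here are made up for illustration:

    #include <stdint.h>

    int main(void) {
      /* The episode's program, made well-defined-ish: non-zero address, and
         volatile so the compiler has to actually emit the store. (On a normal
         user-space OS this still faults at runtime, since that page isn't
         mapped; the point is the language semantics and the code emitted.) */
      *(volatile char *)1 = 0;

      /* The same pattern as it appears with memory-mapped hardware registers.
         The address and the register are hypothetical; in real code they come
         from the vendor's manual or header. */
      volatile uint8_t *uart_tx = (volatile uint8_t *)0x40001000;
      *uart_tx = 'A';  /* would tell the (imaginary) UART to transmit 'A' */
      return 0;
    }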
This pattern, not specifically 1, but the pattern of taking some number, casting it to a volatile char star or int star or something like that, you see that really often with hardware-mapped registers.
So your hardware vendor will say, well, if you write at this address, it's kind of a magical value in my hardware.
It's documented in the manual or the header file. And if you write to it, here's the behavior that
you get out of the hardware. It'll do something special. Right. So we would use a volatile address
of one and the value could still be zero. And that would be well-defined in that case. And so
zero realistically is not mapped into the address space in most modern user-space environments. So in
kernel mode, or in more niche architectures, this actually could do something: you could have
a peripheral that had some register mapped in at memory address zero, or you could just not fault
on a write to address zero, potentially, if you were running some embedded
kind of system. And on x86, for example, in physical memory, the interrupt vector table
resides there. Right. You know, we talked about what we think the assembly would look like,
but let's look at what the assembly really looks like, right? It's good to form an idea of what
you think is going to be there and then try it out in reality and see if you were wrong. Now, what's funny is if you go to, you know,
godbolt.org or whatever, and you punch that in, you try GCC or LLVM in different modes,
so x86 mode, ARM mode or whatever. What's really funny is that without volatile, Clang just
literally on x86 makes this UD2. UD2 is just an instruction that traps, right? And it makes it BRK, another
instruction that just traps, on ARM64. So the code you get after optimization is just like, you know,
you enter main and you crash. That's all the code you get outside of Clang if you don't use volatile.
Now, GCC is a bit nicer. What it does is, instead of just emitting UD2 or
BRK, it'll dereference null and then emit UD2 or BRK if you don't have volatile in your code,
right? So GCC is being nice. It's doing what you asked it to do. And then it just, it just crashes
afterwards. But if you add volatile in there, right? So you do star, paren, volatile, char star,
close paren, zero equals zero. What you'll end up seeing is on x86, you'll get a move
of byte pointer, you know, square bracket, zero comma zero, or on ARM, it's STRB. So a byte store of
some register into address zero. Now, what's extra funny is GCC still emits UD2 after the
null dereference, even if it's volatile. And GCC even ignores the value being written.
So if we change the code to star volatile char star 0 equals
0x42 or something, the value it tries to store is still 0.
It tries to store it at address 0,
but the value it tries to store into it is 0.
It ignored our 0x42, and then it has a trap after it, right? So this is a bit
unintuitive, and I know it's pissed off a lot of kernel people in the past, but you can tell GCC
to stop this with the option -fno-delete-null-pointer-checks, which is super unintuitive and
frustrating to people around volatile. But if we step back a bit and we say like, okay, well, zero is a special value,
but we'll dereference one instead, right? So star volatile char star one equals zero or equals 0x42
just to get rid of the silly UD2, then everything's fine, right? So it emits the code that we'd
expect. So this means the address zero is really treated specially by both Clang and GCC. And in
this case, volatile just tells the compiler when it sees address one here,
that some effects that it can't reason about
might be happening, right?
There's something special.
It shouldn't try to be too smart,
even though it was trying to be smart.
And using volatile in this very specific case,
we're just tricking the compiler,
telling it like, you know, don't do that.
But this pattern, if you have, you know,
a hardware-mapped register, is legit a way
to access special memory that's hardware mapped, right? So it's something that happens in every computer pretty frequently. Okay, so let's
focus on user mode instead of talking about kernels or embedded devices or whatever. How does
this instruction cause a trap, or should we call this a fault? Yeah, that's a good question. Maybe
first we could talk a little bit about exceptional things that happen in the processor in general when you're running a program.
So there's this taxonomy we actually have of weird stuff that happens that's independent of the normal flow of instructions through the processor.
Sometimes people break them into categories of traps, faults, and aborts.
So traps would be something where you're kind of asking the
operating system for assist or to inspect what happened. Faults would be, oh, something actually
went wrong and maybe it can be corrected. And aborts are just, you know, the program basically
has to terminate. That's kind of what that taxonomy means. Generally for exceptions inside
of the CPU, there's questions like, do I need to be able to show the exact architectural state where this exception thing happened?
You know, there's differences between things like divide by zero and a system call where you're actually asking the operating system for something.
And a peripheral device sending you some asynchronous interrupt like, hey, service me. I have some packet that came in or something
like this. And then there's related questions like, can the hardware effectively reorder things,
assuming that a load is non-null? So this gets back to precise exceptions. So if a segfault
happens, you want to be able to analyze the cause of the segfault that you got instead of just kind
of skidding to a stop however it happens with the program. You want to say, ah, this is exactly what led me to segfault here. And that
gets into things inside of CPUs like reorder buffers and the point of no return, to say whether
you can start clobbering your architectural state with more things that happen. And then you get back
up to the compiler level and you ask, can the compiler reorder things, assuming that a load is non-zero?
So even though the abstract machine in the C standard or the C++ standard has no notion of what a program looks like when something gets loaded from zero,
compilers are bridging the practicality gap between this abstract machine concept and what people really expect their program to do when running against the hardware, which again goes back to this notion of implementation defined or undefined behavior.
And even more in the wacky things that happen zone, there's also traps that can happen or
faults that can happen within instructions. So if you have a repeated instruction like REP MOVSB,
which is like the memcpy instruction,
that's a canonical example of one of these composite SISC instructions where maybe part of it goes well, and then there's a fault somewhere in the middle.
So how do you recover?
How do you resume?
These become interesting questions, like with the multiple-register stack push ARM instruction that we talked about
in the last episode.
There's also devices that trap the CPU potentially in the middle of an instruction, like someone sends an asynchronous interrupt and it's a long-running instruction. How do we preempt the long-running instruction? Or something that races with another trap, like if you have a uniprocessor, a single-core system, where the non-maskable interrupt comes in while some exception is also being raised from somewhere else.
So memory accesses in user mode are all dealing with virtual addresses, which we talked
a bit about last time. So virtual addresses are not the same as physical addresses, right? We said
physical addresses are actually like DRAM locations that are all linearly lined up in physical memory
where that DRAM is. Each process and each guest virtual machine, if you're doing virtualization,
has its own view of memory, which is the virtual memory view
that's distinct from how other processes and virtual machines might see memory.
And that's all done through this virtual memory address layer.
Right. But then like, so we have these two worlds of virtual and physical stuff,
and it's pretty complicated how one maps to the other. But when you write a program,
not, like, you know, in a high-level language like C, but when you write assembly, right,
maybe a third of, you know, probably a third of all instructions are memory accesses.
That's a lot, right? So all this complication, and a third of instructions have to deal with it,
right? Right. So I'll flag a citation
needed on where does the one third come from. But I think probably your reasoning is that if a basic
block is around five instructions, then the conventional wisdom is there might be one load
and one store in there. So two fifths or something like this. Right. Yeah. That number is totally what
I technically call an NPOA. So it's a number
pulled out of S, right? It's roughly right. About a third of instructions is about right. And as you're
saying, you know, if you count, you know, five instructions in a basic block and there's some
load and some store, it's about right. Still, I like the idea of the order of magnitude, you know:
maybe a third, or a bit less, of instructions are loads and stores, or access memory.
That's a lot. Right. So in computer
systems, we're always talking about the fast path or the common case, because if we can speed that
up, then, you know, it might be a large percentage of the workload and you're speeding up the most
important things in that case. So in the common case, when you've addressed this memory location
recently, you want accesses to it to be fast. And caches are how this is made fast. And
we talked a little bit about that last time. Some caches can be addressed virtually using the
virtual addresses, and some can be addressed physically using the addresses that you resolved
from the virtual address. So that means that the virtual address could be used for what's called a
virtually addressed cache. And of course, it's addressed by the process and its view of the virtual memory. So there's like a virtual
machine ID or a process context that tells you this is the process and this is the virtual
address that that process has. Yeah. Yeah. And so as soon as you address physically, and
most caching is done physically, then you need to somehow translate the virtual address to a physical one.
And the kernel keeps that mapping in what's called a page table.
So that's basically a map of virtual address, process ID, VM ID, something like that, all the relevant bits.
And it maps those bits to the physical address, as well as some properties of the memory location, such as access permissions.
So can you read there?
Can you write there?
Can you execute that location?
And at a high level, the page table is organized as a tree, right?
So if you think about the tree data structure, it's laid out in memory.
And each level of that tree is keyed off of some bits of the virtual address because it's going to be pretty sparse.
And then flattening that tree would make it pretty huge.
So you want to have a sparse layout for it.
Right.
And the base address of that tree where I can start walking it is pretty special
because in x86, we might have a hardware page table walker
and it needs to know where to start walking that exact structure.
So there's this cool control register called CR3, if you've ever heard of that.
So what's really cool about this is the kernel puts some stuff in memory and then tells the
special register and hardware, hey, this location is special.
And then the hardware can start walking an actual data structure that's in memory, that
the kernel agreed to.
That's kind of mind-boggling to me, right? Like it's something that as a user space programmer,
you're just really not used to. The hardware just doesn't really touch your stuff, it executes it,
right? It doesn't walk your data structures in fancy ways, except here it totally does.
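As a show-notes sketch, here's roughly what a four-level, x86-64-flavored walk of that structure looks like with 4 KiB pages; it's simplified (real entries carry permission and large-page bits), and read_phys is a hypothetical stand-in for reading physical memory:

    #include <stdint.h>

    uint64_t read_phys(uint64_t phys_addr);  /* hypothetical: read 8 bytes of physical memory */

    /* Walk a 4-level page table rooted at the physical address held in CR3. */
    uint64_t translate(uint64_t cr3, uint64_t vaddr) {
      uint64_t table = cr3 & ~0xFFFULL;               /* base of the top-level table */
      for (int level = 3; level >= 0; level--) {
        /* Each level is indexed by 9 bits of the virtual address:
           bits 47-39, then 38-30, then 29-21, then 20-12. */
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = read_phys(table + index * 8);
        if (!(entry & 1)) {
          return (uint64_t)-1;                        /* present bit clear: page fault */
        }
        table = entry & ~0xFFFULL;                    /* next table (or, at the end, the frame) */
      }
      return table | (vaddr & 0xFFF);                 /* frame base plus offset within the page */
    }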
And so what's interesting here is the granularity of the mapping that we're talking about depends on memory pages, right? So the valid memory page
size, like what is that? Well, it has to be dictated by hardware, right? Because the hardware
has to tell you like, here's the valid page size and the kernel just follows whatever it's told
is valid. And then often the smallest page size is going to be something like say four kilobytes
or something like that. And in some hardware, you can add larger pages, right? So often it's called huge pages or super pages or something
like that. And yeah, so that's pretty interesting. And then finding the base of a page, how do you do
that? Well, that's actually pretty easy, right? You just mask off the bottom bits. If it's four
kilobytes as a page, you just mask off four kilobytes worth of stuff, and that's the base
of your page, right? So that's kind of interesting. Right. So then to zoom back out to the big picture, we have this page table structure
and that lives in memory. And then this piece of hardware, which is often specialized, let's call
it like a page table walker. It understands exactly how that's going to be laid out. It
knows what all the structs look like inside of there. And that's a contract between the piece
of hardware that's being run on and the operating system,
like you were saying.
And then the hardware is capable of doing lookups
into that data structure in memory
to find the virtual to physical mapping
by slicing off some bits from the virtual address
and then using that to walk through these entries
that are organized in a tree shape.
So hardware knows how to find that page table through that special register, CR3.
And then doing this walk, which potentially goes through like a bunch of different page
table levels, like four, for example, would be four data-dependent accesses to do it every
single time.
And like you said, there's a ton of load instructions.
So doing that for every single memory access would be quite expensive.
So what do we do?
The same way that we solve every other problem in computer science, we add a cache.
And because it's an important cache, it has a fancy name, which is Translation Lookaside Buffer, or TLB.
And every memory access looks in the TLB to see, hey, did I already look up the mapping for this part of the virtual
address to the part of the physical address? And if it's found, that's a TLB hit. And if it's not
found, that's a TLB miss. And that makes us sad. Yeah. Okay, cool. Right. So every memory access,
every time you have an instruction that does a memory access, it'll look at the TLB and try to
see if it's a hit or a miss. That's really interesting, right? And what it's doing is it's making it so that you don't have to do a page table walk
every time you do a memory access.
So that makes a lot of sense.
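A toy sketch of the lookup that's happening on every access; a real TLB is a small hardware CAM, often set-associative and tagged with things like an address-space or VM ID, and it does the comparison in parallel rather than in a loop:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64
    #define PAGE_SHIFT  12   /* 4 KiB pages */

    struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Map a virtual address to a physical one if we have the translation cached.
       Returns true on a TLB hit, false on a miss (go walk the page table). */
    bool tlb_lookup(uint64_t vaddr, uint64_t *paddr) {
      uint64_t vpn = vaddr >> PAGE_SHIFT;             /* virtual page number */
      for (int i = 0; i < TLB_ENTRIES; i++) {         /* hardware compares in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
          uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
          *paddr = (tlb[i].pfn << PAGE_SHIFT) | offset;
          return true;                                /* TLB hit */
        }
      }
      return false;                                   /* TLB miss */
    }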
But it's important also that the TLB only finds mappings for the current process
and the current virtual machine if there's a virtual machine, right?
Because if you have two processes on the same machine
and they could find, you know, the other process's addresses. One of the big points of processes is
that you can't access the address space of another process, right? You as a process have your own view
of virtual memory. And it's different from the view of virtual memory from some other process.
So some TLBs are actually addressed by process ID as well as VM ID, whereas others are just
flushed on every context switch, right?
So maybe every time the kernel touches, say, CR3,
that flushes the TLB as well, right?
Every time it changes, it does a context switch,
it changes the base of the page table or something,
and then that flushes the TLB at the same time, right?
So how hardware works there, it varies quite a bit.
Right, and this is also why it's more uncommon
to see fully virtual
caches because the fact that two processes might have different views of memory and that you don't
want to get the two different processes views of memory confused means that you might actually use
more of a physical notion instead of having to flush all the entries anytime a process switch
occurs. Yeah. So then there's kind of physical questions to help make this more real in our heads. How many entries are there in a TLB?
There can't be that many since there's a bunch of hardware in there implementing an associative
array. So lookup tables in hardware are called CAMs, content-addressable memory, which basically
does a big parallel comparison on keys to figure out which value
should be selected. And usually you'll say only one should ever match, or at most one should ever
match. And so this kind of gets into more of a detailed notion of how many cycles does it take
to hit in my closest cache, the L1 cache, and what are those cycles of latency doing? And so usually
it's about three cycles of latency these days to hit in an L1 cache
on a super tight timing path.
So what happens is we look up an entry in the TLB
and that tells us what the tag is
that we compare against for that address.
And then we look up the index in the cache,
so groups of addresses together,
and then we resolve the index against the tag.
So that needs to probably be
spelled out in a little bit more detail, but roughly it's kind of this couple of step process.
That TLB is ultimately software managed. So when software changes the page table,
the operating system changes the page table, it flushes out the entries from the TLB. I'm not
totally sure, but I think page table walkers are pretty
complete. Like they don't need any assistance from the operating system except to be flushed out.
Like the TLBs get flushed out and the page table walkers are basically entirely autonomous
and don't need to like ask for more assistance from the OS. Sometimes these little fast
accelerator things, they can only handle
common cases, but because walking the page table is a fairly straightforward process,
I think they can be pretty complete and not need to ask for help.
Yeah.
So on a TLB hit, the hardware queries the cache and accesses the cache line if it's present in
there. And if it's not present, then the cache protocol kicks in looking into the
next level of cache, right? The cache asks, hey, next level cache, do you have this line? I don't
have this line. And we should probably do a whole episode on kind of how caches line up and talk to
each other. If it's not in any level of cache, then it'll go actually fetch the line from DRAM. And again, we should do a follow-up on DRAM row buffering
and cache line replacement policies and things like this that are involved in the memory subsystem.
Right. And so if the TLB misses, then the page table walk occurs, right? And so when you do the
walk, it'll either find the entry or it won't find it if there's no entry there, right?
So if the entry is found, you'll add it to the TLB because that's a cache, right?
So you want to find it next time
and you'll do same thing with the cache, right?
It's worth noting that some TLBs are managed by software
and some by hardware, as you were saying, right?
So who adds it to the TLB
depends on what the hardware dictates, right?
If the hardware is going to put it in,
then it's going to do that.
Otherwise, the OS might do that.
Right, you don't need to have a hardware page table walker. You could
just ask the OS when you page fault, hey, could you resolve this for me and put it into the TLB
yourself? Right. And then the hardware would try again and look in the TLB and be like, oh,
it's here now, right? Right, exactly. It would resume the user space program. Of course,
you also have to check those access permissions for the page. So if the page is mapped as read only, and then you do a store on the CPU, then a fault still
needs to be raised in that case, because you don't have the access to write to it.
Right, right. And we're talking about this, and it sounds super complicated,
because there's multiple ways to implement all those ideas, right? Like different hardware does
differently. But it's even more complicated. Folks might know there's separate data and
instruction caches, right? For, you know, for the code that's running versus the data that it's manipulating. But you can remember that code itself lives in memory. So code is effectively data, even though it might be cached differently. So say you execute an arithmetic instruction, right? So an add, right? Like literally in the CPU, you do an add. Doing an
add might cause a memory access, right? So I was saying earlier, like a third of instructions
roughly cause memory accesses, but like, if you start thinking about how instructions are actually
data and you need to fetch them, well, any instruction could cause a memory access, which
is kind of another kind of mind bend right there. Because fetching the code from memory potentially
causes a page table walk. So that add could be like, you know, four, five, six loads or something like that.
And of course, on top of the instruction data caches, you might have instructions and data
TLBs, right?
So the TLB itself could be two completely different TLBs, for data and for instructions.
And that makes me think, like, I wonder which instruction can cause the most memory accesses
ever. Well, like, you know, as things on the internet happen, someone a while ago showed that
MMUs are Turing complete. You know, it breaks our fun right here, because if it's Turing complete,
just the MMU, like, you can have a whole Turing-complete program without any instructions
whatsoever, which is honestly kind of boring at that point. It's fun, right? But it breaks my fun question. But if you stuck to just simple instructions,
then how could you generate the most loads and stores
from a single instruction?
Well, there's scatter gather instructions
in the vector units, right?
And if that one misses, and accessing the instruction itself
causes a page table walk,
that might be, say, four or five accesses.
And then the scatter gather itself might touch like eight locations or 16 or something. And so
each of them could miss in the cache or something like that. So, you know, it'd be interesting to
see if people listening to us on the phone line or on the interwebs have ideas of how many accesses
could be done by one instruction. I'd love to hear what they think. Yeah, the AVX gather instruction has astounding amounts of memory-level parallelism. So it's
pretty cool to see as these new instructions get introduced, what it does in terms of questioning
the assumptions of what we usually do with scalar code. It's also really mind boggling to think
about the fact that stores that are happening on the data path can be storing into memory locations
that are cached inside of the instruction cache. If they need to be kept coherent, then immediately
the store needs to be observable by the instruction side of the machine. That's always kind of
mind-boggling. Whereas on some machines, you'll have to explicitly flush your instruction cache in order to perform non-coherent writes into your instruction memory.
There's also devices which can get a view of memory that's virtualized to help prevent malicious PCIe devices.
If I made an evil FPGA card and tried to plug it into somebody's machine, ideally there'd be some protection from it just scanning all your memory, you know, grabbing all your passwords or secret keys or whatever it would want to get at.
So we actually have this notion of IOMMUs, which is I/O memory management units, which can hold
virtualized page table understanding for these peripheral devices. But it's probably too much to talk
about in this episode to get a deep dive on IOMMUs. Yeah, definitely.
So back to our initial example.
So the virtual address was zero.
Some operating systems map the zero page without any permissions.
So it'll always cause a fault for every process.
So there's kind of a global bit that you can set on these entries to say,
this pertains to everybody.
Because we don't want to map anything in there, to catch bugs. Yeah,
it's so that, like, when you do mmap and you try to map a thing that is at zero, it'll just
say, no, you can't do that. It doesn't have to handle zero specially, because it's already there. I see,
gotcha. There's already an entry that ends up acting similarly to if the address wasn't
mapped at all; then the page table walk would fail saying, I don't see any entry here.
And that would cause a segmentation fault as well.
Yeah.
So what is a segmentation fault?
Yeah, so what's a segmentation fault?
That's a good question.
So the hardware does the walk
and then tells the OS like,
hey, I couldn't find anything
or that thing wasn't valid, right?
So what the hardware ends up doing:
it looks at the OS, and the OS has special registers that it set up, like when it started up, and it tells the hardware,
well, if this bad thing happens, like,
start executing here instead; there's a function that I want you to enter at this location.
Then the hardware, there's like an ABI, an application binary interface, between
the hardware and the operating system, where, for each type of fault, it really
looks like a C function call, where there are values the hardware leaves in registers
when it calls that interrupt handler, right? And it's exactly like a regular function call at that
point. And so the ABI defines, you know, which values should be in which registers. And from the
kernel's perspective, it just looks like a regular function if you look at
it. It just has to follow that ABI. You'd find that in the architecture handbook or something
like this of how to program this kind of machine, what's going to happen when some CPU exception
occurs. Exactly. And for this type of problem, for this type of trap, you would expect an enum
value, so an integer that tells you the kind of
problem that occurred, so say page fault, and then the address at which it occurred, and maybe, you
know, what else you were doing, like maybe which process ID you were in, or something like that.
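Something like this hypothetical sketch; the names and layout are made up, and the real thing is architecture-specific (on x86, for instance, a page fault delivers an error code and leaves the faulting address in CR2):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical shape of what the kernel's fault handler gets to look at. */
    enum fault_kind { FAULT_PAGE_FAULT, FAULT_DIVIDE_BY_ZERO, FAULT_ILLEGAL_INSN };

    struct fault_info {
      enum fault_kind kind;      /* which exception fired                        */
      uint64_t        fault_va;  /* the virtual address that caused it (0 here)  */
      uint64_t        ip;        /* instruction pointer at the time of the fault */
      bool            is_write;  /* was it a write access?                       */
    };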
And then the OS would figure out which process was executing based on that, right? So maybe it
remembers, hey, this is what I was running when I was trying to do something. And then the OS decides what to do, right?
So in some cases, you know, if the OS itself is the one that generated a page fault, maybe
it's like, there's a bug in me somewhere.
I didn't expect to do a page fault here.
So I'm just going to, you know, panic.
Could do that, right?
The whole machine would panic.
But if the OS itself was running in a hypervisor, then the hypervisor would get first dibs on
handling the hardware exception. And then it would maybe pass it on to the OS or something like that.
And then the OS might say, well, that wasn't my problem.
It was the program that was running this problem, so I'm going to pass on this problem to the program.
I'm going to tell the program there's a problem.
And that usually comes in as a signal, right?
Now, in a lot of programs, most user-based programs that people write,
I think people don't really handle signals, right?
So what'll happen is the OS will look at the program,
say, well, here's the default signal handler and here's what happened, which address.
And because the process won't have something,
a signal handler installed,
it'll just call the default handler.
And the default handler might be, you know,
log something like, thank you for playing, and then, like, exit or something.
It'll print out like, segfault happened at this location, and stop the program.
Just call abort.
So, yeah, that's what would happen with a page fault.
Right.
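A minimal sketch of the "install your own handler" path on a POSIX-ish system; printing from a signal handler isn't strictly async-signal-safe, but it's fine for a demo:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* SIGSEGV handler: with SA_SIGINFO we get the faulting address in si_addr. */
    static void on_segv(int sig, siginfo_t *info, void *ctx) {
      (void)sig; (void)ctx;
      fprintf(stderr, "segfault at %p, thanks for playing\n", info->si_addr);
      _exit(1);  /* a real runtime might unwind or longjmp instead of exiting */
    }

    int main(void) {
      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = on_segv;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      *(volatile char *)0 = 0;  /* the store from the episode title; faults here */
      return 0;
    }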
And if you were running in a debugger or running your program under a debugger, then the debugger would have installed signal handlers, and then it would report the issue to the developer, and it would go back to the debug
interpreter, right? Like the GDB prompt, it would say, yeah, SIGILL or SIGSEGV happened at this
location. Yeah, because the important thing to remember is a debugger itself, GDB or LLDB or
whatever, they themselves are a program, right? They're running your program in a kind of thin emulation layer,
but they're a program.
And so the way they do stuff is they talk to the kernel.
They say like, here's a signal handler.
Here's this thing, that thing.
Do it a bit differently for me.
And then they let the program pretend that everything's executing regularly
when actually they're running inside of the debugger, right?
But, you know, debuggers are pretty fancy.
They can do a lot of fancy stuff, but there's other programs that do this type of thing, installing signal handlers for a variety
of reasons. You know, we've both worked on JavaScript engines and the JavaScript engines
in browsers optimize things like WebAssembly by installing a signal handler to capture
invalid memory accesses. So let's dive into that a bit because it sounds a bit odd.
So when you have a
browser, just a web browser, it just runs untrusted code, right? Like the internet is full
of code and the browser can't trust any of it really, right? But the execution has to be secure.
But how do they do that? Well, JavaScript is a dynamic language, but it enforces certain
semantics that should prevent any kind of faulting or whatever from happening. But the way WebAssembly works is that it's supposed to be a virtual machine that allows you to execute kind of code that looks like C or C++ or Rust, like kind of low-level static code.
And the way it works is when you run that code, it creates what's called an instance.
And it has associated with it a contiguous slab of memory that's, in the current state of WebAssembly,
limited to four gigabytes, right?
Because Wasm, WebAssembly, is a 32-bit process model.
So that gives you a maximum amount of four gigs of memory.
Now, when you run a Wasm program,
the implementation itself,
what it'll do when it starts up your Wasm program
with its memory is it'll literally map four gigabytes plus a red zone virtually.
Now, that sounds huge. You go on a web page that maps four gigs, that sounds huge.
But I want to point out this is virtual. It's not physical memory.
So it occupies page table space to do that, but it doesn't use physical memory, right?
And the implementations then perform all memory accesses within WebAssembly as a base plus offset memory accesses, right?
So the base of that four gigs, whenever you're in the WebAssembly VM, that's what all the accesses within WebAssembly occur through.
Base plus offset for every single access.
And anything that wasn't allocated to
WebAssembly will cause a fault, right? So that four gigs is mapped virtually, but it's mapped
no read, no write, no execute, unless the user program has asked to map them. And so instead of
checking, like, have I mapped this every time in the VM to make sure it's safe, you can just let
the hardware tell you if it hasn't been mapped, because that's literally the hardware's job, right? So the way most WebAssembly VMs nowadays
implement WebAssembly is just by asking the hardware, let me know if this isn't mapped,
because it's a really fast path, right? The hardware is really good at that.
And what's interesting is it needs to be secure. So the WASM compiler has to be careful when you
do base plus offset that it doesn't go way past that four gigabytes.
So that's why we have the red zone, right?
Say it's like 128 megs or something
at the end of the four gigs.
And any access that the WebAssembly compiler
can prove might go past that.
So way past four gigs plus 128 or something like that.
It'll explicitly check,
is this going to go out of bounds of that 4 gigs region?
But anything inside of it, it'll just let it trap, right?
It'll let the hardware handle that.
And realistically, if you look at Wasm VMs, they can prove that most memory accesses are
within that bound anyways, right?
Because it's really rare that you have very large offsets for loads and stores.
So that's kind of neat.
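A rough sketch of that reservation trick on a Linux-flavored system; the sizes and function names are illustrative, not any particular engine's API:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define WASM_MEM_SIZE (4ULL << 30)    /* 4 GiB: the full 32-bit index space */
    #define GUARD_SIZE    (128ULL << 20)  /* red zone / guard region after it   */

    /* Reserve 4 GiB + guard with no permissions. Nothing is committed yet, so
       this costs address space and page-table bookkeeping, not physical memory. */
    uint8_t *reserve_wasm_memory(void) {
      void *base = mmap(NULL, WASM_MEM_SIZE + GUARD_SIZE, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      return base == MAP_FAILED ? NULL : (uint8_t *)base;
    }

    /* Growing the Wasm memory later means flipping committed pages to R/W:
         mprotect(base, committed_bytes, PROT_READ | PROT_WRITE);              */

    /* Every Wasm load/store is then just base + 32-bit offset, with no explicit
       bounds check: touching an uncommitted page faults, and the engine's
       signal handler turns that into a Wasm trap. */
    uint8_t load8(uint8_t *base, uint32_t offset) {
      return base[offset];
    }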
If the Wasm memory access faults, then the browser's signal handler checks whether it was one of the WASM memory pages. Because you remember, signal handlers are a whole process thing. So the process doesn't know I'm running WebAssembly code versus JavaScript code versus DOM code or whatever else. It's just, it's running code, right? So the browser has
installed a signal handler and the signal handler has to go and say, well, okay, let me see if that
was a WebAssembly memory access. Is it this four gigs, that four gig, the other four gig? And if so,
it'll do stack unwinding, as we talked about in the last episode, to unwind the stack out of
WebAssembly and throw a JavaScript exception, which is kind of neat, right? It throws a
JavaScript exception from a WebAssembly memory access that faulted in hardware.
Yeah, and that general technique is pretty cool and reusable. So for Java and null pointer
exceptions, instead of explicitly checking for null pointers when you're in the highest
mode of JIT code execution, when you're most optimized, an implementation can let most memory
accesses
just fault if they're null,
because it'll know that the offset that it's accessing
would still be within the null page.
And then it would just catch the faulty access.
And then it would resume execution
by throwing a Java exception
out of that null dereference signal handler,
which would be like a SIGSEGV or something.
Right, right, right.
So that's a super common technique
for virtual machines in general.
So it's pretty cool.
Well, that was really a lot that we talked about,
but I think we covered what a TLB hit is.
So that's cool.
But one last thing,
how about we quantify these numbers?
Because we're talking about hits and caches
and whatever and registers.
But I remember Jeff Dean gave a talk
where he had the following performance numbers.
Do you want to tell us about that a bit since you're really into that stuff?
Yeah, sure. I think a lot of the numbers are probably mostly similar to how they were when
this talk was given. So we kind of talked about an L1 cache reference that hits, that can be just
order like three cycles or something, depending on your processor. So if we scale everything to be
nanoseconds, that'll be like a fraction of a nanosecond, right? Usually you're running on like
a gigahertz processor, or a three gigahertz processor; then you'll have like 0.3 nanoseconds
per clock. So, you know, order a fraction of a nanosecond. Branch mispredict: usually you have
to flush a pipeline, and the pipelines can be reasonably deep these days. So you say something like 16 cycles.
So it's, you know, around five nanoseconds or something like this.
L2 cache reference, L2 cache is farther away.
So that'll be, you know, maybe 10 clocks or something.
Last level cache, maybe 30 clocks or something like this.
So, you know, tens of nanoseconds for L2s and L3s.
Mutex locking and unlocking, you have to go through a cache coherency protocol.
And, you know, it could be across NUMA if you're running on like a multi-socket machine.
But, you know, you'd kind of roughly say it could be hundreds of nanoseconds, or about 100 nanoseconds, to resolve this sort of thing.
Main memory references also can be order 100 nanoseconds. Then when we start talking to peripherals,
when we're traversing PCIe, PCIe might be something like 500 nanoseconds away. So that's
about everything that's inside of the local machine, let's say. Then there's things like
disk and going over the network and things like this. But these are kind of all of the close
times that you might have to deal with. And so you could imagine walking the page tables and
doing a bunch of main memory references.
If you have to walk four page table levels,
that might be four back-to-back 100 nanosecond accesses.
So that'd be like a 400 nanosecond load.
So loads can vary from a fraction of a nanosecond
to hundreds of nanoseconds, which is really interesting.
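Pulling those ballpark figures together in one place (all rough orders of magnitude from the discussion above, and machine-dependent):

    L1 cache hit              ~3 cycles, a fraction of a nanosecond
    branch mispredict         ~16 cycles, around 5 ns
    L2 cache hit              ~10 cycles
    last-level cache hit      ~30 cycles (so tens of ns for L2/L3)
    mutex lock/unlock         ~100 ns
    main memory reference     ~100 ns
    PCIe round trip           ~500 ns
    four-level page walk      up to 4 back-to-back memory references, ~400 ns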
It's kind of: loads are at the heart of computing,
these load-store machines, the von Neumann machine model, and the
loads are so varied in their performance characteristics. Right, right. And then you can go
all the way to, like, you know, when you start sending packets across the world, or, you know, to
satellites or whatever else. So maybe we should do something about that, talk about this type of stuff
in a future podcast. Yeah, sounds great. Everybody likes good performance numbers. Yeah, well, there's
a lot of stuff to talk about, so I think we should adjourn now and, you know, follow up on a subsequent podcast.
Sounds good. Catch you next time, my friend.