TLB Hit 💥 - Episode 1: *(char*)0 = 0
Episode Date: November 23, 2020
The adventure of storing NUL to NULL!...
Transcript
Oh, hey, Chris. So we talked about TLBs, but we didn't really define what the acronym means
last podcast. So let's rectify that by talking about it way too much.
So I suggest that today's episode be titled Star Char Star Zero Equals Zero.
Hey, JF. Yeah, we should also continue to note that we only know so much,
but we'll try not to say things that are wrong. And we'll follow up with errata
corrections and clarifications as we find out about them
and put them on the website.
We do the lifelong learners thing
and we're always trying to push the boundaries
of things that we know
and have conversations about what's interesting
to develop and hone our understanding.
So feel free to give us feedback.
All right.
So we got some feedback from last episode.
So I'll do the first mea culpa here from the previous episode. I said that Go uses segmented stacks, but it doesn't
anymore. It used to, doesn't anymore. They do stack copying now. So sorry about that. We were
slightly inaccurate, but mistakes are how you learn. So let's continue. Something else I want to point
out. So we've got extensive show notes on the website. So if you're not sure about a term
that we use or anything like that, it should be on tlbh.it. That's TLB hit, but with a .it domain
name. So we'll have show notes there. All right. So today, how are we going to talk about TLB hits?
Well, yeah. So how about we figure out how a program gets to a TLB hit to help us define
what a TLB is and what hitting it means.
So what I propose is we look at a C program.
So it'll start like this.
Int main, open paren, close paren, open curly brace, star, open paren, char star, close paren,
zero equals zero, semicolon, close curly brace.
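For reference, written out, that's:

    int main() {
      *(char *)0 = 0;  /* store the value zero to address zero */
    }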
And I guess for people who don't have a teletype writer inside of their
head: basically that's an int main where we're taking null, or taking zero, turning it into a
char star, dereferencing it, and assigning it null. So storing zero to address zero as a char star.
Exactly. How could you not love this program? What about its much more explicit C++-y cousin
where we use the null pointer value
and we reinterpret cast it?
What about that one?
Right.
So if you did reinterpret cast of char star,
take null pointer,
shove it into reinterpret cast
and then dereference that,
that just wouldn't compile, right?
Because you can't reinterpret cast null pointer.
It's just a C++ thing.
One thing you could do as of, I guess, C++20
is you could use a std bitcast, right?
So bitcast just takes the value and moves it,
moves the bits,
but that's not guaranteed to have a value
because null pointer itself in C++,
it's kind of unintuitive, but it's just a type.
It doesn't have a value.
It's like a monostate type, right?
The fact that you have a null pointer is all you need to know. There's no value inside of
it. So the value is not relevant. And so it's not really guaranteed what you get when you bitcast
or when you copy a null pointer. So this is maybe one of those examples where the old ways are
actually a little bit more clear: just char star, C-style cast, and dereferencing it. So what, in C++,
what's the difference between a bit cast and a reinterpret cast again? There must be some reason
that we have both of those, right? Yeah. Yeah. So a while ago, like the C++ committee had talked
about having reinterpret cast of references, to allow you to reinterpret something as something
else. But what was decided for C++20 is to have bit cast instead. And what it
does is it takes the object representation of one object and takes its literal bits, right,
as if by memcpy, and then interprets those as the object representation of an other object,
right? So what's interesting here is it requires that the types, both the source and destination
types have the same size.
So you can't bit cast 64 bits to 128 or something like that.
And it requires that both be trivially copyable, which is basically a fancy way of saying you can memcopy it.
It's valid to memcopy it.
So there's no stuff you do when you copy it.
No special stuff besides memcopy.
And then what's interesting about bit cast compared to reinterpret cast is it
creates a completely new object, right? So it returns by value instead of returning a reference,
right? Whereas reinterpret cast returns a reference or a pointer, as you specify it.
So there's a lot of tricky stuff around there about, you know, whether value representations
exist or not in the to type or the from type and whatever else. But the key here is basically like
null pointer t just doesn't have a value anyway. So it's kind of silly to bit cast it.
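A quick show-notes sketch of the difference, with a float-to-bits example that's ours rather than from the episode:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    int main() {
      float f = 1.0f;

      // C++20 std::bit_cast: source and destination must have the same size and be
      // trivially copyable; it produces a brand-new object, as if by memcpy.
      std::uint32_t bits = std::bit_cast<std::uint32_t>(f);
      std::printf("%08x\n", bits);  // prints 3f800000

      // reinterpret_cast, by contrast, hands back a pointer or reference to the
      // *same* storage; no new object is created. (Viewing the bytes through an
      // unsigned char* is one of the few aliasing-safe ways to do that.)
      unsigned char *bytes = reinterpret_cast<unsigned char *>(&f);
      std::printf("%02x\n", bytes[0]);
    }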
Do any of us really exist? Can we really know? I don't know.
I don't know.
Yeah. It's always fun to learn about the language-lawyery aspects of C++ because you kind of just
type in reinterpret cast or bit cast, but you don't necessarily think like,
what's the meaningful distinction between the two of these things?
Right.
So from your description, I can't reinterpret cast a char to an int.
All right.
So it sounds good.
We should probably literally get back to the program.
So we're dereferencing null and assigning it zero.
And it's a char star.
Yep.
Yep.
And so what I like about this program,
what I really love about it is how really open it is, right?
You write that on a whiteboard and you're like,
so let's talk about this.
And I've done that pretty often.
And some folks will talk about, you know, the language.
Some folks will talk about how a compiler implements this.
Others will talk about how an OS or hypervisor works, the instructions you see.
Some folks, you know, will want to talk about how the hardware does things with it, right?
So it's pretty open and there's a lot to talk about, right?
So let's dig into the basics.
Conceptually, what we're trying to do in this C function,
and spoiler, we're not really succeeding at it,
but what we're trying to do is we're trying to reinterpret the value zero as a pointer,
dereference that pointer,
and then store another value, zero, inside that memory location.
That's what we're trying to do at a high level.
Right, so if you're going to translate that abstractly from C to assembly,
right, it's always good to do an exercise in your head where you look at some C code and figure out
how you'd straight-line translate it into the corresponding assembly code. So we'd kind of
take this constant zero and, with a store instruction, we would move into that zero address location
the zero value.
Exactly.
Yeah, so pseudocode is always a great way to look at that.
And even with people who don't necessarily
know assembly very well, right?
Pseudocode is something that most people
should be able to get, like the basic,
what does the CPU generally do, right?
And so creating an immediate, moving it into register,
and then storing to that register,
that's basically what it's trying to do, right?
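Roughly, in ARM64-flavored pseudo-assembly (a hand-wavy sketch; the real instructions depend on compiler and target):

    mov  x8, #0        ; materialize the address zero in a register
    mov  w9, #0        ; materialize the value zero
    strb w9, [x8]      ; byte store of zero to address zero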
Makes sense.
And so I think a question we can ask before we even get to the assembly
is does the compiler even let you compile this program?
So you're writing something into null,
and so does it let you turn it into assembly,
or does it give you an error at
compile time? And if it does give you an error at compile time, why? Some people will run a static
analyzer on their code to catch definite null-dereferencing errors like this, that you can
determine statically, but that probably won't come out of the box as part of invoking your normal C
compiler. If I just fed this to GCC or Clang, they would happily turn that into a couple of assembly instructions for me.
Right. You know, if you look at, say, Clang static analyzer, it'll flag it.
It has a thing that tries to detect any type of null pointer dereference.
And those are possible null dereferences, right?
Like nobody writes code literally like this, except us, I guess.
But if it can try to figure out like,
hey, you know, you initialize some object to null or a pointer to null.
And then if you take exactly this path,
like a static analyzer could figure out,
you know, this object is null.
It'll tell you, right?
And this is a legit source of bugs in real code.
So a lot of people will run some static analyzer
as part of their pre-submit scripts
or, you know scripts or nightlies
or whatever.
And other static analyses like Coverity also do stuff like that.
All right.
So at least the compiler will let us turn this into some binary, some set of assembly
instructions.
Right, right.
But now we should note that this is undefined behavior, both in C and C++.
That code is just technically undefined behavior.
Right.
And undefined behavior, that generally means it can be exploited by the compiler that this
thing is not the case, right?
So when you say X is undefined behavior, the compiler can assume that X is something that
doesn't happen.
Because if you did happen, then it would be undefined behavior that occurs.
So X in this case is I will not write to address zero dynamically because you're only allowed to write to actually existing objects.
And then the compiler can use the fact that it must be an existing object to do optimizations or what have you.
So if you actually do this at runtime,
then all bets are off for what the program behavior actually is.
Right, right. And the important thing here is this is all a dynamic property, right? Like if
you have undefined behavior in part of your code that never runs, then it's not undefined behavior
because it never executes. So in this case, if you actually execute that code, it is undefined behavior. And it's really that, at least in C++, C++ likes to think about objects.
Everything's an object. And if there's no object there, then you're not allowed to write to it.
There's no lifetime that's begun for that location. And at a high level, undefined behavior is not
the compiler trying to be a jerk to the programmer, right? It allows the compiler to assume that something isn't possible.
And if it were to happen,
then, you know, that's a bug in your program.
And it ends up being kind of a ripple effect
in code optimization, right?
Where each individual step as an optimization makes sense,
but the final result is often pretty surprising to people.
So in this case, we could make it well-defined
by using volatile and using a non-zero value like 1, for example.
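A show-notes sketch of that well-defined variant, plus the memory-mapped-register pattern it resembles; the 0x40001000 address and the UART register here are made up for illustration:

    #include <stdint.h>

    int main(void) {
      /* The episode's program, made well-defined-ish: non-zero address, and
         volatile so the compiler has to actually emit the store. (On a normal
         user-space OS this still faults at runtime, since that page isn't
         mapped; the point is the language semantics and the code emitted.) */
      *(volatile char *)1 = 0;

      /* The same pattern as it appears with memory-mapped hardware registers.
         The address and the register are hypothetical; in real code they come
         from the vendor's manual or header. */
      volatile uint8_t *uart_tx = (volatile uint8_t *)0x40001000;
      *uart_tx = 'A';  /* would tell the (imaginary) UART to transmit 'A' */
      return 0;
    }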
This pattern, not specifically 1, but the pattern of taking some number, casting it to a volatile char star or int star or something like that, you see that really often with hardware-mapped registers.
So your hardware vendor will say, well, if you write at this address, it's kind of a magical value in my hardware.
It's documented in the manual or the header file. And if you write to it, here's the behavior that
you get out of the hardware. It'll do something special. Right. So we would use a volatile address
of one and the value could still be zero. And that would be well-defined in that case. And so
zero realistically is not mapped into the address space in most modern user-space environments. So in
kernel mode, or in more niche architectures, this actually could do something: you could have
a peripheral that had some register mapped in at memory address zero, or you could just not fault
on a write to address zero, potentially, if you were running some embedded
kind of system. And on x86, for example, in physical memory, the interrupt vector table
resides there. Right. You know, we talked about what we think the assembly would look like,
but let's look at what the assembly really looks like, right? It's good to form an idea of what
you think is going to be there and then try it out in reality and see if you were wrong. Now, what's funny is if you go to, you know,
godbolt.org or whatever, and you punch that in, you try GCC or LLVM in different modes,
so x86 mode, ARM mode or whatever. What's really funny is that without volatile, Clang just
literally on x86 makes this UD2. UD2 is just an instruction that traps, right? And it makes it BRK, another
instruction that just traps, on ARM64. So the code you get after optimization is just like, you know,
you enter main and you crash. That's all the code you get outside of Clang if you don't use volatile.
Now, GCC is a bit nicer. What it does is, instead of just emitting UD2 or
BRK, it'll dereference null and then emit UD2 or BRK if you don't have volatile in your code,
right? So GCC is being nice. It's doing what you asked it to do. And then it just, it just crashes
afterwards. But if you add volatile in there, right? So you do star, paren, volatile, char star,
close paren, zero equals zero. What you'll end up seeing is on x86, you'll get a move
of byte pointer, you know, square bracket, zero comma zero, or on ARM, it's STRB. So a byte store of
some register into address zero. Now, what's extra funny is GCC still emits UD2 after the
null dereference, even if it's volatile. And GCC even ignores the value being written.
So if we change the code to star volatile char star 0 equals
0x42 or something, the value it tries to store is still 0.
It tries to store it at address 0,
but the value it tries to store into it is 0.
It ignored our 0x42, and then it has a trap after it, right? So this is a bit
unintuitive, and I know it's pissed off a lot of kernel people in the past, but you can tell GCC
to stop this with the option -fno-delete-null-pointer-checks, which is super unintuitive and
frustrating to people around volatile. But if we step back a bit and we say like, okay, well, zero is a special value,
but we'll dereference one instead, right? So star volatile char star one equals zero or equals 0x42
just to get rid of the silly UD2, then everything's fine, right? So it emits the code that we'd
expect. So this means the address zero is really treated specially by both Clang and GCC. And in
this case, volatile just tells the compiler when it sees address one here,
that some effects that it can't reason about
might be happening, right?
There's something special.
It shouldn't try to be too smart,
even though it was trying to be smart.
And using volatile in this very specific case,
we're just tricking the compiler,
telling it like, you know, don't do that.
But this pattern, if you have, you know,
a hardware-mapped register, is legit a way
to access special memory that's hardware mapped, right? So it's something that happens in every computer pretty frequently. Okay, so let's
focus on user mode instead of talking about kernels or embedded devices or whatever. How does
this instruction cause a trap, or should we call this a fault? Yeah, that's a good question. Maybe
first we could talk a little bit about exceptional things that happen in the processor in general when you're running a program.
So there's this taxonomy we actually have of weird stuff that happens that's independent of the normal flow of instructions through the processor.
Sometimes people break them into categories of traps, faults, and aborts.
So traps would be something where you're kind of asking the
operating system for assist or to inspect what happened. Faults would be, oh, something actually
went wrong and maybe it can be corrected. And aborts are just, you know, the program basically
has to terminate. That's kind of what that taxonomy means. Generally for exceptions inside
of the CPU, there's questions like, do I need to be able to show the exact architectural state where this exception thing happened?
You know, there's differences between things like divide by zero and a system call where you're actually asking the operating system for something.
And a peripheral device sending you some asynchronous interrupt like, hey, service me. I have some packet that came in or something
like this. And then there's related questions like, can the hardware effectively reorder things,
assuming that a load is non-null? So this gets back to precise exceptions. So if a segfault
happens, you want to be able to analyze the cause of the segfault that you got instead of just kind
of skidding to a stop however it happens with the program. You want to say, ah, this is exactly what led me to segfault here. And that
gets into things inside of CPUs like reorder buffers and the point of no return, to say whether
you can start clobbering your architectural state with more things that happen. And then you get back
up to the compiler level and you ask, can the compiler reorder things, assuming that a load is non-zero?
So even though the abstract machine in the C standard or the C++ standard has no notion of what a program looks like when something gets loaded from zero,
compilers are bridging the practicality gap between this abstract machine concept and what people really expect their program to do when running against the hardware, which again goes back to this notion of implementation defined or undefined behavior.
And even more in the wacky things that happen zone, there's also traps that can happen or
faults that can happen within instructions. So if you have a repeated instruction like REP MOVSB,
which is like the memcpy instruction,
that's a canonical example of one of these composite SISC instructions where maybe part of it goes well, and then there's a fault somewhere in the middle.
So how do you recover?
How do you resume?
These become interesting questions, like with the multiple-register stack push ARM instruction that we talked about
in the last episode.
There's also devices that trap the CPU potentially in the middle of an instruction, like someone sends an asynchronous interrupt and it's a long-running instruction. How do we preempt the long-running instruction? Or something that races with another trap, like if you have a uniprocessor, a single-core system, where the non-maskable interrupt comes in while some exception is also being raised from somewhere else.
So memory accesses in user mode are all dealing with virtual addresses, which we talked
a bit about last time. So virtual addresses are not the same as physical addresses, right? We said
physical addresses are actually like DRAM locations that are all linearly lined up in physical memory
where that DRAM is. Each process and each guest virtual machine, if you're doing virtualization,
has its own view of memory, which is the virtual memory view
that's distinct from how other processes and virtual machines might see memory.
And that's all done through this virtual memory address layer.
Right. But then like, so we have these two worlds of virtual and physical stuff,
and it's pretty complicated how one maps to the other. But when you write a program,
not, like, you know, in a high-level language like C, but when you write assembly, right,
maybe a third of, you know, probably a third of all instructions are memory accesses.
That's a lot, right? So all this complication, and a third of instructions have to deal with it,
right? Right. So I'll flag a citation
needed on where does the one third come from. But I think probably your reasoning is that if a basic
block is around five instructions, then the conventional wisdom is there might be one load
and one store in there. So two fifths or something like this. Right. Yeah. That number is totally what
I technically call an NPOA. So it's a number
pulled out of S, right? It's roughly right. About a third of instructions is about right. And as you're
saying, you know, if you count, you know, five instructions in a basic block and there's some
load and some store, it's about right. Still, I like the idea of the order of magnitude, you know:
maybe a third, or a bit less, of instructions are loads and stores, or access memory.
That's a lot. Right. So in computer
systems, we're always talking about the fast path or the common case, because if we can speed that
up, then, you know, it might be a large percentage of the workload and you're speeding up the most
important things in that case. So in the common case, when you've addressed this memory location
recently, you want accesses to it to be fast. And caches are how this is made fast. And
we talked a little bit about that last time. Some caches can be addressed virtually using the
virtual addresses, and some can be addressed physically using the addresses that you resolved
from the virtual address. So that means that the virtual address could be used for what's called a
virtually addressed cache. And of course, it's addressed by the process and its view of the virtual memory. So there's like a virtual
machine ID or a process context that tells you this is the process and this is the virtual
address that that process has. Yeah. Yeah. And so as soon as you address physically, and
most caching is done physically, then you need to somehow translate the virtual address to a physical one.
And the kernel keeps that mapping in what's called a page table.
So that's basically a map of virtual address, process ID, VM ID, something like that, all the relevant bits.
And it maps those bits to the physical address, as well as some properties of the memory location, such as access permissions.
So can you read there?
Can you write there?
Can you execute that location?
And at a high level, the page table is organized as a tree, right?
So if you think about the tree data structure, it's laid out in memory.
And each level of that tree is keyed off of some bits of the virtual address because it's going to be pretty sparse.
And then flattening that tree would make it pretty huge.
So you want to have a sparse layout for it.
Right.
And the base address of that tree where I can start walking it is pretty special
because in x86, we might have a hardware page table walker
and it needs to know where to start walking that exact structure.
So there's this cool control register called CR3, if you've ever heard of that.
So what's really cool about this is the kernel puts some stuff in memory and then tells the
special register and hardware, hey, this location is special.
And then the hardware can start walking an actual data structure that's in memory, that
the kernel agreed to.
That's kind of mind-boggling to me, right? Like it's something that as a user space programmer,
you're just really not used to. The hardware just doesn't really touch your stuff, it executes it,
right? It doesn't walk your data structures in fancy ways, except here it totally does.
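As a show-notes sketch, here's roughly what a four-level, x86-64-flavored walk of that structure looks like with 4 KiB pages; it's simplified (real entries carry permission and large-page bits), and read_phys is a hypothetical stand-in for reading physical memory:

    #include <stdint.h>

    uint64_t read_phys(uint64_t phys_addr);  /* hypothetical: read 8 bytes of physical memory */

    /* Walk a 4-level page table rooted at the physical address held in CR3. */
    uint64_t translate(uint64_t cr3, uint64_t vaddr) {
      uint64_t table = cr3 & ~0xFFFULL;               /* base of the top-level table */
      for (int level = 3; level >= 0; level--) {
        /* Each level is indexed by 9 bits of the virtual address:
           bits 47-39, then 38-30, then 29-21, then 20-12. */
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = read_phys(table + index * 8);
        if (!(entry & 1)) {
          return (uint64_t)-1;                        /* present bit clear: page fault */
        }
        table = entry & ~0xFFFULL;                    /* next table (or, at the end, the frame) */
      }
      return table | (vaddr & 0xFFF);                 /* frame base plus offset within the page */
    }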
And so what's interesting here is the granularity of the mapping that we're talking about depends on memory pages, right? So the valid memory page
size, like what is that? Well, it has to be dictated by hardware, right? Because the hardware
has to tell you like, here's the valid page size and the kernel just follows whatever it's told
is valid. And then often the smallest page size is going to be something like say four kilobytes
or something like that. And in some hardware, you can add larger pages, right? So often it's called huge pages or super pages or something
like that. And yeah, so that's pretty interesting. And then finding the base of a page, how do you do
that? Well, that's actually pretty easy, right? You just mask off the bottom bits. If it's four
kilobytes as a page, you just mask off four kilobytes worth of stuff, and that's the base
of your page, right? So that's kind of interesting. Right. So then to zoom back out to the big picture, we have this page table structure
and that lives in memory. And then this piece of hardware, which is often specialized, let's call
it like a page table walker. It understands exactly how that's going to be laid out. It
knows what all the structs look like inside of there. And that's a contract between the piece
of hardware that's being run on and the operating system,
like you were saying.
And then the hardware is capable of doing lookups
into that data structure in memory
to find the virtual to physical mapping
by slicing off some bits from the virtual address
and then using that to walk through these entries
that are organized in a tree shape.
So hardware knows how to find that page table through that special register, CR3.
And then doing this walk, which potentially goes through like a bunch of different page
table levels, like four, for example, would be four data-dependent accesses to do it every
single time.
And like you said, there's a ton of load instructions.
So doing that for every single memory access would be quite expensive.
So what do we do?
The same way that we solve every other problem in computer science, we add a cache.
And because it's an important cache, it has a fancy name, which is Translation Lookaside Buffer, or TLB.
And every memory access looks in the TLB to see, hey, did I already look up the mapping for this part of the virtual
address to the part of the physical address? And if it's found, that's a TLB hit. And if it's not
found, that's a TLB miss. And that makes us sad. Yeah. Okay, cool. Right. So every memory access,
every time you have an instruction that does a memory access, it'll look at the TLB and try to
see if it's a hit or a miss. That's really interesting, right? And what it's doing is it's making it so that you don't have to do a page table walk
every time you do a memory access.
So that makes a lot of sense.
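A toy sketch of the lookup that's happening on every access; a real TLB is a small hardware CAM, often set-associative and tagged with things like an address-space or VM ID, and it does the comparison in parallel rather than in a loop:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64
    #define PAGE_SHIFT  12   /* 4 KiB pages */

    struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Map a virtual address to a physical one if we have the translation cached.
       Returns true on a TLB hit, false on a miss (go walk the page table). */
    bool tlb_lookup(uint64_t vaddr, uint64_t *paddr) {
      uint64_t vpn = vaddr >> PAGE_SHIFT;             /* virtual page number */
      for (int i = 0; i < TLB_ENTRIES; i++) {         /* hardware compares in parallel */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
          uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
          *paddr = (tlb[i].pfn << PAGE_SHIFT) | offset;
          return true;                                /* TLB hit */
        }
      }
      return false;                                   /* TLB miss */
    }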
But it's important also that the TLB only finds mappings for the current process
and the current virtual machine if there's a virtual machine, right?
Because if you have two processes on the same machine
and they could find, you know, the other process's addresses. One of the big points of processes is
that you can't access the address space of another process, right? You as a process have your own view
of virtual memory. And it's different from the view of virtual memory from some other process.
So some TLBs are actually addressed by process ID as well as VM ID, whereas others are just
flushed on every context switch, right?
So maybe every time the kernel touches, say, CR3,
that flushes the TLB as well, right?
Every time it changes, it does a context switch,
it changes the base of the page table or something,
and then that flushes the TLB at the same time, right?
So how hardware works there, it varies quite a bit.
Right, and this is also why it's more uncommon
to see fully virtual
caches because the fact that two processes might have different views of memory and that you don't
want to get the two different processes views of memory confused means that you might actually use
more of a physical notion instead of having to flush all the entries anytime a process switch
occurs. Yeah. So then there's kind of physical questions to help make this more real in our heads. How many entries are there in a TLB?
There can't be that many since there's a bunch of hardware in there implementing an associative
array. So lookup tables in hardware are called CAMs, content-addressable memory, which basically
does a big parallel comparison on keys to figure out which value
should be selected. And usually you'll say only one should ever match, or at most one should ever
match. And so this kind of gets into more of a detailed notion of how many cycles does it take
to hit in my closest cache, the L1 cache, and what are those cycles of latency doing? And so usually
it's about three cycles of latency these days to hit in an L1 cache
on a super tight timing path.
So what happens is we look up an entry in the TLB
and that tells us what the tag is
that we compare against for that address.
And then we look up the index in the cache,
so groups of addresses together,
and then we resolve the index against the tag.
So that needs to probably be
spelled out in a little bit more detail, but roughly it's kind of this couple of step process.
That TLB is ultimately software managed. So when software changes the page table,
the operating system changes the page table, it flushes out the entries from the TLB. I'm not
totally sure, but I think page table walkers are pretty
complete. Like they don't need any assistance from the operating system except to be flushed out.
Like the TLBs get flushed out and the page table walkers are basically entirely autonomous
and don't need to like ask for more assistance from the OS. Sometimes these little fast
accelerator things, they can only handle
common cases, but because walking the page table is a fairly straightforward process,
I think they can be pretty complete and not need to ask for help.
Yeah.
So on a TLB hit, the hardware queries the cache and accesses the cache line if it's present in
there. And if it's not present, then the cache protocol kicks in looking into the
next level of cache, right? The cache asks, hey, next level cache, do you have this line? I don't
have this line. And we should probably do a whole episode on kind of how caches line up and talk to
each other. If it's not in any level of cache, then it'll go actually fetch the line from DRAM. And again, we should do a follow-up on DRAM row buffering
and cache line replacement policies and things like this that are involved in the memory subsystem.
Right. And so if the TLB misses, then the page table walk occurs, right? And so when you do the
walk, it'll either find the entry or it won't find it if there's no entry there, right?
So if the entry is found, you'll add it to the TLB because that's a cache, right?
So you want to find it next time
and you'll do same thing with the cache, right?
It's worth noting that some TLBs are managed by software
and some by hardware, as you were saying, right?
So who adds it to the TLB
depends on what the hardware dictates, right?
If the hardware is going to put it in,
then it's going to do that.
Otherwise, the OS might do that.
Right, you don't need to have a hardware page table walker. You could
just ask the OS when you page fault, hey, could you resolve this for me and put it into the TLB
yourself? Right. And then the hardware would try again and look in the TLB and be like, oh,
it's here now, right? Right, exactly. It would resume the user space program. Of course,
you also have to check those access permissions for the page. So if the page is mapped as read only, and then you do a store on the CPU, then a fault still
needs to be raised in that case, because you don't have the access to write to it.
Right, right. And we're talking about this, and it sounds super complicated,
because there's multiple ways to implement all those ideas, right? Like different hardware does
differently. But it's even more complicated. Folks might know there's separate data and
instruction caches, right? For, you know, for the code that's running versus the data that it's manipulating. But you can remember that code itself lives in memory. So code is effectively data, even though it might be cached differently. So say you execute an arithmetic instruction, right? So an add, right? Like literally in the CPU, you do an add. Doing an
add might cause a memory access, right? So I was saying earlier, like a third of instructions
roughly cause memory accesses, but like, if you start thinking about how instructions are actually
data and you need to fetch them, well, any instruction could cause a memory access, which
is kind of another kind of mind bend right there. Because fetching the code from memory potentially
causes a page table walk. So that add could be like, you know, four, five, six loads or something like that.
And of course, on top of the instruction data caches, you might have instructions and data
TLBs, right?
So the TLB itself could be two completely different TLBs, for data and for instructions.
And that makes me think, like, I wonder which instruction can cause the most memory accesses
ever. Well, like, you know, as things on the internet happen, someone a while ago showed that
MMUs are Turing complete. You know, it breaks our fun right here, because if it's Turing complete,
just the MMU, like, you can have a whole Turing-complete program without any instructions
whatsoever, which is honestly kind of boring at that point. It's fun, right? But it breaks my fun question. But if you stuck to just simple instructions,
then how could you generate the most loads and stores
from a single instruction?
Well, there's scatter gather instructions
in the vector units, right?
And if that one misses, and accessing the instruction itself
causes a page table walk,
that might be, say, four or five accesses.
And then the scatter gather itself might touch like eight locations or 16 or something. And so
each of them could miss in the cache or something like that. So, you know, it'd be interesting to
see if people listening to us on the phone line or on the interwebs have ideas of how many accesses
could be done by one instruction. I'd love to hear what they think. Yeah, the AVX gather instruction has astounding amounts of memory-level parallelism. So it's
pretty cool to see as these new instructions get introduced, what it does in terms of questioning
the assumptions of what we usually do with scalar code. It's also really mind boggling to think
about the fact that stores that are happening on the data path can be storing into memory locations
that are cached inside of the instruction cache. If they need to be kept coherent, then immediately
the store needs to be observable by the instruction side of the machine. That's always kind of
mind-boggling. Whereas on some machines, you'll have to explicitly flush your instruction cache in order to perform non-coherent writes into your instruction memory.
There's also devices which can get a view of memory that's virtualized to help prevent malicious PCIe devices.
If I made an evil FPGA card and tried to plug it into somebody's machine, ideally there'd be some protection from it just scanning all your memory, you know, grabbing all your passwords or secret keys or whatever it would want to get at.
So we actually have this notion of IOMMUs, which is I/O memory management units, which can hold
virtualized page table understanding for these peripheral devices. But it's probably too much to talk
about in this episode to get a deep dive on IOMMUs. Yeah, definitely.
So back to our initial example.
So the virtual address was zero.
Some operating systems map the zero page without any permissions.
So it'll always cause a fault for every process.
So there's kind of a global bit that you can set on these entries to say,
this pertains to everybody.
Because we don't want to map anything in there, to catch bugs. Yeah,
it's so that, like, when you do mmap and you try to map a thing that is at zero, it'll just
say, no, you can't do that. It doesn't have to handle zero specially, because it's already there. I see,
gotcha. There's already an entry that ends up acting similarly to if the address wasn't
mapped at all; then the page table walk would fail saying, I don't see any entry here.
And that would cause a segmentation fault as well.
Yeah.
So what is a segmentation fault?
Yeah, so what's a segmentation fault?
That's a good question.
So the hardware does the walk
and then tells the OS like,
hey, I couldn't find anything
or that thing wasn't valid, right?
So what the hardware ends up doing:
it looks at the OS, and the OS has special registers that it set up, like when it started up, and it tells the hardware,
well, if this bad thing happens, like,
start executing here instead; there's a function that I want you to enter at this location.
Then the hardware, there's like an ABI, an application binary interface, between
the hardware and the operating system, where, for each type of fault, it really
looks like a C function call, where there are values the hardware leaves in registers
when it calls that interrupt handler, right? And it's exactly like a regular function call at that
point. And so the ABI defines, you know, which values should be in which registers. And from the
kernel's perspective, it just looks like a regular function if you look at
it. It just has to follow that ABI. You'd find that in the architecture handbook or something
like this of how to program this kind of machine, what's going to happen when some CPU exception
occurs. Exactly. And for this type of problem, for this type of trap, you would expect an enum
value, so an integer that tells you the kind of
problem that occurred, so say page fault, and then the address at which it occurred, and maybe, you
know, what else you were doing, like maybe which process ID you were in, or something like that.
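Something like this hypothetical sketch; the names and layout are made up, and the real thing is architecture-specific (on x86, for instance, a page fault delivers an error code and leaves the faulting address in CR2):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical shape of what the kernel's fault handler gets to look at. */
    enum fault_kind { FAULT_PAGE_FAULT, FAULT_DIVIDE_BY_ZERO, FAULT_ILLEGAL_INSN };

    struct fault_info {
      enum fault_kind kind;      /* which exception fired                        */
      uint64_t        fault_va;  /* the virtual address that caused it (0 here)  */
      uint64_t        ip;        /* instruction pointer at the time of the fault */
      bool            is_write;  /* was it a write access?                       */
    };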
And then the OS would figure out which process was executing based on that, right? So maybe it
remembers, hey, this is what I was running when I was trying to do something. And then the OS decides what to do, right?
So in some cases, you know, if the OS itself is the one that generated a page fault, maybe
it's like, there's a bug in me somewhere.
I didn't expect to do a page fault here.
So I'm just going to, you know, panic.
Could do that, right?
The whole machine would panic.
But if the OS itself was running in a hypervisor, then the hypervisor would get first dibs on
handling the hardware exception. And then it would maybe pass it on to the OS or something like that.
And then the OS might say, well, that wasn't my problem.
It was the program that was running this problem, so I'm going to pass on this problem to the program.
I'm going to tell the program there's a problem.
And that usually comes in as a signal, right?
Now, in a lot of programs, most user-based programs that people write,
I think people don't really handle signals, right?
So what'll happen is the OS will look at the program,
say, well, here's the default signal handler and here's what happened, which address.
And because the process won't have something,
a signal handler installed,
it'll just call the default handler.
And the default handler might be, you know,
log something like, thank you for playing, and then, like, exit or something.
It'll print out like, segfault happened at this location, and stop the program.
Just call abort.
So, yeah, that's what would happen with a page fault.
Right.
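A minimal sketch of the "install your own handler" path on a POSIX-ish system; printing from a signal handler isn't strictly async-signal-safe, but it's fine for a demo:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* SIGSEGV handler: with SA_SIGINFO we get the faulting address in si_addr. */
    static void on_segv(int sig, siginfo_t *info, void *ctx) {
      (void)sig; (void)ctx;
      fprintf(stderr, "segfault at %p, thanks for playing\n", info->si_addr);
      _exit(1);  /* a real runtime might unwind or longjmp instead of exiting */
    }

    int main(void) {
      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = on_segv;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      *(volatile char *)0 = 0;  /* the store from the episode title; faults here */
      return 0;
    }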
And if you were running in a debugger or running your program under a debugger, then the debugger would have installed signal handlers, and then it would report the issue to the developer, and it would go back to the debug
interpreter, right? Like the GDB prompt, it would say, yeah, SIGILL or SIGSEGV happened at this
location. Yeah, because the important thing to remember is a debugger itself, GDB or LLDB or
whatever, they themselves are a program, right? They're running your program in a kind of thin emulation layer,
but they're a program.
And so the way they do stuff is they talk to the kernel.
They say like, here's a signal handler.
Here's this thing, that thing.
Do it a bit differently for me.
And then they let the program pretend that everything's executing regularly
when actually they're running inside of the debugger, right?
But, you know, debuggers are pretty fancy.
They can do a lot of fancy stuff, but there's other programs that do this type of thing, installing signal handlers for a variety
of reasons. You know, we've both worked on JavaScript engines and the JavaScript engines
in browsers optimize things like WebAssembly by installing a signal handler to capture
invalid memory accesses. So let's dive into that a bit because it sounds a bit odd.
So when you have a
browser, just a web browser, it just runs untrusted code, right? Like the internet is full
of code and the browser can't trust any of it really, right? But the execution has to be secure.
But how do they do that? Well, JavaScript is a dynamic language, but it enforces certain
semantics that should prevent any kind of faulting or whatever from happening. But the way WebAssembly works is that it's supposed to be a virtual machine that allows you to execute kind of code that looks like C or C++ or Rust, like kind of low-level static code.
And the way it works is when you run that code, it creates what's called an instance.
And it has associated with it a contiguous slab of memory that's, in the current state of WebAssembly,
limited to four gigabytes, right?
Because Wasm, WebAssembly, is a 32-bit process model.
So that gives you a maximum amount of four gigs of memory.
Now, when you run a Wasm program,
the implementation itself,
what it'll do when it starts up your Wasm program
with its memory is it'll literally map four gigabytes plus a red zone virtually.
Now, that sounds huge. You go on a web page that maps four gigs, that sounds huge.
But I want to point out this is virtual. It's not physical memory.
So it occupies page table space to do that, but it doesn't use physical memory, right?
And the implementations then perform all memory accesses within WebAssembly as a base plus offset memory accesses, right?
So the base of that four gigs, whenever you're in the WebAssembly VM, that's what all the accesses within WebAssembly occur through.
Base plus offset for every single access.
And anything that wasn't allocated to
WebAssembly will cause a fault, right? So that four gigs is mapped virtually, but it's mapped
no read, no write, no execute, unless the user program has asked to map them. And so instead of
checking, like, have I mapped this every time in the VM to make sure it's safe, you can just let
the hardware tell you if it hasn't been mapped, because that's literally the hardware's job, right? So the way most WebAssembly VMs nowadays
implement WebAssembly is just by asking the hardware, let me know if this isn't mapped,
because it's a really fast path, right? The hardware is really good at that.
And what's interesting is it needs to be secure. So the WASM compiler has to be careful when you
do base plus offset that it doesn't go way past that four gigabytes.
So that's why we have the red zone, right?
Say it's like 128 megs or something
at the end of the four gigs.
And any access that the WebAssembly compiler
can prove might go past that.
So way past four gigs plus 128 or something like that.
It'll explicitly check,
is this going to go out of bounds of that 4 gigs region?
But anything inside of it, it'll just let it trap, right?
It'll let the hardware handle that.
And realistically, if you look at Wasm VMs, they can prove that most memory accesses are
within that bound anyways, right?
Because it's really rare that you have very large offsets for loads and stores.
So that's kind of neat.
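A rough sketch of that reservation trick on a Linux-flavored system; the sizes and function names are illustrative, not any particular engine's API:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define WASM_MEM_SIZE (4ULL << 30)    /* 4 GiB: the full 32-bit index space */
    #define GUARD_SIZE    (128ULL << 20)  /* red zone / guard region after it   */

    /* Reserve 4 GiB + guard with no permissions. Nothing is committed yet, so
       this costs address space and page-table bookkeeping, not physical memory. */
    uint8_t *reserve_wasm_memory(void) {
      void *base = mmap(NULL, WASM_MEM_SIZE + GUARD_SIZE, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      return base == MAP_FAILED ? NULL : (uint8_t *)base;
    }

    /* Growing the Wasm memory later means flipping committed pages to R/W:
         mprotect(base, committed_bytes, PROT_READ | PROT_WRITE);              */

    /* Every Wasm load/store is then just base + 32-bit offset, with no explicit
       bounds check: touching an uncommitted page faults, and the engine's
       signal handler turns that into a Wasm trap. */
    uint8_t load8(uint8_t *base, uint32_t offset) {
      return base[offset];
    }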
If the Wasm memory access faults, then the browser's signal handler checks whether it was one of the WASM memory pages. Because you remember, signal handlers are a whole process thing. So the process doesn't know I'm running WebAssembly code versus JavaScript code versus DOM code or whatever else. It's just, it's running code, right? So the browser has
installed a signal handler and the signal handler has to go and say, well, okay, let me see if that
was a WebAssembly memory access. Is it this four gigs, that four gig, the other four gig? And if so,
it'll do stack unwinding, as we talked about in the last episode, to unwind the stack out of
WebAssembly and throw a JavaScript exception, which is kind of neat, right? It throws a
JavaScript exception from a WebAssembly memory access that faulted in hardware.
Yeah, and that general technique is pretty cool and reusable. So for Java and null pointer
exceptions, instead of explicitly checking for null pointers when you're in the highest
mode of JIT code execution, when you're most optimized, an implementation can let most memory
accesses
just fault if they're null,
because it'll know that the offset that it's accessing
would still be within the null page.
And then it would just catch the faulty access.
And then it would resume execution
by throwing a Java exception
out of that null dereference signal handler,
which would be like a SIGSEGV or something.
Right, right, right.
So that's a super common technique
for virtual machines in general.
So it's pretty cool.
Well, that was really a lot that we talked about,
but I think we covered what a TLB hit is.
So that's cool.
But one last thing,
how about we quantify these numbers?
Because we're talking about hits and caches
and whatever and registers.
But I remember Jeff Dean gave a talk
where he had the following performance numbers.
Do you want to tell us about that a bit since you're really into that stuff?
Yeah, sure. I think a lot of the numbers are probably mostly similar to how they were when
this talk was given. So we kind of talked about an L1 cache reference that hits, that can be just
order like three cycles or something, depending on your processor. So if we scale everything to be
nanoseconds, that'll be like a fraction of a nanosecond, right? Usually you're running on like
a gigahertz processor, or a three gigahertz processor; then you'll have like 0.3 nanoseconds
per clock. So, you know, order a fraction of a nanosecond. Branch mispredict: usually you have
to flush a pipeline, and the pipelines can be reasonably deep these days. So you say something like 16 cycles.
So it's, you know, around five nanoseconds or something like this.
L2 cache reference, L2 cache is farther away.
So that'll be, you know, maybe 10 clocks or something.
Last level cache, maybe 30 clocks or something like this.
So, you know, tens of nanoseconds for L2s and L3s.
Mutex locking and unlocking, you have to go through a cache coherency protocol.
And, you know, it could be across NUMA if you're running on like a multi-socket machine.
But, you know, you'd kind of roughly say it could be hundreds of nanoseconds, or about 100 nanoseconds, to resolve this sort of thing.
Main memory references also can be order 100 nanoseconds. Then when we start talking to peripherals,
when we're traversing PCIe, PCIe might be something like 500 nanoseconds away. So that's
about everything that's inside of the local machine, let's say. Then there's things like
disk and going over the network and things like this. But these are kind of all of the close
times that you might have to deal with. And so you could imagine walking the page tables and
doing a bunch of main memory references.
If you have to walk four page table levels,
that might be four back-to-back 100 nanosecond accesses.
So that'd be like a 400 nanosecond load.
So loads can vary from a fraction of a nanosecond
to hundreds of nanoseconds, which is really interesting.
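Pulling those ballpark figures together in one place (all rough orders of magnitude from the discussion above, and machine-dependent):

    L1 cache hit              ~3 cycles, a fraction of a nanosecond
    branch mispredict         ~16 cycles, around 5 ns
    L2 cache hit              ~10 cycles
    last-level cache hit      ~30 cycles (so tens of ns for L2/L3)
    mutex lock/unlock         ~100 ns
    main memory reference     ~100 ns
    PCIe round trip           ~500 ns
    four-level page walk      up to 4 back-to-back memory references, ~400 ns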
It's kind of: loads are at the heart of computing,
these load-store machines, the von Neumann machine model, and the
loads are so varied in their performance characteristics. Right, right. And then you can go
all the way to, like, you know, when you start sending packets across the world, or, you know, to
satellites or whatever else. So maybe we should do something about that, talk about this type of stuff
in a future podcast. Yeah, sounds great. Everybody likes good performance numbers. Yeah, well, there's
a lot of stuff to talk about, so I think we should adjourn now and, you know, follow up on a subsequent podcast.
Sounds good. Catch you next time, my friend.