Hardware-Conscious Data Processing (ST 2024) - tele-TASK - CPU and Caching
Episode Date: April 23, 2024...
Transcript
Welcome to our next session on Hardware Conscious Data Processing.
We started, or Martin started, last week with the CPU architecture
and some caching, well, not so much caching.
And we'll go there in a minute.
But before we start, two announcements.
You've already seen this, so we have some code examples.
I'm going to try to polish a couple more, so there's not that much in there yet.
Whenever I have time, I'm going to clean things up and add more, so you can do more of these
little tests.
Although, this is really not hard, right?
So you can also try this yourself.
If you're curious about the performance of your laptop, do
this kind of stuff; that's actually really helpful to understand things and your system
performance. And the other thing: tonight, in our lecture series on HPI research, we will have
Prof. Arnrich talking about connected health. So if you ever wondered what connected health is and what the digital health guys are doing,
this is a good chance to get an insight into the research there.
If you're in the course, in the lecture series, well, it's anyway a good thing to go there and listen in.
So where we are is still at the CPU architecture and caching.
This will take us today, maybe even roll over a bit until tomorrow before we come to the instruction execution.
So this is also reflected here, right?
So we're in the CPU basics.
We'll talk about CPU instructions tomorrow and then start with the SIMD stuff next week.
Okay, with that, let me jump all the way to virtual memory.
Because that's where you left off last time, as far as I know.
So, who knows virtual memory?
Some degree, okay. So virtual memory is a construct, right? So if you're
programming something, your program basically doesn't have to worry about
how memory is laid out internally, right? So basically your program gets a contiguous address space
that you can work with. And in order to have this kind of idea of a contiguous address space where
you don't have to deal with other processes, etc., there's an abstraction called virtual memory in your system.
And this is basically dealt with by the CPU and the operating system.
So you don't really have to do anything about it.
And the idea is that you can have, like, your program can access more,
or all of the programs in combination can access more memory
than is actually available on
the system.
Also, even your own process can access more memory than is available on the
system, and whatever doesn't fit in memory, we're just going to
place on disk, for example.
Also depending on how you allocate memory, etc., this might be somewhere in the memory and you don't have necessarily a contiguous area.
So this is also abstracted away.
So as a programmer, you don't have to do, and as a process also,
you don't have to do the bookkeeping for the memory regions.
And the OS and the CPU are more flexible in basically
figuring out where to put the data, actually,
where to put these memory regions.
So each process has its own address space.
And this is, of course, too expensive
to do this per each individual byte,
so we'll do this in blocks.
So we have different memory blocks that then are mapped into a virtual address space. Or rather, every
process gets its virtual memory; this is split up into different blocks, which will be placed in RAM
and on disk, depending on what's currently needed. And this can be dynamically changed, so it doesn't
mean that you have one large virtual memory and some part is fixed to
disk, some part is fixed to memory; the OS has the liberty to basically move stuff around.
Also the CPU has the liberty to move stuff around while you're in ignorant bliss and just basically working away on your virtual memory. So that's the idea, right?
And just gives you basically an easy way to deal with memory. You don't have to deal with disk
necessarily, unless you explicitly want to deal with disk, right? Unless you explicitly want to store things on disk, this will basically keep just the memory part
abstracted away from you. Okay. As we said, we won't do this on an individual byte level.
I mean, we could do this at the level of individual virtual bytes, but then basically translating
these addresses would be quite costly, right?
Because you need multiple bytes in order to address something.
So then translating these addresses will add up quite quickly. So typically we will have pages in memory and on disk,
or virtual memory pages, that are all equal sized, and then we'll have one address for the page in memory or
on disk, and then we have kind of an offset within there.
And in typical systems, the pages will be, or your typical page size will be four kilobytes,
but it's also possible to work with huge pages, meaning that you will have larger contiguous
chunks in memory and on disk, which then means you don't need, like you don't have that much
fragmentation within the system.
And the memory then is split into these page frames, and within a page, so the page itself
in virtual memory and on disk, or in physical memory, let's forget about the disk,
so let's say we have everything in memory, actually.
An individual page will be laid out in exactly the same way.
So we have, say, four kilobyte pages.
Then our virtual page is four kilobytes
and has a certain order of bytes in there,
like all of the data in the virtual page.
And that will be an exact copy, basically, of the physical page.
But the virtual pages differ in how they are arranged in memory.
So we have like a virtual view on the pages.
The virtual pages will be contiguous.
But in the physical memory, they might be just all over the place,
right? Wherever there's currently space, this will be allocated. And then there needs to be a mapping.
And there's a memory management unit that translates these virtual addresses to physical
addresses, right? And then you can already see: we have individual byte addresses.
So some part of the address will be basically within the page.
Some part of the address will be an address to the page.
And this address to the page is the part that we need to translate.
So within our virtual address space, let me try this. Where do I have it? Here.
So within the virtual address space, we basically have the part within the page and we have the address to the
page. So we have an offset and a page address.
And so
we have consecutive page addresses and we have the offset.
And on current systems, the address is typically 64 bits, of which the topmost bits are unused.
Then, or let's start from the back, basically: we have an offset, which is the lowest 12 bits.
Then from bit 12, so the 13th bit, to 47,
so the next 36 bits, is basically the page address.
And yeah, that's basically the part of the address that, in virtual memory, will be
one of the consecutive page addresses, and this will be translated to the actual address
that points somewhere in physical memory, which also can change, right?
So this is again important.
If we have more memory or we need more memory than we actually have in the system, this
page will be evicted, will be somewhere on disk,
paged out, then we need to translate it to something
that points to a disk page.
And this can also change, right?
So the split between how much is offset
and how much is the page address is not fixed.
This needs to change based on the size of the page.
So meaning if we have four kilobytes,
then we can basically address this with these 12 bits.
So then we can address each individual byte with 12 bits.
If we have something like a huge page,
say two megabytes for example, then we need 21 bits for the offset, and this will be cut off
from the page address. So with that, the page address will basically be smaller and the
offset will be larger. The total part of what is used for addressing
will still remain the same, right?
So we'll still only use 48 bits in these systems,
but just the relation or the ratio will change.
And that means we will have fewer pages, but larger pages.
So all in all, we cannot address more memory.
We can just address memory in a different way.
So meaning we don't need as much page translation
if we have huge pages, while if we have smaller pages,
we're more flexible in how we lay out the pages in memory.
But we need more page translations.
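As a small illustration of this bit split (a minimal sketch with a made-up example address, not tied to any particular OS), here is how a 48-bit virtual address decomposes into page number and offset for 4 KB and 2 MB pages:

```cpp
#include <cstdint>
#include <cstdio>

// Splitting a 48-bit virtual address into virtual page number (VPN) and offset,
// once for 4 KiB pages (12 offset bits) and once for 2 MiB huge pages (21 offset bits).
int main() {
    const uint64_t vaddr = 0x00007f3a12345678ULL;             // hypothetical example address

    uint64_t offset_4k = vaddr & ((1ULL << 12) - 1);          // low 12 bits
    uint64_t vpn_4k    = (vaddr >> 12) & ((1ULL << 36) - 1);  // bits 12..47

    uint64_t offset_2m = vaddr & ((1ULL << 21) - 1);          // low 21 bits
    uint64_t vpn_2m    = (vaddr >> 21) & ((1ULL << 27) - 1);  // bits 21..47

    printf("4 KiB page: VPN = %#llx, offset = %#llx\n",
           (unsigned long long)vpn_4k, (unsigned long long)offset_4k);
    printf("2 MiB page: VPN = %#llx, offset = %#llx\n",
           (unsigned long long)vpn_2m, (unsigned long long)offset_2m);
}
```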
And these physical addresses of the virtual pages are stored in page tables.
And today, actually, it's not a single page table.
So on a modern x86 CPU, this will be four levels of page tables, each level indexed by
nine bits of the address. And this is basically how to organize this in a bit easier fashion rather than having
like one gigantic page table.
And a page table looks something like this.
So we have a validity flag and a physical address.
So we have our virtual address that basically gives us an offset
into the page table, where we will then find the physical address and of course the validity flag:
is this actually valid, or is this unused, for example. And these page tables can get large.
So if we, say, have 16 gigabytes of RAM with four-kilobyte pages and 64-bit, i.e. eight-byte, entries,
this basically gives us a 32-megabyte page table.
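Just as a quick sanity check of that arithmetic (a throwaway sketch, nothing more):

```cpp
#include <cstdio>

// Back-of-the-envelope page-table size from the lecture numbers:
// 16 GiB of RAM, 4 KiB pages, one 8-byte entry per page.
int main() {
    const unsigned long long ram_bytes   = 16ULL << 30;             // 16 GiB
    const unsigned long long page_bytes  = 4ULL << 10;              // 4 KiB
    const unsigned long long entry_bytes = 8;                        // 64-bit entry
    const unsigned long long entries     = ram_bytes / page_bytes;   // ~4 million pages
    const unsigned long long table_bytes = entries * entry_bytes;    // 32 MiB
    printf("%llu entries -> %llu MiB of page table\n", entries, table_bytes >> 20);
}
```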
All right, so you can basically calculate this yourself.
So we have eight-byte entries,
one for each four-kilobyte page. So then these get large. And if
they are so large, this also means they're not going to be in cache. So in the first-level cache,
we don't have 32 megabytes of first-level cache. So this always means:
if we want to do the address translation,
this will actually be costly, because, I mean, we might have it in a lower-
level cache, but if all the page tables are very large, this might actually
go all the way down to some kind of memory lookup.
And that's, of course, not good, right?
So I mean, we don't necessarily want to have two lookups just to find an individual page somewhere in memory or to read something from
memory, because that would mean we go through all the caches just to find an address and
then go again through the whole hierarchy just to find the actual byte or whatever we're looking for, the actual memory page.
So this means we're basically losing a lot of time if we only have the page table in memory.
So for that, there's actually an accelerator in hardware.
And that's called the translation look-aside buffer.
You might have come across this before, this TLB.
That actually exists in multiple forms in multiple parts of the hardware
just because we need this address translation everywhere.
If we have something on the GPU, for example, some memory,
or we have something in main memory, et cetera.
Whenever we need to do address translations,
we need these page tables.
And whenever we need these page tables,
they can get large, especially for large memory.
So it's going to be costly.
So we want to buffer this somehow.
And the translation look-aside buffer
implements this buffering in hardware.
So this means this is like a small buffer or small associative cache,
which is placed on the CPU in hardware in multiple cache levels.
So every cache will come with a translation look-aside buffer, where we will cache a certain amount
of these address mappings. So this looks something like this here. So we have
the cache on the side, where we have: is it valid, is it referenced, is it dirty. And then we have a tag. I'll explain this in a bit. This is basically the unique
identifier for each address. And we have the actual page
address that we're looking for, this translated part.
So we don't need to translate the offset.
We only need to translate the page address.
So this is something that we'll do. And as we'll see, this is super fast, right?
This is built in hardware, in SRAM.
So this means that within a single cache lookup,
basically while we're doing the cache lookup,
we can also translate this.
And if we have the data within this,
then there's no additional cost.
If we don't have the address translation in here,
then we need to actually go to the page table.
The page table might be in cache,
might be in a lower level cache,
might be even in memory, won't be on disk, right?
So this won't happen, but it might go all the way down to memory.
So this basically means if it's not here,
then we'll have to pay additional costs to do just the address translation.
As I said, it's an associative cache table.
I'll go into more details what that means, but it basically means we can place individual
cache lines into multiple lines in here.
So we're not, so there's the concept of a directly mapped cache where each individual
tag or address would fit exactly in one slot of our cache.
So this is basically if we're doing modulo, for example,
then we exactly know where it needs to fit.
And if we have an associative cache,
then we can place it anywhere and the cache will just look
it up everywhere at the same time.
So we'll compare with all of the different entries
at the same time and give back whatever fits.
And so the TLB stores the physical address for recently accessed virtual addresses.
And the virtual addresses, if the TLB is fully associative, will be compared
to all TLB entries at once.
And these will then be directly translated.
So this happens in hardware.
So this is basically within one operation,
and if it's not translated, then we have a miss,
then we have to go to the page table.
We do need to do the calculation with the page table.
So as an example, for an older Intel i7 CPU, we have separate TLBs.
So you remember there's two kinds of caches:
we have an instruction cache and a data cache on the first-level cache, typically.
And for the instruction TLB, we have 64 entries with an 8-way associativity.
Again, I'll explain this in a bit more detail; it means that we can put eight different entries
into the same set.
So we have grouping sets in there.
And we have a data TLB with 64 entries,
which is four-way associative.
And then on the second level cache,
so L2 cache, which is shared across instructions and data,
we have 1,024 entries, which is eight-way associative.
And say, for example, for the Mac, so for this one here,
the data-level cache... so this is all I could find out
when I checked: the data TLB on the level-one cache
has 256 entries.
And on the level-two cache, we have 3,072 entries, for example.
So basically if we have one of these addresses cached then we're going to be fast.
If we run out of this, so we don't have one of these entries in here, so we're looking
for something that's not cached, then we'll have to go to the next level of the cache,
or we might have to go to memory. And this is typically something
where we already see some performance degradation.
So typically, for each cache,
and also for memory,
we won't have enough entries
in order to fully store all of these translations.
So meaning basically for many accesses,
so if my data is larger than the cache,
for sure at a certain point,
I will also run into the trouble
of having to go to the next level
just for the page translation.
So I won't get a direct translation of the address.
If I have to go to a large part of the memory,
I first will have to go to the page table
and then will have to go to the actual entry.
And this also, of course, gets less of a problem
if we have huge pages.
So the larger our pages, the fewer entries
we need in these caches.
And with this, basically, we get fewer translation look-
aside buffer misses.
So that's one reason why we would want this,
so why we would want the larger pages. On the other hand,
of course, we have less flexibility in how we're accessing the memory.
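Just to make the mechanism concrete, here is a toy software model of that interplay (purely illustrative: a real TLB is a small associative hardware structure and a real page table is a multi-level tree, not hash maps):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model: a small direct-mapped "TLB" in front of a "page table".
// On a TLB hit the translation is essentially free; on a miss we walk the table.
struct ToyMMU {
    static constexpr int kPageBits = 12;   // 4 KiB pages
    static constexpr int kTlbSize  = 64;   // e.g. 64 entries, as in the i7 example

    struct TlbEntry { bool valid = false; uint64_t vpn = 0, pfn = 0; };
    TlbEntry tlb[kTlbSize];
    std::unordered_map<uint64_t, uint64_t> page_table;  // VPN -> physical frame number

    std::optional<uint64_t> translate(uint64_t vaddr, bool* tlb_hit) {
        uint64_t vpn    = vaddr >> kPageBits;
        uint64_t offset = vaddr & ((1ULL << kPageBits) - 1);
        TlbEntry& e = tlb[vpn % kTlbSize];                // direct-mapped for simplicity
        if (e.valid && e.vpn == vpn) {                    // TLB hit: no page-table walk needed
            *tlb_hit = true;
            return (e.pfn << kPageBits) | offset;
        }
        *tlb_hit = false;                                 // TLB miss: walk the "page table"
        auto it = page_table.find(vpn);
        if (it == page_table.end()) return std::nullopt;  // would be a page fault
        e = TlbEntry{true, vpn, it->second};              // fill the TLB entry for next time
        return (it->second << kPageBits) | offset;
    }
};
```

With huge pages, kPageBits would be 21 instead of 12, so each TLB entry covers 512 times as much memory, which is exactly the "fewer TLB misses" argument from above.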
Okay. So, with the concept of virtual memory, let's look at the caches actually, how these work in a bit more detail.
So the caches work with the principle of locality.
So the idea is that we often reuse data many times.
So if we have like a small computation,
we're counting students in this classroom,
then we have one variable that we will update many times.
So that's basically locality.
So we're using this over and over again.
So keeping this variable in the cache makes sense
because we're going to reuse it again.
And this is where caches really help. If, on the other hand, we have large amounts
of memory and we're doing just pointer chasing across that large amount of memory, then our caches
won't help much, because we're basically jumping around in the memory regions somewhere. So,
yeah, if our access pattern looks like this, then we're not gaining anything from caches.
However, there's a second part to that. So it's not only about the data, it's also about the code.
If we have like a small amount of code that we're reusing all the time. So we're running in some kind of small hot loop.
Then that means also this code will stay in the cache.
And this is why we have this instruction cache, right?
So often we don't really think about this.
If our code is very large and we're jumping in the code a lot,
so we're using many different libraries
and every individual statement in our code is a new library call,
that basically means we will jump around in memory for the different pieces of code, and we will have
to go to memory in order to load the next part of our library. But typically that's not
the case. Most of the time we're actually in some small part of the code, which we will execute over and over again, or there's at least some kind of locality. So with this, the cache will help
again. And there's two ways we can think about locality. So there's spatial locality, meaning
we're accessing data that's spatially close. So within a page, for example, or consecutive pages that are near to each
other, or code that's basically a small loop that we're reusing, and there's
temporal locality, so we reuse certain variables, right? So I'm updating the
count of students every now and then, or not every now and then. I'm updating this frequently. And I'm calling
the same function over and over again. This means it makes sense to
cache this. And this is a small
example where we're calculating exactly such a sum.
So we have a sum variable and then we have
a for loop where we go over all students and count them up.
And hopefully our array of students is laid out the way we go through this array of students.
Our array A is laid out contiguously in memory.
So that basically means we have the spatial locality.
We're going through the array one by one.
So every next access will be close to the previous access.
And all of the instructions in our code are accessed in sequence.
So there is no jumps in the code.
We're not jumping across large code bases back and forth
or calling some other kind of variables or functions in here.
And we have this temporal locality.
So the temporal locality here is the sum variable
is called over and over again.
So this is, I mean, there's no spatial locality
with the same variable, but it's called each and every iteration,
and the loop is called in every iteration, right?
So we also have spatial locality in the code.
And that's important, right?
So if we get this, then our caches are super efficient.
So basically this small code example, this kind of stuff,
this will all be in first level cache.
And the variables, so something like the sum, this will even reside in a register if this
is executed efficiently.
This won't need to go out of the register ever because not much else is going on.
So this kind of code can be super fast.
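A minimal sketch of the kind of loop described here (my own reconstruction, not the slide's exact code) looks like this; the array scan gives spatial locality and the sum variable gives temporal locality:

```cpp
#include <cstdint>
#include <vector>

// Sequential scan: stride-1 accesses (spatial locality), and `sum` is touched in
// every iteration (temporal locality); with optimization it typically stays in a register.
int64_t count_points(const std::vector<int32_t>& students) {
    int64_t sum = 0;
    for (size_t i = 0; i < students.size(); ++i) {
        sum += students[i];   // the next element is usually in the same cache line
    }
    return sum;
}
```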
If we convolute this with
jumping through large amounts of data because our data access is stupid, if we have many
kind of libraries around this which are called all the time, then all of a sudden this will
be much slower, although the same kind of computation is done. So it's really, are we using the cache well or not?
So let me give you an example.
I think we've seen that before.
So this is kind of this micro-benchmark example,
but we'll do this also now.
So we have a two-dimensional array,
or rather I'm simulating a two-dimensional array with a one-dimensional
array, so I don't have to deal with the actual memory layout that C or C++ gives me.
But I'm saying I have a row-major-order layout. So say the array is two-dimensional,
we have rows and columns, and the rows are stored contiguously, and the columns are basically
jumps across the array. If we actually used a two-dimensional array, then
you basically have to check what the programming language does. There are programming
languages or compilers that translate into column-major, meaning that columns will be laid out
contiguously in memory. And we do the same thing. So we have row-major order. We're basically
having the rows and then the columns, and we're accessing in a row based fashion or row major
order in this loop. So for every row, we go through all the columns.
So we're taking row one, and then we're jumping all the columns.
Then we take row two, we jump through all the columns, et cetera.
So that's one way of accessing this.
Or we do it the other way around, which is, in this case,
where we're accessing for each column, we're accessing all rows.
So we're accessing the first column of the first row,
then the first column of the second row, and so on.
So we're basically jumping across the array.
So it's not a stride-one pattern, but another kind of stride.
Depends on the number of columns in each row.
And if we do this, you will see that basically the execution
time for this is very different in these two loops,
although we do the exact same thing.
And this is only due to caching.
So this is only due to how we load the data from memory and
how then we can reuse whatever is in the caches. I mean only due to caching is a bit much.
There's a couple of things in here. So the prefetcher will help us, etc. So if we're smart, if we're using this,
there's a lot of hardware
and concepts that help us. If we don't, then we're basically
fighting against what the hardware actually wants to do. And so maybe let's try this briefly. So
I'll just show you what this looks like. So I have this here. Let me see. As you can see, this is basically exactly the same program.
So we have an array with, I can't fully read this, 100 million entries.
And we have 10,000 rows and 10,000 columns.
That's basically a square array.
And at least that's my starting point.
And then we go from that to more extreme setups where we basically say, okay, we have 10 rows and
1 million or 10 million columns, and vice versa. So we're going across these different layouts, right? And then we fill this with random numbers
and then basically do an initial run in order to make sure the caches are in a comparable state, right? So we have
kind of the same setup for both runs. And then here I have some extra code: I'm doing multiple iterations where I change the ratio of columns to rows.
So in the initial iteration, I basically say,
we have basically 10 rows.
And so this is basically power of 10.
We're starting with zero. so we're starting with 1.
We have 10 rows, and then 10 million columns,
and then in the next one will be 100 rows, and a million
columns, and so on and so forth.
And then we just do the same two sums.
One time we go column-wise, one time we start with row-wise, and we're
doing the kind of measurements, we're changing the sum to zero again, and in the end I'm
outputting all of the individual timings. And so one thing, I mean, the compiler will
try to optimize here and there.
And this is also why we need to basically have some outputs,
why we actually need to do some actual additions.
Because if we don't do this, if we're not using the results of the additions, the compiler
will say, well, this is dead code,
I'm going to clean this out.
And our program will be super fast.
Okay. And this, maybe let me show you this.
So I'm also going to do it on the command line
because there it actually makes some difference.
Let's clear this first.
Nope, can't read this here.
Okay, so... So now you should be able to see this somehow.
So this is the program.
I've compiled it with optimization flags
so the compiler can actually do something.
And first we need to initialize the array.
I mean, it takes a few seconds, basically.
So we're starting with the one where we have 10 rows
and 10 million columns,
and then 100 rows, 1,000 rows, 10,000 rows.
And now we're getting into the cases
where we're basically really jumping a lot into memory.
So with the column-wise order
or the column-oriented traversal,
if we have 10, if our rows have length of,
what is it, 10 million or something,
then we're jumping really far,
which basically means we actually have to go all the way
through the cache, through the memory
in order to find something.
So any kind of caching will never help us
because we're always jumping way too far.
So here you can see in the row-wise traversal,
we have a fairly stable throughput.
If the rows are very long, all of a sudden,
it somehow gets a bit slower.
But in the column-wise traversal,
if the rows are very long or we have this kind of very wide jumps in the memory, then let me see this.
So then here, the timing is actually much worse.
This is basically two orders of magnitude at least slower here.
And this is mainly due to caching, or not caching
in this case, because we always have
to go to memory with each and every individual access.
While in the row-wise traversal, we'll
still have to go to memory.
The data is too much.
This is a couple, I don't remember, a couple of gigabytes.
It doesn't fit into the caches.
So it is in memory. Basically, we have to pull it up
into the caches because, I mean, we can only work with data that's actually in the caches
eventually or actually in the end ends up in the registers. But this is happening for every individual access.
So this is why this is slow.
So this is the code.
You can basically check it out.
You can try it out yourself.
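For reference, a minimal, simplified sketch of the two traversal orders (my own reconstruction, not the exact course code from the repository) could look like this; remember to actually use the returned sums, e.g. print them, so the compiler cannot drop the loops as dead code:

```cpp
#include <cstdint>
#include <vector>

// ROWS x COLS matrix stored row-major in one flat vector, summed in two orders.
int64_t sum_row_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += a[r * cols + c];   // stride 1: walks each row contiguously
    return sum;
}

int64_t sum_column_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += a[r * cols + c];   // stride of `cols` elements: a new cache line (or page) almost every access
    return sum;
}
```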
So what happens actually here?
So on each individual memory access, the CPU will check if the respective cache line is already
cached.
So again, this is hardware.
So this happens without you doing anything or even without the OS doing anything.
So the CPU will just check, do I have this in cache?
If I have it in cache, nice.
Then basically we'll read the data from cache.
We don't need to go to any lower level memory
or any lower level cache.
We're just gonna use this.
It will be much faster.
We can continue with our computation right away.
If we have a cache miss, well, tough luck, right?
So we'll have to go to the next lower level in memory.
Maybe there's also a cache miss, so we
might have to go all the way down to memory.
Then we have to evict some kind of cached block
and replace it with the newly read cache line.
And the CPU will stall until the data becomes available.
If there's anything else to do, the CPU
can do some out-of-order execution.
It will try to do something else in the meantime.
But if we're only computing this kind of sum,
well, then the CPU just has to wait until the memory becomes
available.
And you see that then this takes a lot more time.
And these caches typically are inclusive.
This is also something important to note, right?
So basically, whatever is in level 1 cache
is also in level 2 cache.
But level 2 cache will just have a bit more.
So we can find this.
And so then if we don't have it in L1 cache,
we might find it in L2 cache.
So that's basically the hierarchy that we have to go through anyway.
Okay, and we've seen this in the example,
but we can also calculate this to some degree.
It's also logical, right, that there's a big difference between the cost of a cache hit and a cache miss.
And we've seen there's multiple levels at which we can hit the cache
or not.
So the first thing would even be just the address translation.
So if something is in our cache, well, then we don't really need to do the address translation.
We can work with this directly.
If we need to go through to the memory, so we don't have this, if we have to go to memory
and we don't even have the address translation there, then first, before actually going to
memory, we will have to load the page table.
So that means basically an additional cache miss.
So two cache misses, maybe even through the whole hierarchy for just a single access.
And the difference between having something in the first-level cache or in a register
and having to go to memory is a factor of 100.
So between having your data in the register
and going to memory, this for sure
is a factor of 100 or even more for a single memory access.
And the miss rate, you can basically
calculate the number of misses divided
by the number of accesses.
So that's also the 1 minus the hit rate.
And then we can calculate, so what's the hit time?
So how much time does it take to deliver a cache line
from the cache?
And there's a certain miss penalty,
which is basically how much does it cost us if we actually have a miss.
So then with this, with these two things, we can have an average time to access memory,
considering our workload, depending on the hit rate and the hit time or the miss rate and the
miss penalty plus the hit time. So if we have a 99% hit rate in our cache,
that's actually twice as good as for example,
a 97% hit rate.
So if we have a hit time of one cycle,
so if our data is in the first-level cache,
it will take us one cycle to read it into the register,
so a single cycle. If we go all the way down
to memory, it's 100 cycles, probably even a bit more. If we have a 97% hit rate, then we have one
cycle for the hit, plus (1 minus 0.97) times 100, so a 3% miss rate, basically.
And this means we have 1 plus 3, or 4 cycles,
on average for each individual access.
Because for every 97 hits, we basically have three misses.
So that means, in total, about four cycles on average for a memory access.
If we have a 99% hit rate, because of this factor of 100,
this means on average our memory access
will take two cycles.
So this is just something to keep in mind.
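Expressed as the usual average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty, the two cases from above work out like this (a tiny throwaway calculation with the lecture's numbers):

```cpp
#include <cstdio>

// Average memory access time: hit_time + miss_rate * miss_penalty.
double amat(double hit_rate, double hit_time, double miss_penalty) {
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main() {
    printf("97%% hit rate: %.1f cycles per access\n", amat(0.97, 1.0, 100.0)); // 1 + 0.03*100 = 4
    printf("99%% hit rate: %.1f cycles per access\n", amat(0.99, 1.0, 100.0)); // 1 + 0.01*100 = 2
}
```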
So even if we have very effective caches,
just because the difference is so large,
this actually costs us a
lot, right? And these are still, let's say, optimistic numbers, 100 clock cycles for an individual memory
access. Okay, so, yeah, let's talk about the cache internals. How does this work internally?
So the overhead of caching needs to be reasonable.
Also, we cannot use insane amounts of chip space for the caches.
This is also why you saw these caches are not really huge.
So in order to make this efficient, we are organizing the caches in cache lines.
And this is typically on most CPUs today 64 bytes.
On these ARM chips in M1, for example, it's 128 bytes.
And so there we're assuming that if we're accessing some kind of byte somewhere,
most likely we'll access more bytes around this.
And so this is why it actually makes sense to load more than just a single byte into the cache.
Again, it will be costly to do the addressing, etc., all the data movement.
So we're loading more than just a very tiny fraction into the cache.
And so there's no way to just access or load into the caches a single byte.
We'll always go with 64 bytes
or on this kind of machine with 128 bytes.
If we're only using a single byte or even less,
then the rest of the data transfer will be useless.
And practically, we're not going to be
able to move something faster by just moving less.
But the other way around, we might
be more efficient by using whatever we're moving
through the cache lines.
So if we can utilize the whole cache line,
that's basically cache efficiency.
So if the way we're accessing this data
is spatially local,
our caches will be more efficient,
and our program will be more efficient.
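A common way to see this cache-line effect in isolation is a stride experiment (a hedged micro-benchmark sketch, with the usual caveat that the prefetcher also plays a role): with a stride of 1 you use all 64 bytes of every loaded line, with a stride of 64 you pay a full cache-line transfer for a single byte.

```cpp
#include <cstdint>
#include <vector>

// Touch one byte every `stride` bytes. With stride 1 the whole 64-byte cache line
// is used; with stride 64 (or 128 on the M1) every access pulls in a new line for one byte.
int64_t touch_with_stride(const std::vector<uint8_t>& buf, size_t stride) {
    int64_t sum = 0;
    for (size_t i = 0; i < buf.size(); i += stride)
        sum += buf[i];
    return sum;   // use the result so the loop is not optimized away
}
```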
And the caches are organized in hierarchies,
so I think we've seen that to some degree.
So this is an older AMD processor, but it looks similar in other processors.
So we have the individual cores.
Then we have the level one caches, which are split into instruction and data caches.
Then level two caches, level three caches, main memory.
Some CPUs, like the one I have on the first slide, on the title slide,
might even have four levels of caches.
Some only have two levels of caches.
So this is basically architectural choices.
Depending on the difference of access speeds across different tiers,
it might make sense to add additional cache layers
or to leave some out.
So for the L1 cache, we have the separate caches with each
having 64 kilobytes.
L2 cache is shared with one megabyte.
And then the L1 hit latency might be two cycles, the L2 hit latency
in this case seven cycles, and the L2 miss latency is like 160 to 180 cycles. These are old numbers, but
they are not so far off; these ratios don't change that much. So, approximate numbers regarding these cache latencies:
for registers, we do this within the cycle, basically.
Level 1: 2 to 4 cycles, depending.
Level 2: 7 to 14, something like this.
And the translation look-aside buffer
has latencies that should be
hidden within the access of the cache.
But if we have to go further down,
then it basically gets more expensive.
So with this, an overview of caches up to here.
I'm going to talk about cache associativity in a bit, but do we have questions so far?
Yes?
Yes, perhaps I misheard that, but I think you said that the L1, 2, 3 caches are redundant,
so that everything that's in the L1 cache is also in the L2.
Yes.
Why is that redundancy?
It's basically, I mean, it doesn't really make a huge difference
if you add more, and we will basically always have to go through the layers anyway,
right?
So if you would split this up, the management
would get even much more expensive.
So we're just propagating data up the cache level.
So everything that you load will be in every cache level.
Only if it's in the utmost cache level evicted,
then it's basically missing there.
It's still on the next level.
So that's basically the way it works.
So we're not loading things first in the second,
like in L2 cache, and then assume maybe it
will be used or not.
But we're loading it in all of the caches,
and it will be evicted from the higher level caches earlier
because they are smaller.
Okay, thanks.
You're welcome.
Yes?
When the CPU tries to access some data based on the virtual address, I guess, does
it have some kind of table that dictates where in the cache this data might be or might not
be?
And to decide whether it has to go to main memory to fetch the data?
So the question is, how does the CPU
know where to look up if things are in the cache
or not in the cache or something?
That's basically the question, right?
It will just look it up, right?
So the CPU doesn't know what's in the cache
and what's not in the cache. The CPU will check if it's in the cache, and if it's
not, then it will go to the next level. And regarding the translation look-
aside buffer, it will actually, at the same time, check: do I have
this in the translation look-aside buffer, or do I need to read
this from the page tables?
It happens simultaneously.
If I have it in the translation look-aside buffer, I don't need to do anything anymore.
I'm just going to work with this.
So memory accesses will always go through the cache hierarchy first.
And if we don't find it, then we actually
need to do a memory read or memory load.
So let's continue.
And I'll quickly, well, not so quickly,
tell you about cache associativity.
That's more
because it's interesting also to know how these caches work and maybe explain
some of the behaviors of the caches. So the associativity of a cache
basically explains where an individual cache line can be placed in the cache. So a fully associative cache means that we can place a cache line
into anywhere in the cache.
So we have a certain number of cache lines that fit into the cache.
And the question is, where can I put it?
In a fully associative cache, I can put it anywhere.
And a directly mapped cache means it's not associative. It means that we
have only one location where each individual cache line can be placed. And you can think of this
basically as taking the address modulo the number of slots; then we know exactly where this needs to go,
right? So if we have five cache lines and our address is six, then this will go into position one in the
cache. And this means whatever was there before will need to be replaced. And then we have a set
associative cache and that kind of tries to combine the two. Fully associative cache means, well, we
need a lot of management in order
to check everything at the same time.
A directly mapped cache is much easier.
We know exactly where to look for a certain cache line
and where to place a certain cache line.
And in order to get kind of part of both worlds or something
in between, we have a set associative cache,
which is n-way associative, which means a block can be placed anywhere within a set, a set of n items
basically. So a four-way associative cache means we have four different places where each individual cache line can go.
And so we're not basically fighting
for the exact same place in the cache
with all of the addresses
which have the same modulo, for example,
but we can replace within the set in some kind of round-robin fashion
or something like this.
And next, the question is basically: how do we figure out where we would put this in the cache,
right? So if it's not fully associative, we have to somehow identify this. And for this,
we're splitting up the address again. So we have the offset. The offset means basically where within the page we find our address
or our data. And then we have the block address. As we already know, this is basically
or the page address. This means where do we have our individual page. And now this block address, or let's say for the cache line size,
basically the cache line address will again be split up
into a tag and to a set index.
So the set index means basically which of the sets are we going to.
If we have a directly mapped cache,
this means the whole block address is basically the set index.
If we have this kind of N-way associative cache,
then we need to identify which sets this goes to.
So this is the set index.
And then we still want to identify which cache line
do we actually mean, and that's then the tag.
And of course we could also do this the other way around, right?
We could have the set index first and then the tag later.
If we're doing it this way, then consecutive addresses will basically be placed in different parts of the cache,
in different sets.
And this means, if we're reading local data or we have some data locality in our code,
they won't kick each other out of the cache all the time.
If our set index were at the beginning here, right,
so if the set index were here, this would mean that if we're reading a number of consecutive addresses in memory,
then they would quickly all have the same set index again.
And even if we have a larger set index or a larger set,
they will basically kick each other out,
and we only use a smaller portion of the cache.
So that's why we have the set index further back in the address, next to the offset.
So hopefully, if we continue working with a continuous block of data,
then we stretch this out across a larger share of the cache.
Of course, it might also be stupid, right?
So if we're unlucky, then we have some kind of access pattern
where we always have the same kind of set index,
and we will again only be in a single set of the cache
and not reuse the whole cache.
But that's much more unlikely than the other way around.
Okay, and for this, so basically in order to address this, we have the set index,
we have the tag, which identifies the cache line, and then we have the data. So within each set, with the tag, we know exactly:
is this basically our cache line or not?
And if it is the cache line, then we have the actual cache line data in our cache.
Okay, if we have a fully associative cache,
then we need the whole cache line address basically to find the
correct cache line, and we can place the cache line in any
cache line slot of the cache, in any part of the cache. So this basically means we
can use any kind of block replacement strategy,
but it doesn't really scale, right?
So the idea is if we do this lookup, it's in hardware,
and we'll have to look it up,
like we want to do this simultaneously for all entries.
We don't want to basically have like an additional operation where we have to iterate
somehow through the cache and figure out whether this is somewhere or not. We want to directly know: if we
ask the cache whether this cache line is in here or not, this should be one single operation. And this can
be done in hardware, but if we do this across thousands of entries, then it will be expensive, because we need a lot of circuitry to build this.
So if we think about four megabyte cache, for example,
so this might be a level two cache kind of size,
then 64 byte cache lines, then we have 65,000 cache lines.
So comparing all of those in a single instance
means we need to do a lot of comparisons very quickly.
But for small caches, so very small caches,
this can still be done.
So for 100 entries or something like this,
or 64 entries, as we saw earlier,
something like this, we can build this in hardware.
So this can be directly translated.
Then we have the idea of a non-associative, i.e. directly mapped, cache.
So this basically means, as I said, we have something like a modulo operation.
And for each cache line, based on the block address, we know exactly where it needs to go.
So our cache has a certain size.
We're doing modulo,
and we put the data exactly there.
If there's another cache line that has the same modulo
address, that basically means it will replace
exactly this cache line.
Again, this might work well, right?
Because consecutive addresses
will basically be placed in new cache lines.
So basically if I'm caching line 12 here,
that will go into cache slot four,
and line 13 will go into slot five, right?
So this works for locality,
but if I have a stupid pattern
where I'm basically hopping somewhere,
this might again be not good
because I'm replacing whatever I cached earlier before
all the time, right?
And then the way the lookup works:
we have this cache line, we have the tag,
which is basically the part that's not the offset, that's basically unique for each cache
line within our cache. And with that, we have the data to compare against. If we have a match, we have a success and directly have the data in our CPU.
If we don't, then we basically have to request the data again.
So then, or basically, it's a single operation, right?
So we check, is the tag correct?
If it's a success, yes.
Then we'll use the data.
If it's not a success, then we still get the data,
but we're not going to read it.
We're not going to use it.
And we'll have to go to memory or to the next level cache
basically.
Bit more complicated is then the set associative.
I mean, just more complicated in terms of circuitry.
So then we basically say we have these sets.
And within these sets, we can basically place our cache line anywhere.
And I explained the properties already.
In terms of circuitry, this basically means we're comparing with many cache lines,
so the complete set basically at the same time.
So we're loading all of the tags.
Basically, we compare all of the tags in hardware with the tag that we're requesting, out of the set.
Say two-way associative here:
this would mean we have two of these comparison instances;
this here would be four-way associative.
So we take the tag, and we check against all of the stored tags:
is one of the tags the right tag?
If yes, then basically this tag will be going
into the multiplexer or demultiplexer.
And if we have success, we'll at the same time also have to request the data from the cache.
And we can directly work with it.
If it's not a success, we'll have to load.
And while loading, of course, we remember we'll then also have to replace whatever was before in the cache or at least one cache line.
And that will go into one of these.
And there's different ways we can replace.
I'll talk about this in a bit.
And so let me give you an example.
Again, older Intel core, so we have a total cache size of 4 megabytes.
I don't remember which should be probably level 2 cache.
The cache line size is 64 bytes.
We have a 6-bit offset in order to address these 64 bytes; within a cache line, we just have the offset, nothing more to address.
We have 65,000 cache lines in total, because we're splitting
the 4-megabyte cache into these individual cache lines. So, 65,000 entries that we can have in the cache.
And the cache is 16-way set associative. So, we divide the 65,000 by 16, and we have 4,096
individual sets of each 16 cache lines.
And for that, we basically need a 12-bit set index. So in order to uniquely identify each
of the sets, we have the 12-bit set index,
which leaves us with an 18-bit tag that we will use,
within a set, to uniquely identify each of the cache lines.
So, in total, we only need 36-bit addresses in this case, so we can have a 64-gigabyte address space.
In this CPU; other CPUs might have more.
Okay, so the tag is the unique identifier within the set,
and the set index basically gives us the right index.
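To make that split tangible, here is a tiny sketch that decomposes an arbitrary, hypothetical physical address with exactly these parameters: 64-byte lines, 4,096 sets, 16 ways:

```cpp
#include <cstdint>
#include <cstdio>

// Address split for the example above: 4 MiB, 16-way set-associative, 64-byte lines
// -> 6 offset bits, 12 set-index bits (4096 sets), and the rest is the tag.
int main() {
    const uint64_t addr = 0x3f2a4c0ULL;       // hypothetical address

    uint64_t offset = addr & 0x3F;            // low 6 bits: byte within the cache line
    uint64_t set    = (addr >> 6) & 0xFFF;    // next 12 bits: which of the 4096 sets
    uint64_t tag    = addr >> 18;             // remaining bits: compared against the 16 tags in that set

    printf("offset=%llu set=%llu tag=%#llx\n",
           (unsigned long long)offset, (unsigned long long)set, (unsigned long long)tag);
}
```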
And this is the same kind of hierarchy
that we saw earlier with the virtual memory,
but it's differently spaced again, right?
So the virtual memory, we have the page size,
or the virtual memory page size as an offset.
Here we have the cache line as an offset.
So it's the same kind of addresses,
but we're splitting it up in different ways.
And there's different way, or this is basically,
if we're doing like a compilation workloads or some kind of GCC workload,
then we can see that the associativity actually helps us by reducing the cache misses, especially for smaller caches.
So we have a smaller cache and we have these certain kind of patterns in our code,
then associativity will somehow reduce the effect of these patterns.
So then we get less cache misses by having more associativity
because we have more flexible page replacement.
Or block replacement or cache line replacement. Or block replacement or
cache line replacement.
Whenever we
pull in a cache line,
we have to replace an existing
cache line, and there's different
ways of doing this.
A common way of
buffer management also is least
recently used.
Whatever hasn't been used for the longest time will have to go.
That takes some time to manage this, right?
So it's a bit more complicated.
You can do this by ordering,
but you don't want to copy data in the cache around all the time.
This will be costly.
So then another way would be first in, first out, or FIFO.
So basically, we're just checking whichever
was added in first or where we added the last time.
So we'll just go round robin, basically, replacing.
This often behaves similar to LRU
and is much easier to implement.
In some cases, it might not be great.
But still, in hardware with small caches,
might still be good enough.
Another idea is randomness.
Often, actually, it works quite well,
especially if we're not super small
or we don't have very clear patterns
that we could do much better with other replacement strategies.
And that's also simple to implement in hardware, at least pseudo-randomness.
And then another approximation might be not-most-recently-used.
So that's kind of also an approximation.
So we're just keeping in mind what we used last.
And anything other than that we can replace.
So we're just not replacing the one that we actually just
accessed.
So that might also be a good idea.
But this needs to be done in hardware and needs to be fast.
So we also don't know what actually is happening,
because that's the CPU.
So there might be an option, or you
could think of an experiment how you can figure this out,
if you know the set associativity, et cetera.
So you can basically tweak things maybe to get some ideas about the replacement, but
the hardware vendors will probably use something that's very easy to implement and gives you
a good way of doing this.
Or basically it's a good trade-off of performance
versus number of circuits you need for this
and overhead that it generates while doing the replacement.
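As a purely illustrative toy (the actual policies in hardware are undocumented, as said above), the FIFO/round-robin idea within one set can be modeled like this:

```cpp
#include <array>
#include <cstdint>

// Toy model of round-robin (FIFO-like) replacement inside one 4-way set.
struct FourWaySet {
    std::array<uint64_t, 4> tags{};   // stored tags
    std::array<bool, 4> valid{};      // validity flags
    int next_victim = 0;              // round-robin replacement pointer

    // Returns true on a hit; on a miss, installs the tag into the next victim slot.
    bool access(uint64_t tag) {
        for (int i = 0; i < 4; ++i)
            if (valid[i] && tags[i] == tag) return true;
        tags[next_victim]  = tag;
        valid[next_victim] = true;
        next_victim = (next_victim + 1) % 4;   // advance regardless of what was used recently
        return false;
    }
};
```

An LRU variant would instead track per-slot recency on every hit, which is exactly the extra bookkeeping the lecture mentions.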
Okay, so, so far, we've only talked about reads, right?
So, we've only talked about what do we do
if we want to access the data, read the data, right?
So, we've not really talked about writes.
And there, the caches actually give us some extra level of thinking about what we need to do. And especially, I mean, there's two things you have to consider here, right?
So we might have a multi-threaded, or that's the main thing actually,
we might have a multi-threaded program.
So all of a sudden, I might have stuff in my individual cache,
but some other CPU also wants to access this.
So then this situation I need to handle.
So I have to think about what do I do in a write?
How will this be propagated into memory eventually again?
Also, just in general, I have to think about what happens if I write something.
Does it stay in cache or does it go all the way back to memory?
And there's basically both ways or an option.
So there's write through means we're really
just directly writing the data all the way down
to the memory, also into the cache, of course.
We're not gonna just flush it out of the cache
just because we've written it,
because we might just use it again.
But we also want to have whatever we did in memory.
So that means we'll go all the way through until we hit memory
in order to basically make sure that the data is written.
This makes data coherency much easier,
because we'll always have the right data, the correct data
in memory, but it's also slow, because it basically
stalls the CPU.
We might use something like a write buffer, so we're buffering whatever happens and then
combine writes somehow, but it's the slower way to go.
And the other way to go is basically to say we're writing back, meaning we're writing into the
cache, and we somehow have to mark this, right? So we're marking that this cache line that
we've written to is dirty now. That means what we have in cache is different from what is in
memory, and that also means eventually this will have to be cleaned up again somehow.
So we have something in cache that deviates from whatever we have in memory.
With this, we reduce the amount of traffic to the lower level memory.
But we need to somehow do something whenever we evict such a cache line. We need to make sure that it's actually written back into a memory
through all of the cache hierarchies, right?
And whenever somebody else wants to access the same cache line,
we also need to make sure that they see whatever we've updated
and now work on the same cache line with something else.
But we'll come to this later when it comes to NUMA,
so more or not only NUMA, to parallelism whenever
we have to think about what happens if multiple threads
access the same kind of data.
And modern processors, because of the lower traffic
and higher efficiency, implement write-back today.
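Conceptually, the dirty-bit bookkeeping per cache line looks roughly like this (an illustration of the idea, not any particular CPU's implementation):

```cpp
#include <cstdint>

// Write-back behaviour of a single cache line, conceptually.
struct CacheLine {
    uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;      // set on write: the line now differs from memory
    uint8_t data[64] = {};
};

// Write hit: update only the cached copy and mark it dirty; memory is not touched.
void write_byte(CacheLine& line, unsigned offset, uint8_t value) {
    line.data[offset % 64] = value;
    line.dirty = true;
}

// On eviction, a dirty line must first be written back to the next level / memory.
bool needs_writeback_on_evict(const CacheLine& line) {
    return line.valid && line.dirty;
}
```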
Okay.
So with this, let me give you two more examples
of what caches actually look like.
So this is an older, we've had this before, older CPU.
So you see what this is actually on the chip, right? So we see there are six cores
on the chip. We have a shared level three cache and this is two times two centimeters,
for example, right? So this is basically, we have the six individual cores. Each of these cores will
have level one and level two cache, two level one caches, one for instruction, one for data,
then the level 2 cache,
and then we have this large shared level 3 cache,
which is accessed by all of the cores,
where they basically can all read.
And then we have the memory controller,
so basically our access from the core will be to level 1 cache,
level 2 cache, level 3 cache.
Still not there?
Well, memory controller, let's go to the memory.
And this will be outside of the core in the DIMM, then.
Yes?
What happens to the two dead cores that are still visible on the graphic?
So the question is what happens to these cores or that core.
So this is basically, I actually don't know if these are cores
or other kind of helper structures on this kind of chip.
There is different ways.
You could use this for other kind of, let's say, interconnect
or something
like helper structures.
What this actually is in here, I don't know.
These might be extra cores also.
And we're only using six of those.
So sometimes, CPU vendors also say, OK,
this die didn't really work that well.
So we have a small error here.
We're just going to switch off this core
and resell it as a five-core CPU, right?
I mean, not five cores; typically it would be like four,
eight, 16, something like this.
And the larger the dies get, the more error-prone they are.
So then for the high-core-count CPUs, everything
needs to be correct.
For the lower or cheaper ones, we switch off
whatever is incorrect on these cores.
So that's one way to deal with this to some degree.
And some manufacturers also build very large dies,
and then whenever there's something broken on there
or larger defects, you have to trash it.
Then you basically, there's a certain amount of chips
that are not going to work because of that.
And now, to see a more modern die,
so this is, well, it's not even super modern, right?
So there's the M3 nowadays, which, I mean, doesn't change much.
It's just more of the same kind of die replicated.
But this is what this machine has.
So you can see there's two kinds of cores.
So we have the performance cores here, So we have the performance cores here and we have the
efficiency cores here. Both have an L1 and an L2 cache. So we have the L1 cache that's, I mean,
there I can only assume, so this might be the cache structure and this might be the ALU, etc. functional units. I don't know, right?
So, and then we have the shared cache. So, this is the shared L2 cache for the efficiency cores
and then we have a shared system level cache. So, that's basically an L3 cache, but it's not only between the cores, the compute cores,
but it's also across the GPU cores and the neural engines
or inference cores basically here.
And then we have also memory controllers.
Right here, you can see the memory controllers.
These are basically memory channels
where then these would go basically connect to the individual DIMMs and then we'll have some PCI
Express here somewhere and some other helper structures. So this is
how this basically ends up laid out, and you can see that here we have a fairly large
level one instruction cache. So this basically means if this CPU has to work with more
complicated code, it will be more efficient than something
whoops, I don't have the numbers here, but something
where we only have like a 64 kilobyte L1 instruction cache.
So here we can basically have three times as much code within the
level one instruction cache. And also larger level one data cache. So this is basically where they
made some decisions in order to have a better cache and efficiency up here. And then 12 megabytes shared cache and a 16 megabyte system level cache,
which is shared across all of the individual units here.
Okay, so these are numbers on this chip.
So we've implemented a pointer chasing benchmark at some point.
And well, basically, we take some amount of memory, from one kilobyte up to one gigabyte, and we're doing
pointer chasing in it. So we have random pointers across the whole
address space within this buffer, pointers that
just point to a next address: we're looking up
the pointer and then go to whatever address is within this pointer, right? So we just do
random hops in this memory. If we do this in one kilobyte, that will be in first-level cache, right?
So then we can see each individual access will be one nanosecond. So this is fairly fast.
I mean, we load the pointer.
We translate it to an address.
And we go to the next step.
Then for 8 kilobytes, that's still level 1 cache.
That's still one nanosecond.
If we go 64 kilobytes, let me see, what's level one cache?
64 kilobytes is still within level one cache.
So we have 128 kilobytes level one cache.
So we're still in one nanosecond.
As soon as we're moving out of the level one cache,
right, so we're hitting level two cache, this will be at 256 kilobytes. This is twice as much
as what fits into the level one cache, and we will see, all of a sudden, oops, now we're at three
nanoseconds, right? So this is basically first level of the cache hierarchy missed. So there we need to go to the next level.
Then we have one megabyte.
It basically hits us more, right?
So we said the L2 cache,
the shared L2 cache, is 12 megabytes
for the performance cores.
So up to 16, or rather eight megabytes, we're still in level two.
So we're seeing a larger portion of accesses
being in level two than in level one.
So we said, let me see again,
128 kilobytes of level one cache.
So some of these accesses, roughly one eighth of them in the 1 megabyte case,
will still be in level 1 cache.
So they still get 1 nanosecond.
The remaining seven eighths of the 1 megabyte
will go to the level 2 cache.
With 8 megabytes, we're completely in level two cache.
But still, some of them, of course, are level one cache.
16 megabytes, we're hitting, like we're
going out of level two cache into level three cache.
And then once we're at 64 megabytes,
this is even beyond the shared cache.
So there many of the accesses will then already go to memory and there we can see, well, this
is already this factor of 100 with a gigabyte, we're getting close to almost all of the accesses
being in main memory.
So you can see we have this factor of 120 or so in comparison to whatever we have in the cache,
the highest level cache and the lowest level cache.
And if we want it, we could basically be even a bit more fine-grained
and at a certain point within these, basically probably around here, we could see where we start missing TLB,
or where we see TLB misses, basically.
So the TLB, level 1 TLB, will have 256 entries.
So we cannot calculate quickly enough
where we would see this, but probably already somewhere here.
And level 2 TLB has 3,000 entries, so this would also be somewhere soon.
Because basically that means whenever we're hopping to memory.
So we're not using cache in any way smartly, right, because we're jumping far in memory. So this is where this would give us
yet another small dent in our performance just because of the TLB misses.
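The core of such a pointer-chasing loop can be sketched like this (my own minimal version, not the course benchmark itself); the key is that every load depends on the previous one, so the latency of whichever level holds the buffer shows up directly:

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build one random cycle over the buffer (Sattolo's algorithm), then follow it.
// Timing `hops` iterations divided by `hops` gives the access latency of the
// level the buffer fits into (L1 ~1 ns here, main memory roughly 100x that).
size_t chase(size_t buffer_elems, size_t hops) {
    if (buffer_elems < 2) return 0;
    std::vector<size_t> next(buffer_elems);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (size_t i = buffer_elems - 1; i > 0; --i) {           // Sattolo: a single cycle
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    size_t pos = 0;
    for (size_t i = 0; i < hops; ++i)
        pos = next[pos];                                       // dependent loads, no prefetching possible
    return pos;                                                // keep the result alive
}
```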
Okay, so of course the TLB also needs to interrelate or work with the cache.
And there's basically two ways, again, to do this.
Either we're caching physical addresses;
then we need to translate the address before the lookup,
so we basically
should at least have
all of the addresses that we want to look up in the cache in our TLB. Or we're using the virtual
addresses in the cache. Then we basically need to figure out how to deal with multiple accesses
from different processes
to the same virtual, or rather to the same physical, address,
for example.
So then we might have these different virtual addresses
there and there's different ways how this can be implemented.
It's just something to be slightly aware of
if you want to dig deeper into this kind
of hardware situations.
Okay, so I'm slightly running out of time, so let me maybe just give you a hint here and then continue from there next time.
So now there's different kind of benchmarks for this kind of stuff. This is spec-int,
so one of the typical CPU benchmarks
where you basically measure how fast is your CPU.
And here you can see for different kind of workloads.
So these spec benchmarks are typically a suite
of different kind of tools where then you gzip something,
you compile something with GCC, et cetera.
So it's really just runs different kind of programs
and checks how fast does your CPU does it.
And then you can see, well, there
are certain kind of instruction cache misses and L2 cache
misses.
And one of these programs is really bad in terms of L2 cache.
This probably does lots of large jumps in memory
or these kind of pointer chasing things that I showed you.
But all in all, you see kind of number
of misses per 1,000 instructions.
So this would be basically, for most of these things,
would be in the 99% or even higher cache hit ratio. And now if we're looking at a typical database benchmark,
then this is TPCC, so an OLTP benchmark,
where we have a smallish but reasonably large enough,
definitely larger than caches, data set,
and we're doing fast transactions over this.
So we're doing many small updates and reads in there and you can see that our instruction cache
is actually really bad, right? So we're there and in comparison it's much worse than all of the other
programs or all of the other things used there. And the L2 cache is also not great, but it's
okay-ish. But the instruction cache, that's actually bad, right? Because basically the CPU
continues to just load data in or just load instructions in, right? So, loads new code in order to do whatever we want to do. And, well, we'll
look at why this is actually tomorrow, right? So, this is basically, you see that traditional
databases are not good when it comes to cache efficiency here, especially instruction cache.
So, as a teaser for tomorrow, we'll identify where this comes from
and then look into what can we do about it.
Questions? No questions?
Well, then thank you very much and see you tomorrow.