Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching (2)
Episode Date: May 2, 2023...
Transcript
Welcome to our next session.
Today we're going to continue talking about the CPU
architecture and caching.
And quick announcement, I also added some of the code examples
that I showed you so far to our GitHub.
So you can basically also try this on your laptops
if you want to see what the performance of your CPU
is or also the characteristics of your CPU.
I've updated the slides, so you have the link.
I'll probably also post it in Moodle somewhere.
And I'm going to add more code examples as we go.
It always takes me some time to clean them up, et cetera,
but I'll try to add more as we go.
We're still where we are,
but I'm already a bit behind schedule.
So I'm shifting things a bit.
So we're going to start with the task next week
on Wednesday rather than on Tuesday.
And it's all fine.
Just so you get some SIMD information already
before we start with the task.
And yeah, otherwise everything shifted a bit, but that should be fine.
So we have enough time, as we said.
And with that, let me actually dive right back
into where we stopped.
So if you remember, we talked about the memory hierarchy
and mainly about caches, and how the different kinds of memory
technologies work. Today we're going to continue with how these are
actually organized and how they are presented to the programmer.
Because of course, this is kind of fine if you're a proficient programmer, you can deal with the different cache levels,
but it's also a hassle. In the end, we don't really want to have to directly address
the different levels of caches. And so this is why the CPU, together with the OS, does this for us.
And for this, there's something called virtual memory.
So rather than having to think about all
these different levels from a programming perspective,
we try to give a contiguous address space to the programmer.
Or not we, but the system, the CPU and the OS,
give a contiguous address space to the programmer.
And this is called virtual memory.
And this is also good because often programs actually
need more memory than there is physically available.
So typically, there's a certain amount of memory
that the program can address or reserve, but
then you might have multiple programs or processes running at the same time, and
in the end you have so much stuff on your machine that it doesn't all
fit into your memory. Technically, if you had to deal
with this yourself, you would have to know: is this in memory?
Is this somewhere in the cache? Is this on disk somewhere?
And you would have to change the addresses
and the addressing based on this.
And since you don't want to do this as a programmer,
there's virtual memory,
which basically gives you a contiguous address space
that, from an address space perspective, doesn't care where in memory or where
on disk the data is actually residing.
So per process, we get virtual memory, which is contiguous, and then the OS
and the CPU basically map this to different physical locations. So this can
actually be in memory, or it could even be on disk, and as a programmer we don't
know. We just have to deal with this one big address space.
But of course, if we did this on a per-byte level, this would mean
a lot of bookkeeping: we would need to do the address translation for each
individual memory address. So rather than doing this, we split memory up into typically equal-sized pages.
On most processors and instruction sets, these are 4 kilobytes,
but there are also different page sizes.
Intel or x86, for example, also supports huge pages, and all typical
processors today do this. So pages can be 2 megabytes or even up to gigabytes, depending
on the architecture. And these pages typically have a size that is a power of 2.
So if the page size is 2 to the power of k,
then we can address all of the bytes within a page with k bits.
For 4-kilobyte pages, for example, this would be 12 bits, right?
So with 12 bits, we can completely address
all bytes within 4 kilobytes.
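To make this concrete, here is a minimal sketch (my own illustration, assuming 4-kilobyte pages and a hypothetical address) of how an address splits into a page number and a 12-bit offset:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Assuming 4 KiB pages: 2^12 bytes, so the offset needs 12 bits.
    constexpr uint64_t kPageBits = 12;
    constexpr uint64_t kPageSize = 1ULL << kPageBits;  // 4096

    uint64_t virtual_address = 0x7f3a12345678ULL;      // hypothetical address

    uint64_t offset      = virtual_address & (kPageSize - 1);  // low 12 bits
    uint64_t page_number = virtual_address >> kPageBits;       // remaining bits

    printf("page number: %llu, offset within page: %llu\n",
           (unsigned long long)page_number, (unsigned long long)offset);
    return 0;
}
```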
And physical memory is then split up into page frames.
In the virtual address, we're basically storing
the virtual page address
and then the offset within the page.
So the layout within a single page will be exactly the same whether it is on disk or in memory.
So the virtual page is an exact copy of the physical page,
but where exactly the physical page lies
depends on the OS,
on whether it is currently in memory,
whether it is paged out to disk, et cetera.
And because this needs to be fast:
now we have these virtual addresses,
or this virtual address space, and we need to
find out quickly where exactly our pages are, how we
translate a virtual page into a physical page. For that, there's a special
unit on the CPU dealing with this. And as I said, pages are typically 4 kilobytes;
on a Mac, for example, they are 16 kilobytes. And you can actually find this out if you're
curious, so I can also show this. Let me see if I still have this open.
It's probably too small to see; I did some bad compilation earlier.
So if you run getconf PAGE_SIZE, then this will basically give you your page
size, and you can see that on a Mac this would be 16 kilobytes, for example.
Okay, so virtual memory, individual pages, in my case
16 kilobytes, if you have an Intel laptop 4 kilobytes, and these are placed somewhere in memory.
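If you want to query the page size from your own code rather than with getconf, a minimal sketch for a POSIX system (my illustration, not from the lecture's GitHub examples) could look like this:

```cpp
#include <unistd.h>  // sysconf
#include <cstdio>

int main() {
    // POSIX call: returns the page size in bytes (4096 on most x86 systems,
    // 16384 on Apple Silicon Macs, for example).
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}
```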
Now, how does the translation work? OK.
So we said that within a page, the layout is the same physically as virtually.
So whatever we store within a page will be stored in exactly the same way.
So there we can use a certain offset.
For 4 kilobytes, we need 12 bits,
and these will be the least significant bits of the address.
So the last 12 bits will be the offset
that directly translates to how the page is laid out.
And now, if we have 16-kilobyte pages,
what would be our offset? We need 12 bits to address 4 kilobytes, so 14 bits,
exactly: 12 for 4 kilobytes, 13 for 8 kilobytes, 14 for 16 kilobytes.
And if you have huge pages, for example, so we also said we could have
larger pages, then we need even more.
For 2 megabytes, for example, we would need 21 bits:
1 megabyte would be 20 bits, 2 megabytes 21 bits.
And that will be the offset, the last part of the address, and
this we never really need to change, because we're just going to use
it as-is within the page itself. However, then we have the virtual page
address part, which translates to the actual physical address, which
points to wherever the physical page is
lying right now.
And this can be in memory, but could also be on disk, right?
So this will be transparent.
And think about it:
if you have multiple processes
and the set of processes keeps changing,
then a new process comes in,
we have to load some data, some pages,
and some other pages need to go back
to disk. Then we have to change these physical addresses while
keeping the same virtual addresses for the process. So the
process won't see any change in its addresses, but under the hood
the OS and the CPU make sure that there's always enough memory for all
of the processes currently running,
and the pages that are not needed right now
can be swapped out and removed.
So now we have this translation:
we have the virtual address that
needs to be translated to the physical page address.
The offset remains the same, as we said.
For this, we're using a page table.
And this is actually just a simple data structure that says:
this virtual address is translated to this physical address.
And this physical address would then say: this address in memory, for example, or this address
on disk.
So this way, we can easily find it.
And here, I'm showing you, or I will explain, a single page table.
In modern processors, there are multiple levels of page tables.
In current x86 processors,
there are four levels of page tables,
each of them indexed with 9 bits.
So all in all, that's 36 bits of page number
that we can use for addressing,
i.e., a maximum of 2^36 individual pages.
And on top of this comes the offset.
And this also gives you the maximum amount
of memory addressable within a process.
And so this is also why, even though we have 64-bit addresses,
a certain space up front, the most significant bits,
is actually not used today.
In x86, so in Intel and AMD processors, et cetera, because these page table levels are 9 bits each and there are four of them,
the top 16 bits are unused; the middle part of the address
is translated via the four page table levels,
and the offset is just a direct position
within the page itself.
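As an illustration, here is a sketch of how such a 48-bit virtual address could be decomposed in software, assuming the standard 9+9+9+9 index bits plus a 12-bit offset (my own example, not from the slides):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t va = 0x00007f3a12345678ULL;   // hypothetical 48-bit virtual address

    uint64_t offset = va & 0xFFF;          // bits 0-11: offset within the 4 KiB page
    uint64_t l1     = (va >> 12) & 0x1FF;  // bits 12-20: index into the lowest page table
    uint64_t l2     = (va >> 21) & 0x1FF;  // bits 21-29
    uint64_t l3     = (va >> 30) & 0x1FF;  // bits 30-38
    uint64_t l4     = (va >> 39) & 0x1FF;  // bits 39-47: index into the top-level table
    // Bits 48-63 are unused with 4-level paging.

    printf("L4=%llu L3=%llu L2=%llu L1=%llu offset=%llu\n",
           (unsigned long long)l4, (unsigned long long)l3,
           (unsigned long long)l2, (unsigned long long)l1,
           (unsigned long long)offset);
    return 0;
}
```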
So the page table is, as I said, just a simple table.
Each entry has something like a validity flag and the physical address.
And we locate it via a translation table base register,
which points to where our page table, or the address space, starts.
It could potentially be zero, so we start from address zero,
and then every next entry follows on top of that.
So if we want to find something at virtual address 2,
then, I mean, virtual address 2 would for sure be in the first page, of course,
because we're still in the first page.
But say we want something in the second page; then we will go.
Let me get the pointer.
We will basically go to the page table,
find the entry for the second or third page, learn the physical address,
and then in there we can find it.
And for this, we need this translation.
So we translate the virtual page address with the page table,
then just add the offset,
and we get the actual physical address.
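As a rough software sketch of this idea (a single-level page table with a validity flag and a frame number; the real structures are maintained by the OS and walked by the hardware):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <optional>

// Simplified, single-level page table entry: a validity flag and the
// physical frame number that the virtual page currently maps to.
struct PageTableEntry {
    bool     valid = false;
    uint64_t physical_frame = 0;
};

constexpr uint64_t kPageBits = 12;  // assuming 4 KiB pages

// Translate a virtual address via the page table; empty result = page fault.
std::optional<uint64_t> translate(const std::vector<PageTableEntry>& table,
                                  uint64_t virtual_address) {
    uint64_t vpn    = virtual_address >> kPageBits;
    uint64_t offset = virtual_address & ((1ULL << kPageBits) - 1);
    if (vpn >= table.size() || !table[vpn].valid)
        return std::nullopt;  // not resident: the OS would have to page it in
    return (table[vpn].physical_frame << kPageBits) | offset;
}

int main() {
    std::vector<PageTableEntry> table(16);
    table[2] = {true, 42};  // virtual page 2 -> physical frame 42
    if (auto pa = translate(table, 2 * 4096 + 100))
        printf("physical address: %llu\n", (unsigned long long)*pa);
    return 0;
}
```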
However, the page table can get large quickly.
Say you have 16 gigabytes of RAM,
4-kilobyte pages, and 8 bytes per page table entry;
this means your page table will be 32 megabytes.
And of course, this is not something you want to cache.
This is something that will end up in your memory
but won't be in your top-level caches.
You remember there are multiple levels of caches,
and 32 megabytes is already too large
to keep in the topmost caches.
And actually, you don't want to use all your cache just for the page table anyway.
So this means any kind of memory access,
or even just finding out where something is located,
would actually go to memory.
And well, every such access costs basically 200 cycles, something
like that.
And as we know, it's not just one table.
It's four tables.
Because we have this hierarchy of page tables,
this would mean multiple rounds of memory accesses, which,
of course, would be way too slow.
So we need something else.
We need to do this faster.
And for this, there's something called the translation look-aside buffer.
And that's a hardware unit, again,
to speed up the address translation.
So it's an associative cache table.
This means basically we can check multiple positions
in this cache at the same time.
And this is hardware.
So this is really super fast.
So the level-one translation look-aside buffer
responds almost instantaneously;
it will be within one cycle, essentially.
And the way this works is that we're basically caching the physical addresses
for the virtual addresses, to do a direct translation
while we're executing our code.
So we're trying to find a physical address,
and before actually checking the page table,
or while we're actually checking the page table,
at the same time, we can also check in the translation look-aside
buffer, or TLB, and see if we have this address cached
there.
And if we have it cached there, then we instantaneously
get the physical address
and can actually go to memory
and find the actual page that we're
looking for.
If the data that we need is already in the cache, of course
we don't need to do this.
But if we want to access something in memory, then we will have to go
this way, because we need the virtual address to be translated into a physical address.
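Conceptually, the lookup order looks like the following sketch (a software analogy only; the real TLB is a small associative hardware cache checked in parallel, and walk_page_table is a hypothetical stand-in for the multi-level walk):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

constexpr uint64_t kPageBits = 12;  // assuming 4 KiB pages

// Toy TLB: virtual page number -> physical frame number. A real TLB is a
// small associative hardware cache, not a hash map.
using Tlb = std::unordered_map<uint64_t, uint64_t>;

// Hypothetical stand-in for the multi-level page table walk; here just an
// identity mapping so the sketch is self-contained.
uint64_t walk_page_table(uint64_t vpn) { return vpn; }

uint64_t translate(Tlb& tlb, uint64_t virtual_address) {
    uint64_t vpn    = virtual_address >> kPageBits;
    uint64_t offset = virtual_address & ((1ULL << kPageBits) - 1);

    auto hit = tlb.find(vpn);
    if (hit != tlb.end())                 // TLB hit: ~1 cycle in hardware
        return (hit->second << kPageBits) | offset;

    uint64_t pfn = walk_page_table(vpn);  // TLB miss: several memory accesses
    tlb[vpn] = pfn;                       // cache the translation for next time
    return (pfn << kPageBits) | offset;
}

int main() {
    Tlb tlb;
    printf("%llu\n", (unsigned long long)translate(tlb, 0x4321));  // miss, then cached
    printf("%llu\n", (unsigned long long)translate(tlb, 0x4abc));  // hit (same page)
    return 0;
}
```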
And so this is basically just storing the most recently
accessed addresses.
And because it's an associative cache,
it doesn't go through the entries one by one;
it's a hardware unit which compares everything
at the same time.
If it's fully associative,
then all of the entries in the translation
look-aside buffer will be checked at the same time
and then directly translated.
And if we don't have the address
cached in the TLB, then we have to go to the page table.
And actually, you can also see this
if you're running some kind of benchmarks.
If you're running out of the TLB, then all of a sudden
you need these multiple rounds of memory accesses
for anything that you want to access in memory,
and that will be slow.
So if your working set is larger than what the TLB covers,
and you have to go through the working set many times, for example, then this means you
will have TLB misses, which means even just a page
translation will result in memory accesses.
So just as an overview, so how large are these TLB caches?
So on Intel, for example, there's
an instruction translation look-aside buffer
for the level one instruction cache, which has 64 entries
and is eight-way associative.
I'll explain what this means in a bit.
Then there's a data translation look-aside buffer
with 64 entries, which is four-way associative.
And we maybe remember that the level one caches
on the Intel CPUs, at least for some generations,
are split into an instruction and a data cache.
And then on the level 2 cache, we already
have one general larger cache for everything.
And there, for the translation look-aside buffer,
we have 1,024 entries, which are eight-way associative.
And I tried to find this out for the Mac as well.
For the Mac, for example,
I didn't find the associativity,
that is, what the set sizes are.
But you can see that the translation look-aside buffers
are actually much bigger there.
So we can have more entries in our translation look-aside buffer
on level 1 and level 2,
by a factor of roughly 4 and 3.
The associativity I will show in a bit. So let's look at the caches a bit more.
And well, how we use them and why they're actually effective.
As we said, this is something that as a programmer,
we don't see.
So I cannot, or there's very little
that I can do to somehow influence the caching behavior.
There are some commands, some instructions,
that give hints to the CPU on how to cache things, et cetera.
But in general, if I'm just programming away,
I won't really have any influence on what is cached
and what is not.
The CPU basically tries to be smart about this.
And sometimes it is, sometimes it isn't.
What we can do is write cache-friendly code.
We can write code in a way that it fits well
to how the cache is organized,
or we can write it in a way that it doesn't fit well
to how the cache is organized.
So the idea of caches in general,
why does it make sense in the first place,
is that we have a principle of locality.
Often we see that there's a hot set, right? We're
not really using all of the data all of the time, but there's a small
amount of data, or a small set of variables, that is used very frequently, updated all the time. And
it makes sense to keep these in a special memory structure which is closer to the CPU, because if the data that is
accessed all the time is faster to reach, then the whole program
will be faster.
So if we have a hot set and that fits into caches, then this time that we spend there
will be much less than if we would have to go to memory all the time.
And so there are different kinds of locality: spatial locality and temporal locality.
Spatial locality means that, say, for example, if we're talking about a page, we have
our pages somewhere in memory, and one page is one unit of memory
that is closely co-located.
So everything is close together.
We know that this will be stored
in one location in memory.
So this is what we call spatial locality:
stuff is close by.
But then there's also temporal locality,
which means we're using data again and again,
temporally close, so soon after using it before.
A good example of a database operation
that is neither of those is a scan.
A scan basically goes through a lot of data. There is some
spatial locality, because we're going through the pages consecutively,
so we're not jumping within a page, but we're going
through a lot of data and we're not reusing data, so there's no temporal
locality within the data. A different case would be an array
that we update all the time, something like that.
Then if we have a small loop, a small program that loops
over this all the time, then we have good temporal locality.
And this doesn't only affect the data,
but it also affects the code,
and that's important to know.
If your code is huge, this means your code
doesn't have good spatial or temporal locality.
If you have very large loops inside your code,
then there are a lot of instructions
that need to be executed, and they are not
close to each other, so you
might have large jumps in your code.
The same goes for function calls:
if you have many functions, huge call stacks,
et cetera, and execution jumps back and forth in the code,
then the code won't be cached.
This means that in order to load the library function that you're calling now,
and the next one over there, you always go to memory.
This will also slow down your program.
If you have a small tight loop that's executed all the time, then you have this kind of locality.
The code will be cached, the data will be cached,
and you're iterating over the same data
that will stay in the cache.
And so let me show you a small example.
So if you have this kind of loop,
a for loop where we're computing the sum of an array,
then we have spatial locality
because we're accessing the array in a stride-one pattern,
one item at a time;
one by one, we're going through the data.
And then all of the instructions
are also accessed in sequence,
so we don't have any larger jumps.
Of course, we have one jump for the for loop,
but this is also tight,
so our code is also spatially local.
And then we have temporal locality:
we have the sum, which is referenced in each iteration.
Because there's not a lot else going on,
this will for sure be cached,
probably in the L1 cache,
and maybe it even stays in a register, if we're lucky.
So this can be accessed all the time
with high locality. And we execute the same loop all the time.
So the loop, the code of the loop
will also be accessed all the time,
meaning the code also will be cached in our caches.
So there's no need to go somewhere to the memory
to find the next instructions for this loop.
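The loop in question looks roughly like this (my reconstruction of the example, not the exact slide code):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sums an array with a stride-one access pattern: good spatial locality for
// the data, and `sum` is reused in every iteration (temporal locality), so it
// likely stays in a register or at least in L1.
int64_t sum_array(const std::vector<int32_t>& data) {
    int64_t sum = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
    }
    return sum;
}

int main() {
    std::vector<int32_t> data(1'000'000, 1);
    printf("%lld\n", (long long)sum_array(data));
    return 0;
}
```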
Okay, so I showed you this example, I think, last time already, or before:
the sum of all entries in a two-dimensional array.
And I want to do this with an example today as well. So here, the idea is we have this two-dimensional array.
It's basically a fake two-dimensional array:
it's a one-dimensional array which we access in a two-dimensional way.
So we're building our own two-dimensional array
with rows and columns, and then we're iterating over it either
row-wise or column-wise.
Meaning, in one alternative, for each row we're going through all of the columns.
So if this is our array,
then we're starting with the first row,
and then we're going through the columns.
And with the way we're laying out
and accessing the data here, this means
we're going in this stride-one pattern through the data.
Right?
So we're accessing the first element, the second element, and so on, all the way
through the whole array.
And the alternative would be doing column by column and then the rows.
So we're basically saying: I first take the first column and then all of the rows.
But this is not how the array is laid out in memory, right?
Because we laid it out in row-wise fashion.
So this means we have this pattern where we jump each row.
If our row is 10 tuples wide,
we access the first element, jump 10 elements,
access the next element, and so on.
This easily gets a lot less efficient
than if we access row-wise, in a stride-one fashion.
So here, I think I also showed you this before.
Here we see the difference
if we vary the shape of the two-dimensional array,
from very thin and long to very wide and shallow,
let's say.
And in the middle, you would have a square array.
And you can see the row-wise access is more or less stable,
while the column-wise access gets a lot less performant
quite quickly.
So here you can see it's basically almost a factor
of 100 in difference.
And well, we can also try this.
So I wrote up a simple program.
Here you can see this is just a single iteration,
so this would be for the square array.
So we have the number of rows and the number of columns,
we're creating a new array,
and we're filling the array with random numbers between 0 and 9,
just so my sum doesn't explode.
And then we're measuring here the column-wise traversal.
So here, I basically switch this:
we're starting with the columns and then the rows,
and I'm just doing some counting here.
Then I do the other one with the row-wise traversal,
and we can see the difference between the two.
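A condensed sketch of the two traversals (my simplified version of the idea; the actual benchmark code is in the GitHub repository):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Fake 2D array: a 1D vector indexed as row * num_cols + col (row-major).
int64_t sum_row_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t r = 0; r < rows; ++r)      // outer loop over rows
        for (size_t c = 0; c < cols; ++c)  // stride-1 through memory
            sum += a[r * cols + c];
    return sum;
}

int64_t sum_column_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t c = 0; c < cols; ++c)      // outer loop over columns
        for (size_t r = 0; r < rows; ++r)  // jumps `cols` elements per access
            sum += a[r * cols + c];
    return sum;
}

int main() {
    const size_t rows = 1 << 14, cols = 1 << 14;  // hypothetical ~1 GiB square shape
    std::vector<int32_t> a(rows * cols);
    for (auto& x : a) x = std::rand() % 10;       // values 0-9 so the sum stays small
    printf("row-wise:    %lld\n", (long long)sum_row_wise(a, rows, cols));
    printf("column-wise: %lld\n", (long long)sum_column_wise(a, rows, cols));
    return 0;
}
```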
And I'll show you this.
Let me just quickly see.
OK.
So here it's the same program, with the difference
that I did it with a couple of iterations,
so we could test these different shapes
of the array, from very thin to very wide, basically.
And I've executed this before, but let's actually
do it on the command line.
Because if we do it here in the profiler or in the IDE,
the code is not optimized by the compiler,
so the execution will actually be a lot slower
than if we properly optimize the code.
So here, I compiled the whole thing,
I can do this again,
with optimization flags. And
then if we execute this, it will still take a bit, because I'm creating,
I don't remember, a gigabyte or something in memory, and then going through the different rows and columns. And you can see here we're starting with many columns
and a very thin, long array, and then basically changing
the shape one by one, making the array wider and wider
and shallower and shallower, until at a certain point
we have a very wide, very shallow array.
And here you can see the results.
And what we can see, in milliseconds,
is the traversal of, I don't remember,
something like a hundred million or a billion elements
that we're going through.
So with four bytes each, that would be some gigabytes;
I would have to check again.
So basically, if we go through this row-wise,
just accessing the data
as it is laid out in memory,
we're at about 80 milliseconds here.
So we're going through this data very fast.
However, if we go column-wise, we're jumping all the time, right?
So for every access, especially for the shapes where the
matrix is square, we're jumping very far in the array from one position to the next,
and basically everything has to go to memory.
Yes?
So why doesn't the compiler see that we are just
doing the sum and accessing it in an optimal fashion?
I mean, for the row-wise case, I would expect it to merge
the two loops so that it's only one loop over the whole range.
Which two loops do you mean?
The column and row loop. In the end, it's just one large list, right?
Yes.
The array.
Yes.
I would expect the compiler to merge this column-wise and
row-wise iteration into the better approach,
where we just go through the array.
Yes.
But for the column-wise one, why doesn't it just see that we are also just doing a sum
over the whole array?
So the compiler
tries to optimize here, but the compiler can't see everything.
So in this access pattern, the compiler
doesn't see that it could potentially rearrange the way
that the array is accessed.
And I'm not a compiler specialist,
but there are rules about what the compiler can and cannot do.
Completely changing the memory layout, for example,
it cannot do; the memory layout needs to stay the same.
And then for jumping around in the data,
it would have to analyze whether the data stays constant.
I mean, here it maybe could find something.
Say, for example, if I'm not checking the data in between,
so if I'm not outputting the sum somewhere in between,
then it would actually optimize away the whole thing:
if we don't do anything with the result,
this part of the code will be optimized away,
and we wouldn't see any measurements here.
However, for some reason, and I have to say I don't know why,
this type of access won't be optimized by the compiler.
Potentially, in this small example, it could.
But of course, you can also think of more complex cases
where it's not really clear if these indexes stay constant,
or if there are some side effects,
and this is where optimization quickly stops altogether.
If some of these were, say, variables that you pass at runtime,
like if this were a function
and this were a parameter of the function
determining how these are accessed,
then the compiler cannot optimize anymore.
So now I'm assuming it's the same here:
these columns and rows are outside of the scope
of what the compiler sees in this loop,
because they come from somewhere outside.
So that's my assumption.
But it's a good question.
I can ask one of my team members who's deeper into compilers,
and maybe they have an idea where this comes from.
OK, so I also wanted to show you what this looks like. So basically, I also plotted this earlier.
And you can see here that we have a similar pattern.
So this is on my laptop.
On a different laptop, you might see
a slightly different pattern.
But you can see that this column-wise traversal here
really is much, much worse in these cases
where we access very distant memory locations.
To some extent, we basically
break everything here.
So we're accessing individual pages,
because all of a sudden, a single row
won't fit into a page anymore.
So every single access will basically
mean accessing a completely new page somewhere else
in the memory.
At a certain point, we'll also break the TLBs,
meaning that the cached translations won't help anymore, because we're jumping around a lot in the address space.
And then basically this is what happens, right?
So we see all of the caches don't help us anymore.
We're accessing very little data,
but at a large granularity.
Because from memory, on this machine, for example,
we'll basically fetch 16 kilobytes at a time,
where we just want a single integer.
OK. So, basically, what the CPU does for each and every step, each iteration
here: each element in the array basically has a memory address,
a virtual address, that we then can read
and that will need to be translated, et cetera. So
for every access, the CPU checks if the cache line is already cached, and if it's
cached, then it will read the data directly from the cache and it won't need
to access the lower-level memory. So this will be much faster. And this is good for us, especially
if we have these hot loops where we basically
go over the data many times.
But this is also good for us if we're accessing
at a larger granularity.
So we're reading pages from memory,
in this case, 16 kilobytes,
from memory into our caches.
And if we're accessing those consecutively, they will be in cache,
so then we will have fast access there.
If we have a cache miss, well, then we need to go to the lower-level memory and read the full cache line.
And then we need to evict something.
Probably the cache is not empty unless we just
rebooted our system.
But even then, the OS will have done something,
so the cache won't be empty.
So that means we read the cache line, we evict something else,
and replace it with the newly read cache line.
And then the CPU cannot do anything, at least in this loop,
until the data becomes available.
And now you can imagine that,
depending on our column and row size,
there are multiple levels where we basically get additional stalls,
or where we basically do not benefit from caching anymore. Even the column-wise traversal will
benefit from caching if we read something from a page again before it's evicted
from the caches, right? So we have 16 kilobytes, meaning we can actually
have a couple of thousand values in our page.
If our row size is not a couple of thousand values,
then basically we will hit data in the same page again.
So that's already something.
If our page is not evicted yet while we're
accessing the same row again, we're
basically saving some time, even if we're not reading everything
consecutively.
But at a certain point, if our rows are too large
and we basically have to jump too far,
then all of that doesn't happen anymore,
and we have to go through all of the cache
levels and read the data from memory.
And for this, the CPU will stall.
The CPU cannot do anything.
And this is why this is much slower.
OK.
And well, there's a big hit;
you already saw this in the example.
There's a big difference between the cost of a cache hit and a cache miss.
This is a factor of 100 on a CPU,
or even 200, depending on the architecture, et cetera.
But in terms of clock cycles: if we're in a register,
we will be there in one cycle.
Level one can still be accessed in about one cycle.
Memory, 100 to 200 cycles, something like that.
So we can also calculate cache performance by basically
calculating the miss rate. We
can calculate how well our code performs in terms of the cache
by asking: what's the fraction of memory references
not found in the cache?
That's the number of misses divided by the number of accesses,
which is one minus the hit rate.
So we want a very high hit rate,
and the miss rate is basically its counterpart.
Then we might know how fast our hit time is:
that's the time to deliver a cache line from the cache
to the processor.
And the miss penalty
is the additional time required because of a miss.
And on average, and this basically depends on the program,
the average time to access memory, considering hits and misses, is the hit time plus the miss rate times the miss penalty.
So that's how fast our memory would be in a program. And this also means that if we have a good hit rate:
technically, the memory always
has the same speed, right?
But the more we can make use of the cache,
the faster it will appear.
And the interesting thing is that something like a 99% hit rate is
twice as good as a 97% hit rate.
Say we have a hit time of one cycle; then we get 1 + (1 - 0.97) × 100.
So in 3% of the accesses, we have to pay this price of 100 cycles,
which gives 3.
And for a hit, we pay one. So on average, we basically have four cycles per memory access.
And if we have a 99% hit rate, then we still pay
one cycle for each access,
but we only have to pay the 100-cycle miss penalty
in 1% of the cases.
So on average, we will pay about two cycles per memory access.
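As a small check of that arithmetic, here is the average-memory-access-time formula with the assumed numbers of a 1-cycle hit time and a 100-cycle miss penalty:

```cpp
#include <cstdio>

// Average memory access time = hit_time + miss_rate * miss_penalty.
double amat(double hit_time, double hit_rate, double miss_penalty) {
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main() {
    // Assumed: 1-cycle hit time, 100-cycle miss penalty.
    printf("97%% hit rate: %.1f cycles\n", amat(1.0, 0.97, 100.0));  // ~4 cycles
    printf("99%% hit rate: %.1f cycles\n", amat(1.0, 0.99, 100.0));  // ~2 cycles
    return 0;
}
```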
And this is basically if your cache performs well.
If our program has very poor cache performance,
like what I showed you earlier,
then this can degrade a lot,
and we might have to pay a hundred cycles per memory access, like we did here.
So you could see the factor of 100 difference between
the cache-friendly accesses and the cache-unfriendly
accesses.
OK.
So now let's look in a bit more detail at how the cache works.
For caching to be fast, the overhead of caching
needs to be reasonable.
So we don't want to pay a lot for stuff in the cache,
because then it doesn't make sense anymore
if we're close to memory speed again.
And we also don't want a too fine-granular organization,
because that, again, would cost us a lot. So we organize the cache in cache lines.
We don't have individual bits in there, we don't have individual bytes, we have complete
cache lines which are cached.
This would be 64 bytes, for example.
And the caches are actually loaded in these complete cache lines.
So while in memory
we're talking about pages, right?
So a complete page.
In the cache, or in the top-level caches,
we won't have a full page.
The page is 4 kilobytes (or 16 on the Mac), but only a sub-part of the page
will be in there, namely 64 bytes.
And on Apple, this again is a bit larger, so 128 bytes, for example.
And this will still be a sub-part of the page:
a consecutive memory region that we load into the cache.
And then our cache is organized in these lines,
and we can load many of these lines,
or we can have multiple of these lines in the cache.
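If you're curious about the cache line size on your own machine, here is a small sketch for Linux with glibc (the sysconf name used here is a glibc extension and not portable; on macOS you would query `sysctl hw.cachelinesize` instead):

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    // glibc/Linux extension: L1 data cache line size in bytes (often 64).
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
#else
    // Not available on all platforms (e.g., macOS exposes it via sysctl instead).
    printf("cache line size query not available via sysconf here\n");
#endif
    return 0;
}
```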
So for example, here's an example of an AMD cache.
Here we would have per core level one instruction
and a level one data cache.
And then we have a level two cache, level three cache,
which is shared across all of the cores,
and then the connection to main memory.
And for the level one cache, we have 64 kilobytes with 64-byte cache lines.
And this is consistent with what you would see in an Intel CPU as well.
The caches might be a bit larger by now, but you will have the same kind of cache lines.
Then we have the shared cache with one megabyte,
also with 64-byte cache lines.
And the latencies: in level one,
you have two cycles;
in level two, seven cycles;
and if we have a miss there and go to main memory,
then we're at something like 160 to 180 cycles,
for example.
And so here are some basic approximate numbers that you can maybe remember
to some degree.
I mean, it's, again, something that you just have in mind.
If we're in a register,
this is immediate:
the data in a register is what we're working on.
Then we have level 1 at 2 to 4 cycles, something like that.
Level 2 at 7 to 14.
The level-1 translation look-aside buffer could be 12.
The translation look-aside buffer on level 2 around 30,
might even be a bit faster. And main memory is
then, again, an order of magnitude slower,
in the 200s, something like that.
And with that, let's take a quick break, five minutes,
and then we're going to talk about cache associativity.