Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching (2)

Episode Date: May 2, 2023

...

Transcript
Starting point is 00:00:00 Welcome to our next session. Today we're going to continue talking about the CPU architecture and caching. And quick announcement, I also added some of the code examples that I showed you so far to our GitHub. So you can basically also try this on your laptops if you want to see what the performance of your CPU is or also the characteristics of your CPU.
Starting point is 00:00:33 I've updated the slides, so you have the link. I'll probably also post it in Moodle somewhere. And I'm going to add more code examples as we go. It always takes me some time to clean them up, et cetera. So, but I'll try to add more as we go. We're still where we are, but I'm already a bit behind schedule. So I'm shifting things a bit.
Starting point is 00:01:01 So we're going to start with the task next week on Wednesday rather than on Tuesday. And that's all fine. Just so you get some SIMD information already before we start with the task. And yeah, otherwise everything shifted a bit, but that should be fine.

Starting point is 00:01:25 So we have enough time, as we said. And with that, let me actually dive right back into where we stopped. If you remember, we talked about the memory hierarchy and mainly about caches, and how the different kinds of memory technologies work. Today we're going to continue with how these are actually organized and how they are presented to the programmer. Because of course, this is kind of fine if you're a proficient programmer, you can deal with the different cache levels,
Starting point is 00:02:07 but it's also a hassle. Right? And so in the end, we don't really want to have to directly address different kind of levels of caches. And so this is why the CPU does this for us. And the OS. And for this, there's something called virtual memory. So rather than having to think about all these different levels from a programming perspective, we try to give a continuous address space to the programmer. Or not we, but the system, the CPU and the OS try to give a continuous address space to the programmer.
Starting point is 00:02:46 And this is called virtual memory. And this is also good because often programs actually need more memory than there is physically available. So typically, there's a certain amount of memory that the program can address or can reserve but then you might have multiple programs or processes running at the same time and then in the end you have so much stuff in on your machine that it doesn't all fit into your memory and then technically I mean if you have to deal
Starting point is 00:03:19 with this yourself you basically have to know is this in memory is this somewhere in the cache is this on disk somewhere and you basically have to know, is this in memory? Is this somewhere in the cache? Is this on disk somewhere? And you would have to change these addresses and the addressing based on this. And since you don't want to do this, as a programmer, there's virtual memory, which basically gives you a continuous address space
Starting point is 00:03:40 that also doesn't care, like from an address space perspective, where in memory or where on disk this is actually residing. So per process, we get virtual memory, which is continuous, contiguous, and then the OS and the CPU basically map this to different physical locations. So this can be actually in memory or it could even be on disk. And as a programmer we don't know. So then we basically just have to deal with this one big address space. But of course if we would do this on a per byte level this would mean that
Starting point is 00:04:27 we have like a lot of bookkeeping to do we need to basically do the address translation for each individual memory address so rather than doing this we split up into typically equal sized pages. This means on most processors, this would be 4 kilobytes. But there's also more instructions. Most instruction sets would be 4 kilobytes, but there's also different levels. So Intel or x86, for for example also supports huge pages, all typical processes today do this. So they can be two megabytes or even up to gigabytes depending
Starting point is 00:05:16 on the architecture. And these pages typically are in the size of a power of 2. So we can then address them with, so if we have 2 to the power of k, then we can address all of the bytes within a page with k bits. So in the terms in like for kilobytes, this, for example, would be 12 bits, right? So for 12 bits, we can basically completely address all bytes within four kilobytes. And the memory is then split up into page frames
Starting point is 00:05:58 and we're just storing like in the virtual address, we're basically storing the page address or the virtual page address and we're basically storing the page address or the virtual page address and then the offset within the page. So the layout in the memory within a single page will be exactly the same as it would be on disk or in memory. So the virtual page is an exact copy of the physical page, but where exactly the physical page will be then lying. This is basically dependent on the OS,
Starting point is 00:06:31 dependent on if it's right now in memory, if it's paged out to disk, et cetera. And because this needs to be fast, right? So now we have these virtual addresses or this virtual space um now we need to basically see find this fast uh where exactly are our pages how do we we're like where is the virtual page how do we translate the virtual page into a physical page because of that there's a special unit on the cpu dealing with this and i mean typically this is as I said 4 kilobytes
Starting point is 00:07:05 on Mac for example 16 kilobytes and you can actually find this out if you're curious so I can also show this let me see if I have still have this open there's probably too small to see so did some bad compilation earlier. So there's a, if you do get conf page size, then this will basically give you your page size and you can see on Mac, this would be 16 kilobytes, for example. Okay. would be 16 kilobytes for example. Okay, so virtual memory, individual pages in my case 16 kilobytes, if you have an Intel laptop 4 kilobytes and these placed in memory or Now, how does the translation work? OK. So we said, like, within a page, the address is the same as, so the offset, the page layout is the same physically as virtually.
Starting point is 00:08:26 So whatever we store within a page will be stored exactly in the same way. So there we can use a certain offset. So for four kilobytes, we need 12 bits. And that will be the least significant bits of the address. So the last 12 bits basically will be the offset that will directly translate to what is laid or how the page is laid out. And now if we have 16 kilobyte pages,
Starting point is 00:09:01 what would be our offset so we need 12 bits to address 4 kilobytes 14 bits exactly so 12 for 4 kilobytes 13 for 8 kilobytes 14 for 16 kilobytes right and if you have huge pages for example so we also said we could have 13 for 8 kilobytes, 14 for 16 kilobytes. And if you have huge pages, for example, so we also said we could have larger tables, larger pages, then we would even need more. So for 2 megabytes, for example, would need 21 bits. So 1 megabyte would be 20 bits two megabytes 21 bits and that then will be the offset so this will be the last part of the address and
Starting point is 00:09:51 this basically we never need to really change because we're just going to use it as is as on the on the page itself however then we have the virtual address part which basically translates or the address of the virtual page, which then translates to the actual physical address, which would basically point to wherever the physical page is lying around right now. And this can be in memory, but could also be on disk, right? So this will be transparent.
Starting point is 00:10:28 And even if you think about it, right? So if you have multiple processes and the processes are basically changing, then we might move some like new process comes in. We have to load some data, some pages, some other pages need to go back on the disk then basically we have to change these physical addresses while we're keeping the same virtual addresses for the process so the virtual the
Starting point is 00:10:56 process won't see any change in the address but under the hood basically there's the OS and the CPU make sure that there's always enough memory for all of the processes currently running. And the pages that are not needed right now can be flipped out and can be removed. So now we have this translation. So we have the virtual address that needs to be translated to the page address.
Starting point is 00:11:24 The offset remains the same as we said. So for this, we just need a page. We're using a page table. And this is actually just a simple data structure that says, OK, on this address or this address, virtual address is translated to this physical address. And this physical address would then say, OK, this address in memory, for example, or this address address is translated to this physical address. And this physical address would then say, OK, this address in memory, for example, or this address on disk.
Starting point is 00:11:49 So this way, we can easily find it. And here, I'm basically showing you a single page table. Or I will show you or explain a single page table. In modern processors, there's multiple levels of page tables. So in the x86 processor right now, there's four levels of page tables, and each of them having 9-bit. So all in all, then, 36-bit of page table
Starting point is 00:12:24 that we can basically use for addressing. So we can have a total of 36 bits for individual pages maximum. And then on top of this, the address. And this basically gives you the maximum number of memory addressable within the process as well. And so this is also why even if you're having 64-bit addresses,
Starting point is 00:12:52 a certain space up in the front, like the most significant bits, are actually not used today. So in x86, so in Intel AMD processors, et cetera, because these page page tables are nine bit and there's four of them the top bits are unused so this part of the address is not used this part will be directly translated basically within like four page tables. And then the offset will be just a direct position within the page itself. So the page table is just, as I said, a simple table.
Starting point is 00:13:39 It has something like a validity flag and the physical address. So we're basically saying this based on a Translation table base register. So this is basically offset where our virtual address Starts or the address space starts. Could potentially be zero. So we're starting from Address zero. And then we're going on top of those. That would be each every next address. So if we want to find something on virtual address 2, then, I mean, virtual address 2 would for sure be in the first page, of course, because we're still in the first page.
Starting point is 00:14:19 But say in the second page something, then we will go. Let me get the pointer. We will basically go to the page table, find the second or third page, know the physical address, and then in there we can find it. And for this, we need this translation. So we'll translate the virtual address, the virtual page address with this page table, and then just add the offset,
Starting point is 00:14:53 and we get the actual physical address. However, the page table can get large quickly. So if you can say you have 16 gigabytes of RAM, four kilobyte pages, each address is four bits, this means your page address or the page table will be 32 megabytes. And of course, this is not something you want to cache. This is something that will end up in your memory
Starting point is 00:15:24 but won't be in your top level caches. You remember there's multiple levels of caches. And 32 megabytes is already too large to keep it in the top most caches. This means, and actually, I mean, you don't want to use all your cache just for the page table. So this means any kind of access would actually go to memory. Any kind of memory access that you want,
Starting point is 00:15:51 or just finding out where to address something, means going to memory, where to find something. And well, every access means basically 200 cycle, something like that. And as we know, it's not just one table. It's four tables. Because we have this hierarchy of page tables, this would mean multiple rounds of memory accesses, which,
Starting point is 00:16:16 of course, would be way too slow. So we need something else. We need to do this faster. And for this, there's something called the translation look aside buffer. And that's a hardware unit, again, to speed up the address translation. So it's an associative cache table.
Starting point is 00:16:38 This means basically we can check multiple positions in this cache at the same time. And this is hardware. So this is really super fast. So the level one translation look-aside buffer will be almost instantaneously. It will be within one cycle, essentially. And the way this works is we have the, we're caching basically the physical addresses
Starting point is 00:17:08 for the virtual addresses to do a direct translation while we're executing our code. So we're trying to find something, we're trying to find a physical address and before actually checking the page table or while we're actually checking the page table, at the same time, we can also check in the Translation Look Aside buffer, or TLB, and see if we have this address cached
Starting point is 00:17:33 there. And if we have it cached there, then instantaneously, we right away get the physical address and can actually go to the memory and find the actual page that we're looking for. If our page is cached, like in or the data that we need is already in cache, of course we don't need to do this.
Starting point is 00:17:54 But this is if we want to access something in memory, then we will have to go through this way because we need the virtual address to be translated into a physical address. And so this is basically just storing the most recently accessed addresses. And because it's an associative cache, the way this works, it doesn't go through the cache one by one. Or the way that it's a hardware unit which compares everything at the same time.
Starting point is 00:18:29 So all of the entries, if it's fully associative, then all of the entries in the cache in the translation look-aside buffer will be checked at the same time and then directly translated. And well, if we don't have it, if we don't have the address cached in the TLB, then we have to go to the page table. And actually, you can also see this if you're running some kind of benchmarks.
Starting point is 00:19:01 If you're running out of the TLB, then all of a sudden, you need these multiple rounds of memory accesses And then the other thing that I would like to point out is that if you're running out of the T-LB, then all of a sudden you need these multiple rounds of memory accesses for anything that you want to address in memory, or that you want to access in memory. And that then will be slow. So if your working set is larger than the T-LB,
Starting point is 00:19:24 and you have to go through the working set a lot of times, for example. But then this means basically you will have TLB misses, which then means even just a page translation will result in memory accesses. So just as an overview, so how large are these TLB caches? So on Intel, for example, there's an instruction translation look-aside buffer for the level one instruction cache, which has 64 entries and is eight-way associative.
Starting point is 00:19:59 I'll explain what this means in a bit. Then there's data translation look-aside buffer with 64 entries, which is four-way associative. And we maybe remember that the level one caches on the Intel CPUs, at least for some generations, are split into an instruction and a data cache. And then on the level 2 cache, we already have one general larger cache for everything.
Starting point is 00:20:33 And there we have, for the translation look aside buffer, we have 1,024 entries, which are eight-way associative. And I tried to find this out also for the Mac. So for the Mac, for example, we see that I didn't find the associativity, what the set sizes are. But you can see that the cache or the translation look aside buffers are actually much bigger here.
Starting point is 00:21:01 So we can have more tables in our translation look-aside buffer, or more entries in our translation look-aside buffer, on the level 1 and level 2. So by a factor of 4, basically, and 3. So the associativity, I will basically, I mean, yeah, I will show this in a bit. So let's look at the caches a bit more. And well, how we use them and why they're actually effective. As we said, this is something that as a programmer, we don't see.
Starting point is 00:21:46 So I cannot, or there's very little that I can do to somehow influence the caching behavior. There is some commands, some instructions that give some hints to the CPU how to basically cache things, et cetera. But in general, if I'm just programming away, I won't really have any influence on what is cached and what is not. The CPU basically tries to be smart about this.
Starting point is 00:22:15 And sometimes it is, sometimes it isn't. And we can, what we can do is we can write cache-friendly code. We can write code in a way that it fits well to how the cache is organized, or we can write it in a way that it doesn't fit well to how the cache is organized. So the idea of caches in general,
Starting point is 00:22:36 why does it make sense in the first place, is that we have a principle of locality. Often we see that most of the data, or that there's a hot set, right? We're not really using all of the data all of the time, but there's like a small amount of data or small amount of variables or something that are used very frequently, just updated all the time. And it makes sense to keep these in a special memory structure which is closer to the CPU because if they're accessed all the time, then this basically, if they are faster, then the whole program
Starting point is 00:23:13 will be faster. So if we have a hot set and that fits into caches, then this time that we spend there will be much less than if we would have to go to memory all the time. And so there's different ways of locality. There's spatial locality and temporal locality. And spatial locality means that, say, for example, if we're talking about a page, right? So we have our pages somewhere in memory. And one page basically is one unit of memory that is closely co-located. So everything's close together.
Starting point is 00:23:52 We know that this will be directly stored in one location in memory. So this is what we call spatial locality. So stuff is close by. But then there's also temporal locality, which means we're using data again and again temporarily close, so soon after using something else. So a good example of an operation, database operation,
Starting point is 00:24:18 that is neither of those is a scan. A scan basically goes through a lot of data. I mean there is some spatial locality because we're consecutively going through the pages so it makes sense to not like we're not jumping in the page but we're going through a lot of data and we're not reusing data so there's no temporal locality within the data. Difference would be we have an area that we update all the time, something like that. Then if we have a small loop, a small program that loops
Starting point is 00:24:52 over this all the time, then we have good temporal locality. And this doesn't only affect the data, but it also affects the code, and that's important to know. If your code is huge, this means your code doesn't have good spatial or temporal locality. So you have very large loops inside your code, then this means, well, there's a lot of instructions
Starting point is 00:25:22 that will need to be executed, and they are not close to each other. So we might have some kind of, or you might have large jumps in your code. And same as kind of function calls. So if you have many functions that are like huge call stacks, et cetera, and things are jumping in the code based back and forth, then the code won't be cached.
Starting point is 00:25:44 So then it basically means in order to load this kind of library that you're calling now and the next one there, this always goes to memory. This will also slow down your program. If you have a small tight loop that's executed all the time, then you have this kind of locality. The code will be cached, the data will be cached, and you're iterating over the same data that will stay in the cache. And so let me show you a small example.
Starting point is 00:26:15 So if you have this kind of loop, so we have a for loop where we're computing a sum of an array. So then we have the spatial locality because we're accessing the area in a stride one pattern. So one item at a time. So one by one, we're going through this code. And then we also have like all of the instructions
Starting point is 00:26:41 are accessed in sequence. So we don't have any larger jumps. Of course, we have one jump for the for loop, but this is also tight. So our code is also spatially local or has a small locality. And then we have a temporal locality. So we have the sum, which is referenced in each iteration.
Starting point is 00:27:04 So for sure, because there's not a lot else going on, for sure this will be cached. So this will be probably in L1 cache. And we can maybe even this stays in the register, if we're lucky. So this can be executed or accessed all the time with a high locality. And we have the same loop all the time. So the loop, the code of the loop
Starting point is 00:27:32 will also be accessed all the time, meaning the code also will be cached in our caches. So there's no need to go somewhere to the memory to find the next instructions for this loop. Okay, so I showed you this example, I think, last time already, or before. So this sum of all entries in a two-dimensional area. And I want to do this with an example today as well. So here, the idea is we have this two-dimensional area. It's basically a fake two-dimensional area.
Starting point is 00:28:12 It's a one-dimensional area, which we basically give, like we're accessing in a two-dimensional way. So we're building our own two-dimensional area with rows and columns. And then we're iterating over it either row wise or column wise so meaning In the alternative one. We're basically for each row. We're going through all of the columns This means if this is our our area
Starting point is 00:28:44 Well doesn't matter. Then we're basically starting with the first row. And then we're going to go through the columns. And if we're laying out the data, and as you can see, the way we're basically laying or accessing the data here, this means we're going in this stride one pattern through this data.
Starting point is 00:29:07 Right? So meaning we're accessing the first element, the second element, and so on, all the way through the whole area. And the alternative would be we're doing column by column and then the rows. So we're basically saying, well, I first take the first column and then all of the rows. But this is not how the array is laid out in memory, right? Because we laid it out in row-wise fashion. So this means we have this pattern where we jump each row.
Starting point is 00:29:39 So we're jumping basically if our row is 10 tuples wide, then we access the first element, jump 10 elements, and access the second element, and so on. This easily gets a lot less efficient than if we access row-wise or in a one-strided fashion. So here, I also think I showed you this before. Here we see the difference. If we go through, I don't remember the shape of the area,
Starting point is 00:30:30 of the two-dimensional area. So from very thin and long to very wide and shallow, let's say. And in the middle, you would have a square area here. And you can see the row-wise axis is more or less stable. And the column-wise axis gets a lot less performant quite quickly. So here you can see it's basically almost a factor
Starting point is 00:30:58 of 100 in difference. And well, we can also try this. So I basically wrote up a simple program. Here you can see this is just a single iteration. So this would be for the square array. So we have number of rows and the number of columns. We're filling this into an array. And we're creating a new array. We're filling the into an area. And we're creating a new area.
Starting point is 00:31:26 We're filling the area with random numbers between 0 and 9, just so my sum doesn't explode. And then we're measuring here the column-wise traversal. So here, I basically switch this. We're starting with the columns and then the rows. And I'm just doing some counting here. Then I do the other one with the row-wise traversal. And we can see the difference of the two.
Starting point is 00:31:53 And I'll show you this. Let me just quickly see. OK. Let me show you this. So here, so it's the same program with the difference that I basically did it with a couple of iterations here. So we could test basically these different kind of shapes of the array, so from very small to very wide, basically.
Starting point is 00:32:30 And I've executed this before, but let's actually do it on the command line. Because you can see, if we do it here in the profiler or in the IDE, what happens is it doesn't optimize. So the code is not optimized by the compiler. So the execution will actually be a lot slower than if we properly optimize the code. So here, I basically compiled the whole thing.
Starting point is 00:33:01 I can do this again. Compile the whole thing. I can do this again. I compile the whole thing with optimization flags. And then if we execute this, it will still take a bit because I'm basically creating, I don't remember, a gigabyte or something in memory. And then go through the different rows and columns. And you can see here we're starting with many columns and a very thin, long area, and then basically changing the shape one by one, making the area wider and wider and shallower and shallower until at a certain point where we have a very wide, very shallow area.
Starting point is 00:33:48 And here you can see the results. And what we can see is that in milliseconds, the traversal of these, I don't remember, it's like 100, 1,000 million, 1 billion or something elements that we're going through. So that would be if it's four bytes. So we're in the tens of gigabytes, or in 10 gigabytes. I would have to check again. So basically, if we go through this row-wise, So we're in the tens of gigabyte or in 10 gigabyte.
Starting point is 00:34:25 I would have to check again. So basically, if we go through this row-wise, so just basically just accessing the data as it is laid out in memory, we're basically in 80 milliseconds here. So we're very fast going through this data. However, if we go in column-wise, we're jumping all the time, right? So we have to fast going through this data. However, if we go in column wise, we're jumping all the time. So we have to basically all the time, every axis, especially for the accesses where the
Starting point is 00:34:53 matrix is square, we're jumping very far in the area from one position to the next. And we'll basically have very, like, everything basically has to go to memory. Yes? So why doesn't the compiler see that we are just doing the sum and accessing it in an optimal fashion? I mean, for the row-wise, I would expect it to merge the two loops so that it's only one over the whole range. Which two loops you mean?
Starting point is 00:35:24 The column and row loop. So in the end, it's just one large list, right? the whole range, but for the- Which two loops you mean? Like- The column and row loop. So in the end, it's just one large list, right? Yes. The array. Yes. I would expect the compiler to merge this column-wise and row-wise iteration for the better approach,
Starting point is 00:35:37 where we just go through the array. Yes. So the- Yeah. But for the column-wise, why doesn't it just see that we are also just doing a sum over the whole? JAN-FELIX SCHWARTZMANN- Yes. So the, yeah. But for the column ones, why doesn't it just see that we are also just doing a sum over the whole array and just? JAN-FELIX SCHWARTZMANN- So the compiler
Starting point is 00:35:51 tries to optimize here, but the compiler can't see everything. So in this access pattern, the compiler doesn't see that we've, so that it could potentially rearrange the way that the area is accessed. And I'm not a compiler specialist. So there are some rules what the compiler can and cannot do. And completely change the memory layouts, for example, it cannot do.
Starting point is 00:36:24 So the memory layout needs to be the same. And then for jumping in the data, it would have to do an analysis if the data stays constant. I mean, here it maybe could find something. And say, for example, if I'm not checking the data in between, so if I'm not somehow outputting results in between here somewhere, so the sum somewhere in between, then it would actually optimize away the whole thing.
Starting point is 00:37:03 So if I'm not measuring anything or doing something here, then if we don't do something with the result, this part of the code will be optimized away. So we don't see any measurements here. However, for some reason, and I have to say I don't know why, this type of access won't be optimized by the compiler. Potentially, in the small example, it could. But of course, you can also think of more complex cases
Starting point is 00:37:40 where the way you actually access this, it's not really clear if these indexes stay constant, if there's not some side effects here. And this is where, basically, optimization quickly stops at all. If some of these would, say, for example, be some variables that you give at runtime, like this is a function
Starting point is 00:38:05 and this is a variable of the function, right? So how these are accessed. Then the compiler cannot optimize anymore here. So now I'm assuming it's the same. It's basically here, this columns and rows, these are outside of the scope, what the compiler sees here in this loop, because they come from outside here somewhere.
Starting point is 00:38:29 So that's my assumption. But it's a good question. I can ask one of my team members who's more or deeper in compilers, and maybe they have an idea where this comes from. OK, so I also wanted to show you what this looks like. So basically, I also plotted this earlier. And you can see here that we have a similar pattern. So this is on my laptop. On a different laptop, you might see
Starting point is 00:39:03 a slightly different pattern. But you can see that this column-wise traversal here really is much, much worse in these cases where we access very far memory locations. It's basically, to some extent, well, we basically break everything here. So we're accessing individual pages because all of a sudden, a single row
Starting point is 00:39:34 won't fit into a page anymore. So every single access will basically mean accessing a complete new page somewhere else in the memory. At a certain point, we'll also break the TLBs, meaning that the translation won't work anymore because we're jumping a lot in the code. And then basically this is what happens, right? So we see all of the caches don't help us anymore.
Starting point is 00:40:01 All of the, we're accessing very little data, but at a large granularity. Because in memory, we will have on this machine, for example, we'll basically access 16 kilobytes at a time, where we just want a single integer. OK. So, basically, what the CPU does for each and every step, right, so each iteration in here, with like each element in the array is basically a pointer to a memory address, right?
Starting point is 00:40:58 So, or a pointer to or has like this physical address somewhere that we then can read and That will then need to be translated etc. So and for every access the CPU checks if the line is already cached and if it's Cached then it will read the diary and the data directly from the cache and it won't need To access the lower level memory. So this will be much faster. And this is good for us, especially if we have these hot loops where we basically go over the data many times. But this is also good for us if we're accessing
Starting point is 00:41:35 at a larger granularity. So we're reading pages from memory. So in this case, 16 kilobytes. We're reading it from memory into our caches. And if we're accessing those consecutively, they will be in cache. So then we will have fast access there. If we have a cache miss, well, then we need to go to the lower level memory and read the full cache line. And then we need to evict.
Starting point is 00:42:05 Probably the cache is not empty unless we just rebooted our system. But even then, the OS will do something so the cache won't be empty. So that means we read the cache line, we evict something else, and replace it with the newly read cache line, we evict something else and replace it with a new with the newly read cache line. And then the CPU cannot do anything, at least in this loop, until the data becomes available.
Starting point is 00:42:34 And now you can imagine that this, depending on our column and row size, this can get like there's multiple levels where we basically get additional stalls or where we basically do not benefit from caching anymore. Even the column-wise traversal will benefit from caching if we read like we read something from a page again before it's evicted from the caches, right? So we have 16 kilobytes, meaning we can actually have a couple of thousands of values in our page. If our row size is not a couple of thousand values,
Starting point is 00:43:17 then basically we will have data in the same page again. So that's already something. If our page is not evicted yet while we're accessing the same row again, we're basically saving some time, even if we're not reading everything consecutively. And at a certain point, like if our rows are too large and basically we have to jump too far,
Starting point is 00:43:42 then all of that doesn't happen anymore. And we have to go through this, then all of that doesn't happen anymore. And we have to go through this, basically, all of the cache levels and read the data. And for this, then the CPU will stall. The CPU cannot do anything. And this is why this is much slower. OK. And well, there's big hit.
Starting point is 00:44:03 You already saw this in the example. There's a big difference between the cost of a cache hit and a cache miss. So this is a factor of 100 on a CPU. So meaning, or even 200, if something depends on, yeah, depends on the architecture, et cetera. But in terms of clock cycles, if we're in the register, we will be in one cycle. Level one can still be accessed in one cycle. Memory, 100, 200 cycles, something like that.
Starting point is 00:44:40 So we can also calculate cache performance by basically calculating the missed rate, or we can calculate how well our code performs in terms of cache by saying, OK, what's the fraction of memory references not found in the cache? So the number of misses divided by the number of accesses, which is then one minus the hit rate. So our hit rate, we want a very good hit rate.
Starting point is 00:45:12 And the number of, or the miss rate is basically the, what you say, the counterpart. Then we can calculate, or we might know how fast our hit time is. So that's the time to deliver a cache line from the cache to the processor. And then there's additional time, or the miss penalty
Starting point is 00:45:37 is the additional time required because of a miss. And basically, on average, and this basically depends on the program, right? So on average, the average time to access memory, considering the hits and misses, would be the hit time plus the miss rate times the miss penalty. So that's how fast our memory would be in a program. And well, this also means that if we have a good hit rate, we have, I mean, technically, the memory always has the same speed, right? But the more we can make use of the cache, the faster this will be appearing.
Starting point is 00:46:23 And well, I mean, the interesting thing is, right, something like a 99% hit rate would be twice as good as a 97% hit rate. So if we say we have a hit time of one cycle and we have 1 plus 1 minus 0.97 times 100. So basically, in 3% of the time, we have to pay this price of 100. So 3. And if we have a direct hit, we have one. So on average, we basically have four cycles per memory access. And if we have a 99% hit rate, then we pay for each hit.
Starting point is 00:47:18 We basically pay one still. But if we don't hit memory, so we don't, we have to pay this miss penalty, we still have to pay one cycle in 1% of the case, 100 cycles in 1% of the cases. So on average, we will pay two cycles for a memory access. And this is basically if your cache performs well. And if our program has very poor cache performance, like if your cache performs well and if
Starting point is 00:47:45 Well, if our program has like very poor poor cache performance like what I showed you earlier Then this could go down a lot So then we might have to pay a hundred cycles per memory access like we did here, right? So you could see the factor of 100 difference between like the nice cache access and the cache-friendly accesses and the cache-unfriendly accesses. OK.
Starting point is 00:48:14 So now let's look into a bit more detail how the cache works. So caching, in order to be fast, the overhead of caching needs to be reasonable. So we don't want to pay a lot for stuff in cache, because then it doesn't make sense anymore if we're close to memory speed again. And we also don't want a too fine granular organization, because that, again, would cost us a lot. So we basically have, we organize the cache in cache lines.
Starting point is 00:48:50 So we don't have individual bits in there, we don't have individual bytes, we have complete cache lines which are cached. This would be 64 bytes for example. And the caches are actually loaded in these complete cache lines. So rather than, so we're in the cache, while in memory, we're talking about pages, right? So complete page. In the cache or in the top level caches,
Starting point is 00:49:17 we won't have a full page. There's 46 kilobytes, but only like a subpart of the page, which would be in there, then, 64 bytes. And in Apple, this again is a bit larger, so 128 bytes, for example. And here, we basically still have, like, this will still be a sub-part of the page, right? This is a consecutive memory region that we will load into the memory. And then we have, like, our cache is organized in these lines. And we can load many of these lines,
Starting point is 00:49:50 or we can have multiple of these lines in memory. So for example, here's an example of an AMD cache. Here we would have per core level one instruction and a level one data cache. And then we have a level two cache, level three cache, which is shared across all of the cores, and then the connection to main memory. And for the level one cache, we have 64 kilobytes and 64 byte cache lines.
Starting point is 00:50:32 And this is consistent with what you would see in an Intel CPU as well. So, the caches might be a bit larger by now, but you will have the same kind of cache lines. Then we have the shared cache with one megabyte with 64-byte cache lines. And the latencies will be in level one. You have two cycles. Level two, you have seven cycles. And if we go to the main memory, so we have a miss there,
Starting point is 00:51:04 then we would have to go like 160 to 180 cycles here, for example. And so basic approximate numbers that you can basically remember maybe to some degree. I mean, it's, again, something that you just have in mind. So if we're in level one or if we're in the register, this is immediate. So the data in register is what we're working on.
Starting point is 00:51:32 Then we have level 1, 2, 2, 4, 2, 3, 4, something like that. Level 2, 7, 2, 14. The translation look-aside buffer 1 could be 12. Translation look-aside buffer on level 2, 30. Might even be a bit faster in main memory than, again, an order of magnitude slower in the 200s, something like that. And with that, let's take a quick break, five minutes,
Starting point is 00:52:04 and then we're going to talk about cache associativity.
