Hardware-Conscious Data Processing (ST 2024) - tele-TASK - CPU and Caching
Episode Date: April 23, 2024...
Transcript
Welcome to our next session on Hardware Conscious Data Processing.
We started, or Martin started, last week with the CPU architecture
and some caching, well, not so much caching.
And we'll go there in a minute.
But before we start, two announcements.
You've already seen this, so we have some code examples.
I'm going to try to polish a couple more, so there's not that much in there yet.
Whenever I have time, I'm going to clean things up and add more, so you can do more of these
little tests.
Although, this is really not hard, right?
So you can also try this yourself.
If you're curious about the performance of your laptop, do
this kind of stuff; that's actually really helpful to understand things and your system
performance. And the other thing: tonight, in our lecture series on HPI research, we will have
Prof. Arnrich talking about connected health. So if you ever wondered what connected health is and what the digital health guys are doing,
this is a good chance to get an insight into the research there.
If you're in the course, in the lecture series, well, it's anyway a good thing to go there and listen in.
So where we are is still at the CPU architecture and caching.
This will take us today, maybe even roll over a bit until tomorrow before we come to the instruction execution.
So this is also reflected here, right?
So we're in the CPU basics.
We'll talk about CPU instructions tomorrow and then start with the SIMD stuff next week.
Okay, with that, let me jump all the way to virtual memory.
Because that's where you left off last time, as far as I know.
So, who knows virtual memory?
Some degree, okay. So virtual memory is a construct, right? So if you're
programming something, your program basically doesn't have to worry about
how memory is laid out internally, right? So basically your program gets a contiguous address space
that you can work with. And in order to have this kind of idea of a contiguous address space where
you don't have to deal with other processes, etc., there's an abstraction called virtual memory in your system.
And this is basically dealt with by the CPU and the operating system.
So you don't really have to do anything about it.
And the idea is that you can have, like, your program can access more,
or all of the programs in combination can access more memory
than is actually available on
the system.
Also, even your own process can access more memory than is available on the
system, and whatever doesn't fit in memory, we're just going to
place on disk, for example.
Also depending on how you allocate memory, etc., this might be somewhere in the memory and you don't have necessarily a contiguous area.
So this is also abstracted away.
So as a programmer, you don't have to do, and as a process also,
you don't have to do the bookkeeping for the memory regions.
And the OS and the CPU are more flexible in basically
figuring out where to put the data, actually,
where to put these memory regions.
So each process has its own address space.
And this is, of course, too expensive
to do this per each individual byte,
so we'll do this in blocks.
So we have different memory blocks that then are mapped into a virtual address space. Or rather, every
process gets its virtual memory; this is split up into different blocks, which will be placed in RAM
and on disk, depending on what's currently needed. And this can be dynamically changed, so it doesn't
mean that you have one large virtual memory and some part is fixed to
disk, some part is fixed to memory; the OS has the liberty to basically move stuff around.
Also the CPU has the liberty to move stuff around while you're in ignorant bliss and just basically working away on your virtual memory. So that's the idea, right?
And just gives you basically an easy way to deal with memory. You don't have to deal with disk
necessarily, unless you explicitly want to deal with disk, right? Unless you explicitly want to store things on disk, this will basically keep just the memory part
abstracted away from you. Okay. As we said, we won't do this on an individual byte level.
I mean, we could do this at the level of individual virtual bytes, but then basically translating
these addresses would be quite costly, right?
Because you need multiple bytes in order to address something.
So then translating these addresses will add up quite quickly. So typically we will have pages in memory and on disk,
or virtual memory pages, that are all equal sized, and then we'll have one address for the page in memory or
on disk, and then we have kind of an offset within there.
And in typical systems, the pages will be, or your typical page size will be four kilobytes,
but it's also possible to work with huge pages, meaning that you will have larger contiguous
chunks in memory and on disk, which then means you don't need, like you don't have that much
fragmentation within the system.
And the memory then is split into these page frames, and within a page, so the page itself
in virtual memory and on disk, or in physical memory, let's forget about the disk,
so let's say we have everything in memory, actually.
An individual page will be laid out in exactly the same way.
So we have, say, four kilobyte pages.
Then our virtual page is four kilobytes
and has a certain order of bytes in there,
like all of the data in the virtual page.
And that will be an exact copy, basically, of the physical page.
But the virtual pages differ in how they are arranged in memory.
So we have like a virtual view on the pages.
The virtual pages will be contiguous.
But in the physical memory, they might be just all over the place,
right? Wherever there's currently space, this will be allocated. And then there needs to be a mapping.
And there's a memory management unit that translates these virtual addresses to physical
addresses, right? And then you can already see: we have individual byte addresses.
So some part of the address will be basically within the page.
Some part of the address will be an address to the page.
And this address to the page is the part that we need to translate.
So within our virtual address space, let me try this. Where do I have it? Here.
So within the virtual address space, we basically have the part within the page and we have the address to the
page. So we have an offset and a page address.
And so
we have consecutive page addresses and we have the offset.
And on current systems, the address is typically 64 bits, of which the topmost bits are unused.
Then, or let's start from the back, basically: we have an offset, which is the lowest 12 bits.
Then from bit 12, so the 13th bit, to 47,
so the next 36 bits, is basically the page address.
And yeah, that's basically the part of the address that, in virtual memory, will be
one of the consecutive page addresses, and this will be translated to the actual address
that points somewhere in physical memory, which also can change, right?
So this is again important.
If we have more memory or we need more memory than we actually have in the system, this
page will be evicted, will be somewhere on disk,
paged out, then we need to translate it to something
that points to a disk page.
And this can also change, right?
So the split between how much is offset
and how much is the page address is not fixed.
This needs to change based on the size of the page.
So meaning if we have four kilobytes,
then we can basically address this with these 12 bits.
So then we can address each individual byte with 12 bits.
If we have something like a huge page,
say two megabytes for example, then we need 21 bits for the offset, and this will be cut off
from the page address. So with that, the page address will basically be smaller and the
offset will be larger. The total part of what is used for addressing
will still remain the same, right?
So we'll still only use 48 bits in these systems,
but just the relation or the ratio will change.
And that means we will have fewer pages, but larger pages.
So all in all, we cannot address more memory.
We can just address memory in a different way.
So meaning we don't need as much page translation
if we have huge pages, while if we have smaller pages,
we're more flexible in how we lay out the pages in memory.
But we need more page translations.
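As a small illustration of this bit split (a minimal sketch with a made-up example address, not tied to any particular OS), here is how a 48-bit virtual address decomposes into page number and offset for 4 KB and 2 MB pages:

```cpp
#include <cstdint>
#include <cstdio>

// Splitting a 48-bit virtual address into virtual page number (VPN) and offset,
// once for 4 KiB pages (12 offset bits) and once for 2 MiB huge pages (21 offset bits).
int main() {
    const uint64_t vaddr = 0x00007f3a12345678ULL;             // hypothetical example address

    uint64_t offset_4k = vaddr & ((1ULL << 12) - 1);          // low 12 bits
    uint64_t vpn_4k    = (vaddr >> 12) & ((1ULL << 36) - 1);  // bits 12..47

    uint64_t offset_2m = vaddr & ((1ULL << 21) - 1);          // low 21 bits
    uint64_t vpn_2m    = (vaddr >> 21) & ((1ULL << 27) - 1);  // bits 21..47

    printf("4 KiB page: VPN = %#llx, offset = %#llx\n",
           (unsigned long long)vpn_4k, (unsigned long long)offset_4k);
    printf("2 MiB page: VPN = %#llx, offset = %#llx\n",
           (unsigned long long)vpn_2m, (unsigned long long)offset_2m);
}
```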
And these physical addresses of the virtual pages are stored in page tables.
And today, actually, it's not a single page table.
So on a modern x86 CPU, this will be four levels of page tables, each level indexed by
nine bits of the address. And this is basically how to organize this in a bit easier fashion rather than having
like one gigantic page table.
And a page table looks something like this.
So we have a validity flag and a physical address.
So we have our virtual address that basically gives us an offset
into the page table, where we will then find the physical address and of course the validity flag:
is this actually valid, or is this unused, for example. And these page tables can get large.
So if we, say, have 16 gigabytes of RAM with four-kilobyte pages and 64-bit, i.e. eight-byte, entries,
this basically gives us a 32-megabyte page table.
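Just as a quick sanity check of that arithmetic (a throwaway sketch, nothing more):

```cpp
#include <cstdio>

// Back-of-the-envelope page-table size from the lecture numbers:
// 16 GiB of RAM, 4 KiB pages, one 8-byte entry per page.
int main() {
    const unsigned long long ram_bytes   = 16ULL << 30;             // 16 GiB
    const unsigned long long page_bytes  = 4ULL << 10;              // 4 KiB
    const unsigned long long entry_bytes = 8;                        // 64-bit entry
    const unsigned long long entries     = ram_bytes / page_bytes;   // ~4 million pages
    const unsigned long long table_bytes = entries * entry_bytes;    // 32 MiB
    printf("%llu entries -> %llu MiB of page table\n", entries, table_bytes >> 20);
}
```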
All right, so you can basically calculate this yourself.
So we have eight-byte entries,
one for each four-kilobyte page. So then these get large. And if
they are so large, this also means they're not going to be in cache. So in the first-level cache,
we don't have 32 megabytes of first-level cache. So this always means:
if we want to do the address translation,
this will actually be costly, because, I mean, we might have it in a lower-
level cache, but if all the page tables are very large, this might actually
go all the way down to some kind of memory lookup.
And that's, of course, not good, right?
So I mean, we don't necessarily want to have two lookups just to find an individual page somewhere in memory or to read something from
memory, because that would mean we go through all the caches just to find an address and
then go again through the whole hierarchy just to find the actual byte or whatever we're looking for, the actual memory page.
So this means we're basically losing a lot of time if we only have the page table in memory.
So for that, there's actually an accelerator in hardware.
And that's called the translation look-aside buffer.
You might have come across this before, this TLB.
That actually exists in multiple forms in multiple parts of the hardware
just because we need this address translation everywhere.
If we have something on the GPU, for example, some memory,
or we have something in main memory, et cetera.
Whenever we need to do address translations,
we need these page tables.
And whenever we need these page tables,
they can get large, especially for large memory.
So it's going to be costly.
So we want to buffer this somehow.
And the translation look-aside buffer
implements this buffering in hardware.
So this means this is like a small buffer or small associative cache,
which is placed on the CPU in hardware in multiple cache levels.
So every cache will come with a translation look-aside buffer, where we will cache a certain amount
of these address mappings. So this looks something like this here. So we have
the cache on the side, where we have: is it valid, is it referenced, is it dirty. And then we have a tag. I'll explain this in a bit. This is basically the unique
identifier for each address. And we have the actual page
address that we're looking for, this translated part.
So we don't need to translate the offset.
We only need to translate the page address.
So this is something that we'll do. And as we'll see, this is super fast, right?
This is built in hardware, in SRAM.
So this means that within a single cache lookup,
basically while we're doing the cache lookup,
we can also translate this.
And if we have the data within this,
then there's no additional cost.
If we don't have the address translation in here,
then we need to actually go to the page table.
The page table might be in cache,
might be in a lower level cache,
might be even in memory, won't be on disk, right?
So this won't happen, but it might go all the way down to memory.
So this basically means if it's not here,
then we'll have to pay additional costs to do just the address translation.
As I said, it's an associative cache table.
I'll go into more details what that means, but it basically means we can place individual
cache lines into multiple lines in here.
So we're not, so there's the concept of a directly mapped cache where each individual
tag or address would fit exactly in one slot of our cache.
So this is basically if we're doing modulo, for example,
then we exactly know where it needs to fit.
And if we have an associative cache,
then we can place it anywhere and the cache will just look
it up everywhere at the same time.
So we'll compare with all of the different entries
at the same time and give back whatever fits.
And so the TLB stores the physical address for recently accessed virtual addresses.
And the virtual addresses, if the TLB is fully associative, will be compared
to all TLB entries at once.
And these will then be directly translated.
So this happens in hardware.
So this is basically within one operation,
and if it's not translated, then we have a miss,
then we have to go to the page table.
We do need to do the calculation with the page table.
So as an example, for an older Intel i7 CPU, we have separate TLBs.
So you remember there's two kinds of caches:
we have an instruction cache and a data cache on the first-level cache, typically.
And for the instruction TLB, we have 64 entries with an 8-way associativity.
Again, I'll explain this in a bit more detail; it means that we can put eight different entries
into the same set.
So we have grouping sets in there.
And we have a data TLB with 64 entries,
which is four-way associative.
And then on the second level cache,
so L2 cache, which is shared across instructions and data,
we have 1,024 entries, which is eight-way associative.
And say, for example, for the Mac, so for this one here,
the data-level cache... so this is all I could find out
when I checked: the data TLB on the level-one cache
has 256 entries.
And on the level-two cache, we have 3,072 entries, for example.
So basically if we have one of these addresses cached then we're going to be fast.
If we run out of this, so we don't have one of these entries in here, so we're looking
for something that's not cached, then we'll have to go to the next level of the cache,
or we might have to go to memory. And this is typically something
where we already see some performance degradation.
So typically, for each cache,
and also for memory,
we won't have enough entries
in order to fully store all of these translations.
So meaning basically for many accesses,
so if my data is larger than the cache,
for sure at a certain point,
I will also run into the trouble
of having to go to the next level
just for the page translation.
So I won't get a direct translation of the address.
If I have to go to a large part of the memory,
I first will have to go to the page table
and then will have to go to the actual entry.
And this also, of course, gets less of a problem
if we have huge pages.
So the larger our pages, the fewer entries
we need in these caches.
And with this, basically, we get fewer translation look-
aside buffer misses.
So that's one reason why we would want this,
so why we would want the larger pages. On the other hand,
of course, we have less flexibility in how we're accessing the memory.
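Just to make the mechanism concrete, here is a toy software model of that interplay (purely illustrative: a real TLB is a small associative hardware structure and a real page table is a multi-level tree, not hash maps):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model: a small direct-mapped "TLB" in front of a "page table".
// On a TLB hit the translation is essentially free; on a miss we walk the table.
struct ToyMMU {
    static constexpr int kPageBits = 12;   // 4 KiB pages
    static constexpr int kTlbSize  = 64;   // e.g. 64 entries, as in the i7 example

    struct TlbEntry { bool valid = false; uint64_t vpn = 0, pfn = 0; };
    TlbEntry tlb[kTlbSize];
    std::unordered_map<uint64_t, uint64_t> page_table;  // VPN -> physical frame number

    std::optional<uint64_t> translate(uint64_t vaddr, bool* tlb_hit) {
        uint64_t vpn    = vaddr >> kPageBits;
        uint64_t offset = vaddr & ((1ULL << kPageBits) - 1);
        TlbEntry& e = tlb[vpn % kTlbSize];                // direct-mapped for simplicity
        if (e.valid && e.vpn == vpn) {                    // TLB hit: no page-table walk needed
            *tlb_hit = true;
            return (e.pfn << kPageBits) | offset;
        }
        *tlb_hit = false;                                 // TLB miss: walk the "page table"
        auto it = page_table.find(vpn);
        if (it == page_table.end()) return std::nullopt;  // would be a page fault
        e = TlbEntry{true, vpn, it->second};              // fill the TLB entry for next time
        return (it->second << kPageBits) | offset;
    }
};
```

With huge pages, kPageBits would be 21 instead of 12, so each TLB entry covers 512 times as much memory, which is exactly the "fewer TLB misses" argument from above.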
Okay. So, with the concept of virtual memory, let's look at the caches actually, how these work in a bit more detail.
So the caches work with the principle of locality.
So the idea is that we often reuse data many times.
So if we have like a small computation,
we're counting students in this classroom,
then we have one variable that we will update many times.
So that's basically locality.
So we're using this over and over again.
So keeping this variable in the cache makes sense
because we're going to reuse it again.
And this is where caches really help. If, on the other hand, we have large amounts
of memory and we're doing just pointer chasing across that large amount of memory, then our caches
won't help much, because we're basically jumping around in the memory regions somewhere. So,
yeah, if our access pattern looks like this, then we're not gaining anything from caches.
However, there's a second part to that. So it's not only about the data, it's also about the code.
If we have like a small amount of code that we're reusing all the time. So we're running in some kind of small hot loop.
Then that means also this code will stay in the cache.
And this is why we have this instruction cache, right?
So often we don't really think about this.
If our code is very large and we're jumping in the code a lot,
so we're using many different libraries
and every individual statement in our code is a new library call,
that basically means we will jump around in memory for the different pieces of code, and we will have
to go to memory in order to load the next part of our library. But typically that's not
the case. Most of the time we're actually in some small part of the code, which we will execute over and over again, or there's at least some kind of locality. So with this, the cache will help
again. And there's two ways we can think about locality. So there's spatial locality, meaning
we're accessing data that's spatially close. So within a page, for example, or consecutive pages that are near to each
other, or code that's basically a small loop that we're reusing, and there's
temporal locality, so we reuse certain variables, right? So I'm updating the
count of students every now and then, or not every now and then. I'm updating this frequently. And I'm calling
the same function over and over again. This means it makes sense to
cache this. And this is a small
example where we're calculating exactly such a sum.
So we have a sum variable and then we have
a for loop where we go over all students and count them up.
And hopefully our array of students is laid out the way we go through this array of students.
Our array A is laid out contiguously in memory.
So that basically means we have the spatial locality.
We're going through the array one by one.
So every next access will be close to the previous access.
And all of the instructions in our code are accessed in sequence.
So there is no jumps in the code.
We're not jumping across large code bases back and forth
or calling some other kind of variables or functions in here.
And we have this temporal locality.
So the temporal locality here is the sum variable
is called over and over again.
So this is, I mean, there's no spatial locality
with the same variable, but it's called each and every iteration,
and the loop is called in every iteration, right?
So we also have spatial locality in the code.
And that's important, right?
So if we get this, then our caches are super efficient.
So basically this small code example, this kind of stuff,
this will all be in first level cache.
And the variables, so something like the sum, this will even reside in a register if this
is executed efficiently.
This won't need to go out of the register ever because not much else is going on.
So this kind of code can be super fast.
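A minimal sketch of the kind of loop described here (my own reconstruction, not the slide's exact code) looks like this; the array scan gives spatial locality and the sum variable gives temporal locality:

```cpp
#include <cstdint>
#include <vector>

// Sequential scan: stride-1 accesses (spatial locality), and `sum` is touched in
// every iteration (temporal locality); with optimization it typically stays in a register.
int64_t count_points(const std::vector<int32_t>& students) {
    int64_t sum = 0;
    for (size_t i = 0; i < students.size(); ++i) {
        sum += students[i];   // the next element is usually in the same cache line
    }
    return sum;
}
```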
If we convolute this with
jumping through large amounts of data because our data access is stupid, if we have many
kind of libraries around this which are called all the time, then all of a sudden this will
be much slower, although the same kind of computation is done. So it's really, are we using the cache well or not?
So let me give you an example.
I think we've seen that before.
So this is kind of this micro-benchmark example,
but we'll do this also now.
So we have a two-dimensional array,
or rather I'm simulating a two-dimensional array with a one-dimensional
array, so I don't have to deal with the actual memory layout that C or C++ gives me.
But I'm saying I have a row-major-order layout. So say the array is two-dimensional,
we have rows and columns, and the rows are stored contiguously, and the columns are basically
jumps across the array. If we actually used a two-dimensional array, then
you basically have to check what the programming language does. There are programming
languages or compilers that translate into column-major, meaning that columns will be laid out
contiguously in memory. And we do the same thing. So we have row-major order. We're basically
having the rows and then the columns, and we're accessing in a row based fashion or row major
order in this loop. So for every row, we go through all the columns.
So we're taking row one, and then we're jumping all the columns.
Then we take row two, we jump through all the columns, et cetera.
So that's one way of accessing this.
Or we do it the other way around, which is, in this case,
where we're accessing for each column, we're accessing all rows.
So we're accessing the first column of the first row,
then the first column of the second row, and so on.
So we're basically jumping across the array.
So it's not a stride-one pattern, but another kind of stride.
Depends on the number of columns in each row.
And if we do this, you will see that basically the execution
time for this is very different in these two loops,
although we do the exact same thing.
And this is only due to caching.
So this is only due to how we load the data from memory and
how then we can reuse whatever is in the caches. I mean only due to caching is a bit much.
There's a couple of things in here. So the prefetcher will help us, etc. So if we're smart, if we're using this,
there's a lot of hardware
and concepts that help us. If we don't, then we're basically
fighting against what the hardware actually wants to do. And so maybe let's try this briefly. So
I'll just show you what this looks like. So I have this here. Let me see. As you can see, this is basically exactly the same program.
So we have an array with, I can't fully read this, 100 million entries.
And we have 10,000 rows and 10,000 columns.
That's basically a square array.
And at least that's my starting point.
And then we go from that to more extreme setups where we basically say, okay, we have 10 rows and
1 million or 10 million columns, and vice versa. So we're going across these different layouts, right? And then we fill this with random numbers
and then basically do an initial run in order to make sure the caches are in a comparable state, right? So we have
kind of the same setup for both runs. And then here I have some extra code: I'm doing multiple iterations where I change the ratio of columns to rows.
So in the initial iteration, I basically say,
we have basically 10 rows.
And so this is basically power of 10.
We're starting with zero. so we're starting with 1.
We have 10 rows, and then 10 million columns,
and then in the next one will be 100 rows, and a million
columns, and so on and so forth.
And then we just do the same two sums.
One time we go column-wise, one time we start with row-wise, and we're
doing the kind of measurements, we're changing the sum to zero again, and in the end I'm
outputting all of the individual timings. And so one thing, I mean, the compiler will
try to optimize here and there.
And this is also why we need to basically have some outputs,
why we actually need to do some actual additions.
Because if we don't do this, if we're not using the results of the additions, the compiler
will say, well, this is dead code,
I'm going to clean this out.
And our program will be super fast.
Okay. And this, maybe let me show you this.
So I'm also going to do it on the command line
because there it actually makes some difference.
Let's clear this first.
Nope, can't read this here.
Okay, so... So now you should be able to see this somehow.
So this is the program.
I've compiled it with optimization flags
so the compiler can actually do something.
And first we need to initialize the array.
I mean, it takes a few seconds, basically.
So we're starting with the one where we have 10 rows
and 10 million columns,
and then 100 rows, 1,000 rows, 10,000 rows.
And now we're getting into the cases
where we're basically really jumping a lot into memory.
So with the column-wise order
or the column-oriented traversal,
if we have 10, if our rows have length of,
what is it, 10 million or something,
then we're jumping really far,
which basically means we actually have to go all the way
through the cache, through the memory
in order to find something.
So any kind of caching will never help us
because we're always jumping way too far.
So here you can see in the row-wise traversal,
we have a fairly stable throughput.
If the rows are very long, all of a sudden,
it somehow gets a bit slower.
But in the column-wise traversal,
if the rows are very long or we have this kind of very wide jumps in the memory, then let me see this.
So then here, the timing is actually much worse.
This is basically two orders of magnitude at least slower here.
And this is mainly due to caching, or not caching
in this case, because we always have
to go to memory with each and every individual access.
While in the row-wise traversal, we'll
still have to go to memory.
The data is too much.
This is a couple, I don't remember, a couple of gigabytes.
It doesn't fit into the caches.
So it is in memory. Basically, we have to pull it up
into the caches because, I mean, we can only work with data that's actually in the caches
eventually or actually in the end ends up in the registers. But this is happening for every individual access.
So this is why this is slow.
So this is the code.
You can basically check it out.
You can try it out yourself.
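For reference, a minimal, simplified sketch of the two traversal orders (my own reconstruction, not the exact course code from the repository) could look like this; remember to actually use the returned sums, e.g. print them, so the compiler cannot drop the loops as dead code:

```cpp
#include <cstdint>
#include <vector>

// ROWS x COLS matrix stored row-major in one flat vector, summed in two orders.
int64_t sum_row_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            sum += a[r * cols + c];   // stride 1: walks each row contiguously
    return sum;
}

int64_t sum_column_wise(const std::vector<int32_t>& a, size_t rows, size_t cols) {
    int64_t sum = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            sum += a[r * cols + c];   // stride of `cols` elements: a new cache line (or page) almost every access
    return sum;
}
```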
So what happens actually here?
So on each individual memory access, the CPU will check if the respective cache line is already
cached.
So again, this is hardware.
So this happens without you doing anything or even without the OS doing anything.
So the CPU will just check, do I have this in cache?
If I have it in cache, nice.
Then basically we'll read the data from cache.
We don't need to go to any lower level memory
or any lower level cache.
We're just gonna use this.
It will be much faster.
We can continue with our computation right away.
If we have a cache miss, well, tough luck, right?
So we'll have to go to the next lower level in memory.
Maybe there's also a cache miss, so we
might have to go all the way down to memory.
Then we have to evict some kind of cached block
and replace it with the newly read cache line.
And the CPU will stall until the data becomes available.
If there's anything else to do, the CPU
can do some out-of-order execution.
It will try to do something else in the meantime.
But if we're only computing this kind of sum,
well, then the CPU just has to wait until the memory becomes
available.
And you see that then this takes a lot more time.
And these caches typically are inclusive.
This is also something important to note, right?
So basically, whatever is in level 1 cache
is also in level 2 cache.
But level 2 cache will just have a bit more.
So we can find this.
And so then if we don't have it in L1 cache,
we might find it in L2 cache.
So that's basically the hierarchy that we have to go through anyway.
Okay, and we've seen this in the example,
but we can also calculate this to some degree.
It's also logical, right, that there's a big difference between the cost of a cache hit and a cache miss.
And we've seen there's multiple levels at which we can hit the cache
or not.
So the first thing would even be just the address translation.
So if something is in our cache, well, then we don't really need to do the address translation.
We can work with this directly.
If we need to go through to the memory, so we don't have this, if we have to go to memory
and we don't even have the address translation there, then first, before actually going to
memory, we will have to load the page table.
So that means basically an additional cache miss.
So two cache misses, maybe even through the whole hierarchy for just a single access.
And the difference between having something in the first-level cache or in a register
and having to go to memory is a factor of 100.
So between having your data in the register
and going to memory, this for sure
is a factor of 100 or even more for a single memory access.
And the miss rate, you can basically
calculate the number of misses divided
by the number of accesses.
So that's also the 1 minus the hit rate.
And then we can calculate, so what's the hit time?
So how much time does it take to deliver a cache line
from the cache?
And there's a certain miss penalty,
which is basically how much does it cost us if we actually have a miss.
So then with this, with these two things, we can have an average time to access memory,
considering our workload, depending on the hit rate and the hit time or the miss rate and the
miss penalty plus the hit time. So if we have a 99% hit rate in our cache,
that's actually twice as good as for example,
a 97% hit rate.
So if we have a hit time of one cycle,
so if our data is in the first-level cache,
it will take us one cycle to read it into the register,
so a single cycle. If we go all the way down
to memory, it's 100 cycles, probably even a bit more. If we have a 97% hit rate, then we have one
cycle for the hit, plus (1 minus 0.97) times 100, so a 3% miss rate, basically.
And this means we have 1 plus 3, or 4 cycles,
on average for each individual access.
Because for every 97 hits, we basically have three misses.
So that means, in total, about four cycles on average for a memory access.
If we have a 99% hit rate, because of this factor of 100,
this means on average our memory access
will take two cycles.
So this is just something to keep in mind.
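Expressed as the usual average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty, the two cases from above work out like this (a tiny throwaway calculation with the lecture's numbers):

```cpp
#include <cstdio>

// Average memory access time: hit_time + miss_rate * miss_penalty.
double amat(double hit_rate, double hit_time, double miss_penalty) {
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main() {
    printf("97%% hit rate: %.1f cycles per access\n", amat(0.97, 1.0, 100.0)); // 1 + 0.03*100 = 4
    printf("99%% hit rate: %.1f cycles per access\n", amat(0.99, 1.0, 100.0)); // 1 + 0.01*100 = 2
}
```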
So even if we have very effective caches,
just because the difference is so large,
this actually costs us a
lot, right? And these are still, let's say, optimistic numbers, 100 clock cycles for an individual memory
access. Okay, so, yeah, let's talk about the cache internals. How does this work internally?
So the overhead of caching needs to be reasonable.
Also, we cannot use insane amounts of chip space for the caches.
This is also why you saw these caches are not really huge.
So in order to make this efficient, we are organizing the caches in cache lines.
And this is typically on most CPUs today 64 bytes.
On these ARM chips in M1, for example, it's 128 bytes.
And so there we're assuming that if we're accessing some kind of byte somewhere,
most likely we'll access more bytes around this.
And so this is why it actually makes sense to load more than just a single byte into the cache.
Again, it will be costly to do the addressing, etc., all the data movement.
So we're loading more than just a very tiny fraction into the cache.
And so there's no way to just access or load into the caches a single byte.
We'll always go with 64 bytes
or on this kind of machine with 128 bytes.
If we're only using a single byte or even less,
then the rest of the data transfer will be useless.
And practically, we're not going to be
able to move something faster by just moving less.
But the other way around, we might
be more efficient by using whatever we're moving
through the cache lines.
So if we can utilize the whole cache line,
that's basically cache efficiency.
So if the way we're accessing this data
is spatially local,
our caches will be more efficient,
and our program will be more efficient.
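A common way to see this cache-line effect in isolation is a stride experiment (a hedged micro-benchmark sketch, with the usual caveat that the prefetcher also plays a role): with a stride of 1 you use all 64 bytes of every loaded line, with a stride of 64 you pay a full cache-line transfer for a single byte.

```cpp
#include <cstdint>
#include <vector>

// Touch one byte every `stride` bytes. With stride 1 the whole 64-byte cache line
// is used; with stride 64 (or 128 on the M1) every access pulls in a new line for one byte.
int64_t touch_with_stride(const std::vector<uint8_t>& buf, size_t stride) {
    int64_t sum = 0;
    for (size_t i = 0; i < buf.size(); i += stride)
        sum += buf[i];
    return sum;   // use the result so the loop is not optimized away
}
```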
And the caches are organized in hierarchies,
so I think we've seen that to some degree.
So this is an older AMD processor, but it looks similar in other processors.
So we have the individual cores.
Then we have the level one caches, which are split into instruction and data caches.
Then level two caches, level three caches, main memory.
Some CPUs, like the one I have on the first slide, on the title slide,
might even have four levels of caches.
Some only have two levels of caches.
So this is basically architectural choices.
Depending on the difference of access speeds across different tiers,
it might make sense to add additional cache layers
or to leave some out.
So for the L1 cache, we have the separate caches with each
having 64 kilobytes.
L2 cache is shared with one megabyte.
And then the L1 hit latency might be two cycles, the L2 hit latency
in this case seven cycles, and the L2 miss latency is like 160 to 180 cycles. These are old numbers, but
they are not so far off; these ratios don't change that much. So, approximate numbers regarding these cache latencies:
for registers, we do this within the cycle, basically.
Level 1: 2 to 4 cycles, depending.
Level 2: 7 to 14, something like this.
And the translation look-aside buffer
has latencies that should be
hidden within the access of the cache.
But if we have to go further down,
then it basically gets more expensive.
So with this, an overview of caches up to here.
I'm going to talk about cache associativity in a bit, but do we have questions so far?
Yes?
Yes, perhaps I misheard that, but I think you said that the L1, 2, 3 caches are redundant,
so that everything that's in the L1 cache is also in the L2.
Yes.
Why is that redundancy?
It's basically, I mean, it doesn't really make a huge difference
if you add more, and we will basically always have to go through the layers anyway,
right?
So if you would split this up, the management
would get even much more expensive.
So we're just propagating data up the cache level.
So everything that you load will be in every cache level.
Only if it's in the utmost cache level evicted,
then it's basically missing there.
It's still on the next level.
So that's basically the way it works.
So we're not loading things first in the second,
like in L2 cache, and then assume maybe it
will be used or not.
But we're loading it in all of the caches,
and it will be evicted from the higher level caches earlier
because they are smaller.
Okay, thanks.
You're welcome.
Yes?
When the CPU tries to access some data based on the virtual address, I guess, does
it have some kind of table that dictates where in the cache this data might be or might not
be?
And to decide whether it has to go to main memory to fetch the data?
So the question is, how does the CPU
know where to look up if things are in the cache
or not in the cache or something?
That's basically the question, right?
It will just look it up, right?
So the CPU doesn't know what's in the cache
and what's not in the cache. The CPU will check if it's in the cache, and if it's
not, then it will go to the next level. And regarding the translation look-
aside buffer, it will actually, at the same time, check: do I have
this in the translation look-aside buffer, or do I need to read
this from the page tables?
It happens simultaneously.
If I have it in the translation look-aside buffer, I don't need to do anything anymore.
I'm just going to work with this.
So memory accesses will always go through the cache hierarchy first.
And if we don't find it, then we actually
need to do a memory read or memory load.
So let's continue.
And I'll quickly, well, not so quickly,
tell you about cache associativity.
That's more
because it's interesting also to know how these caches work and maybe explain
some of the behaviors of the caches. So the associativity of a cache
basically explains where an individual cache line can be placed in the cache. So a fully associative cache means that we can place a cache line
into anywhere in the cache.
So we have a certain number of cache lines that fit into the cache.
And the question is, where can I put it?
In a fully associative cache, I can put it anywhere.
And a directly mapped cache means it's not associative. It means that we
have only one location where each individual cache line can be placed. And you can think of this
basically as taking the address modulo the number of slots; then we know exactly where this needs to go,
right? So if we have five cache lines and our address is six, then this will go into position one in the
cache. And this means whatever was there before will need to be replaced. And then we have a set
associative cache and that kind of tries to combine the two. Fully associative cache means, well, we
need a lot of management in order
to check everything at the same time.
A directly mapped cache is much easier.
We know exactly where to look for a certain cache line
and where to place a certain cache line.
And in order to get kind of part of both worlds or something
in between, we have a set associative cache,
which is n-way associative, which means a block can be placed anywhere within a set, a set of n items
basically. So a four-way associative cache means we have four different places where each individual cache line can go.
And so we're not basically fighting
for the exact same place in the cache
with all of the addresses
which have the same modulo, for example,
but we can replace within the set in some kind of round-robin fashion
or something like this.
And next, the question is basically: how do we figure out where we would put this in the cache,
right? So if it's not fully associative, we have to somehow identify this. And for this,
we're splitting up the address again. So we have the offset. The offset means basically where within the page we find our address
or our data. And then we have the block address. As we already know, this is basically
or the page address. This means where do we have our individual page. And now this block address, or let's say for the cache line size,
basically the cache line address will again be split up
into a tag and to a set index.
So the set index means basically which of the sets are we going to.
If we have a directly mapped cache,
this means the whole block address is basically the set index.
If we have this kind of N-way associative cache,
then we need to identify which sets this goes to.
So this is the set index.
And then we still want to identify which cache line
do we actually mean, and that's then the tag.
And of course we could also do this the other way around, right?
We could have the set index first and then the tag later.
If we're doing it this way, then consecutive addresses will basically be placed in different parts of the cache,
in different sets.
And this means, if we're reading local data or we have some data locality in our code,
they won't kick each other out of the cache all the time.
If our set index were at the beginning here, right,
so if the set index were here, this would mean that if we're reading a number of consecutive addresses in memory,
then they would quickly all have the same set index again.
And even if we have a larger set index or a larger set,
they will basically kick each other out,
and we only use a smaller portion of the cache.
So that's why we have the set index further back in the address, next to the offset.
So hopefully, if we continue working with a continuous block of data,
then we stretch this out across a larger share of the cache.
Of course, it might also be stupid, right?
So if we're unlucky, then we have some kind of access pattern
where we always have the same kind of set index,
and we will again only be in a single set of the cache
and not reuse the whole cache.
But that's much more unlikely than the other way around.
Okay, and for this, so basically in order to address this, we have the set index,
we have the tag, which identifies the cache line, and then we have the data. So within each set, with the tag, we know exactly:
is this basically our cache line or not?
And if it is the cache line, then we have the actual cache line data in our cache.
Okay, if we have a fully associative cache,
then we need the whole cache line address basically to find the
correct cache line, and we can place the cache line in any
cache line slot of the cache, in any part of the cache. So this basically means we
can use any kind of block replacement strategy,
but it doesn't really scale, right?
So the idea is if we do this lookup, it's in hardware,
and we'll have to look it up,
like we want to do this simultaneously for all entries.
We don't want to basically have like an additional operation where we have to iterate
somehow through the cache and figure out whether this is somewhere or not. We want to directly know: if we
ask the cache whether this cache line is in here or not, this should be one single operation. And this can
be done in hardware, but if we do this across thousands of entries, then it will be expensive, because we need a lot of circuitry to build this.
So if we think about four megabyte cache, for example,
so this might be a level two cache kind of size,
then 64 byte cache lines, then we have 65,000 cache lines.
So comparing all of those in a single instance
means we need to do a lot of comparisons very quickly.
But for small caches, so very small caches,
this can still be done.
So for 100 entries or something like this,
or 64 entries, as we saw earlier,
something like this, we can build this in hardware.
So this can be directly translated.
Then we have the idea of a non-associative, i.e. directly mapped, cache.
So this basically means, as I said, we have something like a modulo operation.
And for each cache line, based on the block address, we know exactly where it needs to go.
So our cache has a certain size.
We're doing modulo,
and we put the data exactly there.
If there's another cache line that has the same modulo
address, that basically means it will replace
exactly this cache line.
Again, this might work well, right?
Because consecutive addresses
will basically be placed in new cache lines.
So basically if I'm caching line 12 here,
that will go into cache slot four,
and line 13 will go into slot five, right?
So this works for locality,
but if I have a stupid pattern
where I'm basically hopping somewhere,
this might again be not good
because I'm replacing whatever I cached earlier before
all the time, right?
And then the way the lookup works:
we have this cache line, we have the tag,
which is basically the part that's not the offset, that's basically unique for each cache
line within our cache. And with that, we have the data to compare against. If we have a match, we have a success and directly have the data in our CPU.
If we don't, then we basically have to request the data again.
So then, or basically, it's a single operation, right?
So we check, is the tag correct?
If it's a success, yes.
Then we'll use the data.
If it's not a success, then we still get the data,
but we're not going to read it.
We're not going to use it.
And we'll have to go to memory or to the next level cache
basically.
Bit more complicated is then the set associative.
I mean, just more complicated in terms of circuitry.
So then we basically say we have these sets.
And within these sets, we can basically place our cache line anywhere.
And I explained the properties already.
In terms of circuitry, this basically means we're comparing with many cache lines,
so the complete set basically at the same time.
So we're loading all of the tags.
Basically, we compare all of the tags in hardware with the tag that we're requesting, out of the set.
Say two-way associative here:
this would mean we have two of these comparison instances;
this here would be four-way associative.
So we take the tag, and we check against all of the stored tags:
is one of the tags the right tag?
If yes, then basically this tag will be going
into the multiplexer or demultiplexer.
And if we have success, we'll at the same time also have to request the data from the cache.
And we can directly work with it.
If it's not a success, we'll have to load.
And while loading, of course, we remember we'll then also have to replace whatever was before in the cache or at least one cache line.
And that will go into one of these.
And there's different ways we can replace.
I'll talk about this in a bit.
And so let me give you an example.
Again, older Intel core, so we have a total cache size of 4 megabytes.
I don't remember which should be probably level 2 cache.
The cache line size is 64 bytes.
We have a 6-bit offset in order to address these 64 bytes; within a cache line, we just have the offset, nothing more to address.
We have 65,000 cache lines in total, because we're splitting
the 4-megabyte cache into these individual cache lines. So, 65,000 entries that we can have in the cache.
And the cache is 16-way set associative. So, we divide the 65,000 by 16, and we have 4,096
individual sets of each 16 cache lines.
And for that, we basically need a 12-bit set index. So in order to uniquely identify each
of the sets, we have the 12-bit set index,
which leaves us with an 18-bit tag that we will use,
within a set, to uniquely identify each of the cache lines.
So, in total, we only need 36-bit addresses in this case, so we can have a 64-gigabyte address space.
In this CPU; other CPUs might have more.
Okay, so the tag is the unique identifier within the set,
and the set index basically gives us the right index.
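To make that split tangible, here is a tiny sketch that decomposes an arbitrary, hypothetical physical address with exactly these parameters: 64-byte lines, 4,096 sets, 16 ways:

```cpp
#include <cstdint>
#include <cstdio>

// Address split for the example above: 4 MiB, 16-way set-associative, 64-byte lines
// -> 6 offset bits, 12 set-index bits (4096 sets), and the rest is the tag.
int main() {
    const uint64_t addr = 0x3f2a4c0ULL;       // hypothetical address

    uint64_t offset = addr & 0x3F;            // low 6 bits: byte within the cache line
    uint64_t set    = (addr >> 6) & 0xFFF;    // next 12 bits: which of the 4096 sets
    uint64_t tag    = addr >> 18;             // remaining bits: compared against the 16 tags in that set

    printf("offset=%llu set=%llu tag=%#llx\n",
           (unsigned long long)offset, (unsigned long long)set, (unsigned long long)tag);
}
```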
And this is the same kind of hierarchy
that we saw earlier with the virtual memory,
but it's differently spaced again, right?
So the virtual memory, we have the page size,
or the virtual memory page size as an offset.
Here we have the cache line as an offset.
So it's the same kind of addresses,
but we're splitting it up in different ways.
And there's different way, or this is basically,
if we're doing like a compilation workloads or some kind of GCC workload,
then we can see that the associativity actually helps us by reducing the cache misses, especially for smaller caches.
So we have a smaller cache and we have these certain kind of patterns in our code,
then associativity will somehow reduce the effect of these patterns.
So then we get less cache misses by having more associativity
because we have more flexible page replacement.
Or block replacement or cache line replacement. Or block replacement or
cache line replacement.
Whenever we
pull in a cache line,
we have to replace an existing
cache line, and there's different
ways of doing this.
A common way of
buffer management also is least
recently used.
Whatever hasn't been used for the longest time will have to go.
That takes some time to manage this, right?
So it's a bit more complicated.
You can do this by ordering,
but you don't want to copy data in the cache around all the time.
This will be costly.
So then another way would be first in, first out, or FIFO.
So basically, we're just checking whichever
was added in first or where we added the last time.
So we'll just go round robin, basically, replacing.
This often behaves similar to LRU
and is much easier to implement.
In some cases, it might not be great.
But still, in hardware with small caches,
might still be good enough.
Another idea is randomness.
Often, actually, it works quite well,
especially if we're not super small
or we don't have very clear patterns
that we could do much better with other replacement strategies.
And that's also simple to implement in hardware, at least pseudo-randomness.
And then another approximation might be not-most-recently-used.
So that's kind of also an approximation.
So we're just keeping in mind what we used last.
And anything other than that we can replace.
So we're just not replacing the one that we actually just
accessed.
So that might also be a good idea.
But this needs to be done in hardware and needs to be fast.
So we also don't know what actually is happening,
because that's the CPU.
So there might be an option, or you
could think of an experiment how you can figure this out,
if you know the set associativity, et cetera.
So you can basically tweak things maybe to get some ideas about the replacement, but
the hardware vendors will probably use something that's very easy to implement and gives you
a good way of doing this.
Or basically it's a good trade-off of performance
versus number of circuits you need for this
and overhead that it generates while doing the replacement.
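As a purely illustrative toy (the actual policies in hardware are undocumented, as said above), the FIFO/round-robin idea within one set can be modeled like this:

```cpp
#include <array>
#include <cstdint>

// Toy model of round-robin (FIFO-like) replacement inside one 4-way set.
struct FourWaySet {
    std::array<uint64_t, 4> tags{};   // stored tags
    std::array<bool, 4> valid{};      // validity flags
    int next_victim = 0;              // round-robin replacement pointer

    // Returns true on a hit; on a miss, installs the tag into the next victim slot.
    bool access(uint64_t tag) {
        for (int i = 0; i < 4; ++i)
            if (valid[i] && tags[i] == tag) return true;
        tags[next_victim]  = tag;
        valid[next_victim] = true;
        next_victim = (next_victim + 1) % 4;   // advance regardless of what was used recently
        return false;
    }
};
```

An LRU variant would instead track per-slot recency on every hit, which is exactly the extra bookkeeping the lecture mentions.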
Okay, so, so far, we've only talked about reads, right?
So, we've only talked about what do we do
if we want to access the data, read the data, right?
So, we've not really talked about writes.
And there, the caches actually give us some extra level of thinking about what we need to do. And especially, I mean, there's two things you have to consider here, right?
So we might have a multi-threaded, or that's the main thing actually,
we might have a multi-threaded program.
So all of a sudden, I might have stuff in my individual cache,
but some other CPU also wants to access this.
So then this situation I need to handle.
So I have to think about what do I do in a write?
How will this be propagated into memory eventually again?
Also, just in general, I have to think about what happens if I write something.
Does it stay in cache or does it go all the way back to memory?
And there's basically both ways or an option.
So there's write through means we're really
just directly writing the data all the way down
to the memory, also into the cache, of course.
We're not gonna just flush it out of the cache
just because we've written it,
because we might just use it again.
But we also want to have whatever we did in memory.
So that means we'll go all the way through until we hit memory
in order to basically make sure that the data is written.
This makes data coherency much easier,
because we'll always have the right data, the correct data
in memory, but it's also slow, because it basically
stalls the CPU.
We might use something like a write buffer, so we're buffering whatever happens and then
combine writes somehow, but it's the slower way to go.
And the other way to go is basically to say we're writing back, meaning we're writing into the
cache, and we somehow have to mark this, right? So we're marking that this cache line that
we've written to is dirty now. That means what we have in cache is different from what is in
memory, and that also means eventually this will have to be cleaned up again somehow.
So we have something in cache that deviates from whatever we have in memory.
With this, we reduce the amount of traffic to the lower level memory.
But we need to somehow do something whenever we evict such a cache line. We need to make sure that it's actually written back into a memory
through all of the cache hierarchies, right?
And whenever somebody else wants to access the same cache line,
we also need to make sure that they see whatever we've updated
and now work on the same cache line with something else.
But we'll come to this later when it comes to NUMA,
so more or not only NUMA, to parallelism whenever
we have to think about what happens if multiple threads
access the same kind of data.
And modern processors, because of the lower traffic
and higher efficiency, implement write-back today.
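Conceptually, the dirty-bit bookkeeping per cache line looks roughly like this (an illustration of the idea, not any particular CPU's implementation):

```cpp
#include <cstdint>

// Write-back behaviour of a single cache line, conceptually.
struct CacheLine {
    uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;      // set on write: the line now differs from memory
    uint8_t data[64] = {};
};

// Write hit: update only the cached copy and mark it dirty; memory is not touched.
void write_byte(CacheLine& line, unsigned offset, uint8_t value) {
    line.data[offset % 64] = value;
    line.dirty = true;
}

// On eviction, a dirty line must first be written back to the next level / memory.
bool needs_writeback_on_evict(const CacheLine& line) {
    return line.valid && line.dirty;
}
```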
Okay.
So with this, let me give you two more examples
of what caches actually look like.
So this is an older, we've had this before, older CPU.
So you see what this is actually on the chip, right? So we see there are six cores
on the chip. We have a shared level three cache and this is two times two centimeters,
for example, right? So this is basically, we have the six individual cores. Each of these cores will
have level one and level two cache, two level one caches, one for instruction, one for data,
then the level 2 cache,
and then we have this large shared level 3 cache,
which is accessed by all of the cores,
where they basically can all read.
And then we have the memory controller,
so basically our access from the core will be to level 1 cache,
level 2 cache, level 3 cache.
Still not there?
Well, memory controller, let's go to the memory.
And this will be outside of the core in the DIMM, then.
Yes?
What happens to the two dead cores that are still visible on the graphic?
So the question is what happens to these cores or that core.
So this is basically, I actually don't know if these are cores
or other kind of helper structures on this kind of chip.
There is different ways.
You could use this for other kind of, let's say, interconnect
or something
like helper structures.
What this actually is in here, I don't know.
These might be extra cores also.
And we're only using six of those.
So sometimes, CPU vendors also say, OK,
this die didn't really work that well.
So we have a small error here.
We're just going to switch off this core
and resell it as a five-core CPU, right?
I mean, not five cores; typically it would be like four,
eight, 16, something like this.
And the larger the dies get, the more error-prone they are.
So then for the high-core-count CPUs, everything
needs to be correct.
For the lower or cheaper ones, we switch off
whatever is incorrect on these cores.
So that's one way to deal with this to some degree.
And some manufacturers also build very large dies,
and then whenever there's something broken on there
or larger defects, you have to trash it.
Then you basically, there's a certain amount of chips
that are not going to work because of that.
And now, to see a more modern die,
so this is, well, it's not even super modern, right?
So there's the M3 nowadays, which, I mean, doesn't change much.
It's just more of the same kind of die replicated.
But this is what this machine has.
So you can see there's two kinds of cores.
So we have the performance cores here, So we have the performance cores here and we have the
efficiency cores here. Both have an L1 and an L2 cache. So we have the L1 cache that's, I mean,
there I can only assume, so this might be the cache structure and this might be the ALU, etc. functional units. I don't know, right?
So, and then we have the shared cache. So, this is the shared L2 cache for the efficiency cores
and then we have a shared system level cache. So, that's basically an L3 cache, but it's not only between the cores, the compute cores,
but it's also across the GPU cores and the neural engines
or inference cores basically here.
And then we have also memory controllers.
Right here, you can see the memory controllers.
These are basically memory channels
where then these would go basically connect to the individual DIMMs and then we'll have some PCI
Express here somewhere and some other helper structures. So this is
how this basically ends up laid out, and you can see that here we have a fairly large
level one instruction cache. So this basically means if this CPU has to work with more
complicated code, it will be more efficient than something
whoops, I don't have the numbers here, but something
where we only have like a 64 kilobyte L1 instruction cache.
So here we can basically have three times as much code within the
level one instruction cache. And also larger level one data cache. So this is basically where they
made some decisions in order to have a better cache and efficiency up here. And then 12 megabytes shared cache and a 16 megabyte system level cache,
which is shared across all of the individual units here.
Okay, so these are numbers on this chip.
So we've implemented a pointer chasing benchmark at some point.
And well, basically, we take some amount of memory, from one kilobyte up to one gigabyte, and we're doing
pointer chasing in it. So we have random pointers across the whole
address space within this buffer, pointers that
just point to a next address: we're looking up
the pointer and then go to whatever address is within this pointer, right? So we just do
random hops in this memory. If we do this in one kilobyte, that will be in first-level cache, right?
So then we can see each individual access will be one nanosecond. So this is fairly fast.
I mean, we load the pointer.
We translate it to an address.
And we go to the next step.
Then for 8 kilobytes, that's still level 1 cache.
That's still one nanosecond.
If we go 64 kilobytes, let me see, what's level one cache?
64 kilobytes is still within level one cache.
So we have 128 kilobytes level one cache.
So we're still in one nanosecond.
As soon as we're moving out of the level one cache,
right, so we're hitting level two cache, this will be at 256 kilobytes. This is twice as much
as what fits into the level one cache, and we will see, all of a sudden, oops, now we're at three
nanoseconds, right? So this is basically first level of the cache hierarchy missed. So there we need to go to the next level.
Then we have one megabyte.
It basically hits us more, right?
So we said the L2 cache,
the shared L2 cache, is 12 megabytes
for the performance cores.
So up to 16, or rather eight megabytes, we're still in level two.
So we're seeing a larger portion of accesses
being in level two than in level one.
So we said, let me see again,
128 kilobytes of level one cache.
So some of these accesses, roughly one eighth of them in the 1 megabyte case,
will still be in level 1 cache.
So they still get 1 nanosecond.
The remaining seven eighths of the 1 megabyte
will go to the level 2 cache.
With 8 megabytes, we're completely in level two cache.
But still, some of them, of course, are level one cache.
16 megabytes, we're hitting, like we're
going out of level two cache into level three cache.
And then once we're at 64 megabytes,
this is even beyond the shared cache.
So there many of the accesses will then already go to memory and there we can see, well, this
is already this factor of 100 with a gigabyte, we're getting close to almost all of the accesses
being in main memory.
So you can see we have this factor of 120 or so in comparison to whatever we have in the cache,
the highest level cache and the lowest level cache.
And if we want it, we could basically be even a bit more fine-grained
and at a certain point within these, basically probably around here, we could see where we start missing TLB,
or where we see TLB misses, basically.
So the TLB, level 1 TLB, will have 256 entries.
So we cannot calculate quickly enough
where we would see this, but probably already somewhere here.
And level 2 TLB has 3,000 entries, so this would also be somewhere soon.
Because basically that means whenever we're hopping to memory.
So we're not using cache in any way smartly, right, because we're jumping far in memory. So this is where this would give us
yet another small dent in our performance just because of the TLB misses.
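The core of such a pointer-chasing loop can be sketched like this (my own minimal version, not the course benchmark itself); the key is that every load depends on the previous one, so the latency of whichever level holds the buffer shows up directly:

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build one random cycle over the buffer (Sattolo's algorithm), then follow it.
// Timing `hops` iterations divided by `hops` gives the access latency of the
// level the buffer fits into (L1 ~1 ns here, main memory roughly 100x that).
size_t chase(size_t buffer_elems, size_t hops) {
    if (buffer_elems < 2) return 0;
    std::vector<size_t> next(buffer_elems);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    for (size_t i = buffer_elems - 1; i > 0; --i) {           // Sattolo: a single cycle
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    size_t pos = 0;
    for (size_t i = 0; i < hops; ++i)
        pos = next[pos];                                       // dependent loads, no prefetching possible
    return pos;                                                // keep the result alive
}
```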
Okay, so of course the TLB also needs to interrelate or work with the cache.
And there's basically two ways, again, to do this.
Either we're caching physical addresses;
then we need to translate the address before the lookup,
so we basically
should at least have
all of the addresses that we want to look up in the cache in our TLB. Or we're using the virtual
addresses in the cache. Then we basically need to figure out how to deal with multiple accesses
from different processes
to the same virtual, or rather to the same physical, address,
for example.
So then we might have these different virtual addresses
there and there's different ways how this can be implemented.
It's just something to be slightly aware of
if you want to dig deeper into this kind
of hardware situations.
Okay, so I'm slightly running out of time, so let me maybe just give you a hint here and then continue from there next time.
So now there's different kind of benchmarks for this kind of stuff. This is spec-int,
so one of the typical CPU benchmarks
where you basically measure how fast is your CPU.
And here you can see for different kind of workloads.
So these spec benchmarks are typically a suite
of different kind of tools where then you gzip something,
you compile something with GCC, et cetera.
So it's really just runs different kind of programs
and checks how fast does your CPU does it.
And then you can see, well, there
are certain kind of instruction cache misses and L2 cache
misses.
And one of these programs is really bad in terms of L2 cache.
This probably does lots of large jumps in memory
or these kind of pointer chasing things that I showed you.
But all in all, you see kind of number
of misses per 1,000 instructions.
So this would be basically, for most of these things,
would be in the 99% or even higher cache hit ratio. And now if we're looking at a typical database benchmark,
then this is TPCC, so an OLTP benchmark,
where we have a smallish but reasonably large enough,
definitely larger than caches, data set,
and we're doing fast transactions over this.
So we're doing many small updates and reads in there and you can see that our instruction cache
is actually really bad, right? So we're there and in comparison it's much worse than all of the other
programs or all of the other things used there. And the L2 cache is also not great, but it's
okay-ish. But the instruction cache, that's actually bad, right? Because basically the CPU
continues to just load data in or just load instructions in, right? So, loads new code in order to do whatever we want to do. And, well, we'll
look at why this is actually tomorrow, right? So, this is basically, you see that traditional
databases are not good when it comes to cache efficiency here, especially instruction cache.
So, as a teaser for tomorrow, we'll identify where this comes from
and then look into what can we do about it.
Questions? No questions?
Well, then thank you very much and see you tomorrow.