Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching & Instruction Execution

Episode Date: May 3, 2023

...

Transcript
Starting point is 00:00:00 So today, later today, we're going to talk about instruction execution. Before that, we want to come back to the CPU and caches, and especially data layout and alignment today. But before we do so, well, I've already talked about the code examples, and I've not updated it yet, but there will be more. But then there was a question in the forum. And because we discussed it just right now, I want to also answer that right here.
Starting point is 00:00:31 And that's regarding recording the tasks. So next week, we're going to start the SIMD task. And the part, like the setup and everything, we will record. Meaning that everything, how to set up, how the tasks are run, etc., all that, and also the task details for the first task we will record. However, all of the solutions, that part, we're not going to record. So wherever we show you how the task would have been solved and our code, et cetera, that part won't be recorded.
Starting point is 00:01:11 So other people and future students can also enjoy the task just like you did. So that's the idea. Then another thing, since we're on the topic of recordings: I managed to not turn on the mic for the second half last time, which means we missed half of yesterday's recording. We're just going to post the link to last year's recording in Moodle for those of you who enjoy this asynchronously.
Starting point is 00:01:41 You can see whatever I told you yesterday from last year, basically. Exactly. So, that's basically it. Just as a reminder, so we're still in the CPU realm. And as I said, we already moved stuff a bit. So, next week, we're going to start with SIMD on Tuesday and then have the task on Wednesday. But, as I said, we'll also record this. And otherwise, well, everything is as is. We're in the instructions right now. But, as I said, we're going to move back to the CPU cache first. And this is where we left off.
Starting point is 00:02:29 So we talked about the different kinds of caches that we have and some of the cache performance, and also how to calculate cache performance, and then also the translation look-aside buffer, because that's all stuff that you need to be aware of if you're writing really high-performance code. Because, I mean, you don't really have to write for that explicitly, but you will see the effects depending on how you write your code.
Starting point is 00:02:58 And well, here, kind of closing up this cache stuff, not completely, but as one kind of overview example, here's an example of the SPECint performance numbers. So this has been done by, I don't actually remember who performed these experiments, but you can see instruction cache misses and L2 cache misses on some CPU, some server, probably an x86 server CPU. And just in terms of misses per 1,000 instructions,
Starting point is 00:03:38 and this would be, as we said, if it's misses per 1,000 instructions: with 10 misses per 1,000 instructions, our cache efficiency here would be 99%, just as we calculated last time, if you remember. And we can see that for most programs, the miss rate is actually much lower than that. So we're below the 1% realm, more in the 5 per mille realm, or even less
Starting point is 00:04:08 than that, meaning a cache efficiency above 99%. That holds for many programs. And SPECint is one of these benchmarks. It's basically a suite of different kinds of small, or not even really small, programs that will be run, and where then some profiling, or some performance measurement, happens.
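As a quick sanity check on that metric, here is a tiny sketch of the arithmetic, assuming roughly one cache access per instruction so that misses per 1,000 instructions can be read directly as a miss rate (that assumption is mine, not from the slides):

```cpp
#include <cstdio>

int main() {
    // Assumption: about one cache access per instruction, so
    // "misses per 1,000 instructions" roughly equals the miss rate.
    double misses_per_1000 = 10.0;                 // example value from above
    double miss_rate = misses_per_1000 / 1000.0;   // 0.01  -> 1%
    double hit_rate  = 1.0 - miss_rate;            // 0.99  -> 99%
    std::printf("miss rate %.2f%%, hit rate %.2f%%\n",
                miss_rate * 100.0, hit_rate * 100.0);
    // Most SPECint programs are closer to 5 misses per 1,000 instructions
    // or fewer, i.e., a hit rate above 99.5%.
    return 0;
}
```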
Starting point is 00:04:26 So say stuff like GCC, so a compiler, or gzip, so some standard programs, let's say, that you would run. And this is used to test the performance of CPUs. And well, we see most of these tasks do fine, except for this MCF, which is typically bad. And I don't exactly remember what it does, but it's really bad in terms of caching. So other than that, well, most programs
Starting point is 00:05:00 have quite good caching behavior. And in comparison to that, if we now look at something like TPC-C, and we run this on a standard database, so TPC-C is one of these OLTP benchmarks, which I told you about, which just does small transactions. So you have warehouses, and you have things that are bought and sold and stocks that need to be managed. So different kinds of transactions which just touch small parts of the data. And you can see that on the one hand the
Starting point is 00:05:38 L2 cache misses are quite high in comparison to most of the other programs. But even worse, the instruction cache misses are really bad as well. So here we'll have poor performance in terms of the instruction cache. And the question is, of course, why? So why does this happen? Why do database systems have such poor cache behavior in general? And I told you all the way in the beginning that classical database systems, MySQL, Postgres, DB2, et
Starting point is 00:06:18 cetera, in their original form were all built around this bottleneck of disk. It's all about disk. Because of that, because that's so dominant, it doesn't really matter what happens inside the caches. So, you can put a lot of instructions, etc., you can do a lot, until you see a problem in the CPU if you're waiting for disk all the time. And because of that, you basically can use, well, let's say, very fancy ways of programming,
Starting point is 00:06:52 lots of libraries, et cetera, and make your code quite, let's say, convoluted before you notice any problems. However, if you then all of a sudden fit your data into memory, you will see these, even if you don't have them in memory, you will see that your code behaves poorly in terms of cache performance. And this is because there's poor code locality, so that's basically all the problem with the instruction cache, right? So for example, you're resolving attribute types
Starting point is 00:07:27 for each tuple individually using some polymorphic function, meaning basically the code base jumps around all the time for each individual attribute. You're going to completely different code paths. And if your code is large, and think about something like Oracle, this is in the gigabytes of code base. Meaning whatever you install this is not, I don't know what's HANA, you might actually know what SAP HANA is. Have you ever touched
Starting point is 00:07:56 that? You don't remember? Anyway, its code base is also not going to be in the kilobytes. It's going to be larger. And of course, this is optimized to some degree. But say, classical SAP systems, also DB2, et cetera, these are huge systems. And the code base is not just, well, whatever. Even the core code base for just operating on the individual attributes will already be large.
Starting point is 00:08:26 And so basically the system has to go through a lot of different code paths while executing. And not just because of different attributes, but also because of the Volcano iterator model. So this is the traditional way of processing data in a database, tuple by tuple, in an iterator fashion. So you all know the software engineering pattern of an iterator, where you basically say get next, get next. And this is the way, and you will see this later, and also in the lecture separately,
Starting point is 00:09:04 but this is the way you execute traditional database operators. Basically, you stack them on top of each other, each having an iterator interface, and each of them calling the next. And this means each individual tuple will basically pass through this query plan individually, going through a lot of large code, which also means, right, a lot of functions that need to be called, a lot of code that needs to be loaded, and that won't be cached in the end if it's too much, or it won't fit into the cache, at least not into the L1 instruction cache.
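To make this a bit more concrete, here is a minimal sketch of such an iterator-style operator stack (the names Operator, Scan, and Filter are made up for illustration; a real engine is far more involved): every tuple is pulled through the plan one at a time via virtual next() calls, which is exactly the code-locality problem described above.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // grossly simplified tuple

// Iterator-style ("Volcano") operator interface: one virtual call per tuple.
struct Operator {
    virtual ~Operator() = default;
    virtual std::optional<Tuple> next() = 0;  // one tuple, or "done"
};

// Leaf operator: scans a table tuple by tuple.
struct Scan : Operator {
    const std::vector<Tuple>& table;
    size_t pos = 0;
    explicit Scan(const std::vector<Tuple>& t) : table(t) {}
    std::optional<Tuple> next() override {
        if (pos == table.size()) return std::nullopt;
        return table[pos++];
    }
};

// Filter operator: keeps pulling from its child until a tuple qualifies.
struct Filter : Operator {
    Operator& child;
    size_t column;
    int value;
    Filter(Operator& c, size_t col, int val) : child(c), column(col), value(val) {}
    std::optional<Tuple> next() override {
        while (auto t = child.next())
            if ((*t)[column] == value) return t;
        return std::nullopt;
    }
};

// Counting query: every single tuple travels through the whole stack.
size_t count_matches(const std::vector<Tuple>& table, size_t col, int val) {
    Scan scan(table);
    Filter filter(scan, col, val);
    size_t count = 0;
    while (filter.next()) ++count;
    return count;
}
```

With many operators stacked like this, the code touched per tuple quickly exceeds the L1 instruction cache.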
Starting point is 00:09:44 So that's the one problem. And then the other problem, and that's always going to be a problem to some degree, is we have to work with large amounts of data. If we're talking about an analytical database, well, we have to go through the whole table. If we're trying to find something in a large table, it's not indexed, or we just want an average over the year or something, we have to go through the whole table.
Starting point is 00:10:12 It's always going to be a lot of data. And so because of that, we'll also always see some poor caching behavior, or much poorer caching behavior than something where we just have a small working set that we're just going to iterate over and over again. So I don't know. Think about a machine learning model that is constrained in size.
Starting point is 00:10:37 If you're working on that, that might actually fit into your memory or even into your caches if it's a small model. You do a lot of operations on that, so that's going to be very local if the data sets are not too large, etc. In the database, we're assuming we're going to have a lot of data. We're going to go through this over and over again, but mostly reading through this a lot, and that's going to be slow. If we're using an index, we don't have to read as much data,
Starting point is 00:11:08 but it means we're going to randomly access data. And again, this will have poor caching behavior because we basically jump somewhere randomly in the large amount of data. And basically, the cache won't help us that much. However, we can improve stuff, right? So it's not like there's nothing we can do just because it's large amounts of data that we have to go through. We can still do something by proper data layout and by proper code management. I mean, optimizing your code is always an option, basically.
Starting point is 00:11:51 And you can, and you will, learn to some degree how you can optimize for cache performance in terms of data structures, for example. So you can build your data structures in a way that they have better or worse caching behavior, and a lot of it is all around: okay, how much can I fit into my cache lines? Will I read or use all of the data that's within the cache line or not? How much of the data fits into my cache line anyway? Or, if I have to use multiple cache lines, does it fit into the cache at all?
Starting point is 00:12:32 So navigating this, that makes a lot of sense. Also how to access the data in the end. So do I just do like random jumps, which means I'm just using small parts of the cache line maybe? Or can I use everything that I have in a cache line? So that's something where we can work on. And not only, I mean, all systems, not only database systems, favor cache-friendly code. And one particular thing, and that's maybe something that we don't teach that much in this course, because you're going to optimize a lot for certain types of
Starting point is 00:13:15 CPUs or for a specific type of CPU, you'll see this is highly platform specific and this is highly CPU specific. So cache sizes, etc. To get the best performance, you have to be really, really specific with your code. In practice, you don't always want that. In practice, you actually want to make sure that your code is reusable and can be used on other systems as well. You always need to do this kind of trade-off. How much do I optimize for a certain platform or for a type of platform?
Starting point is 00:13:54 How much do I, by doing so, limit my potential for reusing the code on a different platform? That's just something to keep in mind. If you're optimizing just for one cache size, just for one block size, for one associativity, et cetera, then your code is highly specific and will perform for sure not as well on another platform, maybe even really poorly on another platform. And this is kind of if you're setting these hard thresholds. There's hope.
Starting point is 00:14:28 I mean, there are some ways to mitigate these problems, and we'll also talk about this to some degree. But it's something to keep in mind, right? If you're building systems, don't over-optimize for a single platform. And ideally, you try to write generic code. You try to make your code somehow adaptable. But there are basically some rules of thumb
Starting point is 00:14:54 how you can always get good performance. One is: work on a small working set, so that you have temporal locality. Another is: use small strides, so that you keep spatial locality and work on data that's spatially close together.
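As a small, hand-wavy sketch of the stride rule (my own toy example, assuming 64-byte cache lines): both functions below touch every element, but the strided version uses only 4 of the 64 bytes of each cache line it pulls in per pass, so if the data does not fit into the cache it keeps reloading the same lines.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential access: neighboring elements share cache lines, so every
// loaded line is fully used (good spatial locality).
int64_t sum_sequential(const std::vector<int>& data) {
    int64_t sum = 0;
    for (size_t i = 0; i < data.size(); ++i) sum += data[i];
    return sum;
}

// Strided access: with a stride of 16 ints (64 bytes), each access lands
// on a new cache line but only uses one int of it. If the data is larger
// than the cache, every one of the 16 passes reloads the same lines.
int64_t sum_strided(const std::vector<int>& data, size_t stride = 16) {
    int64_t sum = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < data.size(); i += stride)
            sum += data[i];
    return sum;
}
```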
Starting point is 00:15:16 So that will always give you better performance than if you have no temporal locality and no spatial locality. And then, if you have a loop, basically try to focus on the inner loop. That will also give you the most performance. But in general, and that's also something that can happen quite easily, don't optimize prematurely, right? So if you have one loop where you think you can optimize the hell out of it, make sure that this is actually a loop that counts, so that it actually has an impact on
Starting point is 00:15:50 your overall performance. So before doing micro-level optimizations, check if this is actually a hotspot in your system. If everything is bound by something else, well, you might spend your time better on another interface that you have a problem with, rather than on that. Or if your application is super slow, super inefficient, often you can get much better performance just by changing the application code a little bit. Or say, for example, you get really crappy queries and you don't have a query optimizer,
Starting point is 00:16:22 then it doesn't really make sense to improve your system execution if you're just working on crappy queries, right? So there's always a trade-off; you always again need to make sure that you're focusing on the right bottleneck. And that's something where you should always do this kind of benchmarking or profiling. We'll have a session on profiling where we'll talk a bit more about this and how you can actually do this yourself with the code.
Starting point is 00:16:57 And on the other hand, generally, it's also always good to do some back of the envelope calculations if this actually could be the case, if this could be the performance bottleneck or not. So with that, let's talk a bit about data layout, which is also something where we can improve, like I say, caching behavior quite easily by doing some tweaking.
Starting point is 00:17:26 And there are some trade-offs. And this is basically also how to directly bring some of the stuff that we did so far to database systems, right? So far, I mean, this was about how the CPU caches work, etc. That's more generic; it works the same for all kinds of systems. But data layout, this is very database-specific. OK, so let's talk about caches for data processing or for database systems.
Starting point is 00:17:58 And the question is, how can we improve data cache usage? And for this we basically have to go and look at different data models, or data storage models, and data layouts and query execution models. And the query execution models, as I said, we'll look into more deeply in a separate session, but data layout is something that's directly related to what we've seen so far. And say we have a simple query, this would be on the TPC-H dataset, so line item is the big fact table, one of the two big fact tables in TPC-H, and the biggest one. And here we're just going to count all of the line items, so individual cells,
Starting point is 00:18:46 basically, in the table, which were done on a certain date, where the day would be 26th of September in 2009. And if we don't have any index, if we don't know anything else about the table, then this means we have to do a full scan. So I mean, potentially, we have this sorted by date or something, then we could reduce the amount of data that we have to look through. But in general, there's no guarantee about this. If we don't know anything else, we have to do a full table scan.
Starting point is 00:19:26 And there are basically two approaches for how to lay out this data, in storage and in memory as well. And then of course there are some hybrid models as well. We'll also talk about this. And this is exactly the same as we had earlier, or not exactly the same, but more or less the same as we had earlier, with the array layout. So we have a row-based layout and we have a column layout. So this is the same thing that we talked about earlier. So I can store an array, I can store a table, either row by row or I can store it column
Starting point is 00:20:08 by column. Row by row is good basically if we just want to write the data, right? Or if we need the whole row, or if, like in the array example, we needed to go through the complete rows one by one, then a row-based layout makes total sense. And this is also called an N-ary storage model, or NSM. And this basically means we're storing all of the attributes of a single tuple one by one within our pages, and then the complete tuples
Starting point is 00:20:47 actually next to each other. The other way of storing this, just basically column by column or attribute by attribute, also called decomposition storage model, would mean we're actually decomposing, and that's why it's called like that, we're decomposing the tuples and just store individual attributes.
Starting point is 00:21:11 And well, this is basically good if we're just looking at a single attribute, right? Makes total sense because then we have all of the information there, but it's more costly if we have to reconstruct the tuple. So say for example, looking at our example here, just looking at an individual attribute, the decomposition model would be faster.
Starting point is 00:21:36 If we need all of the attributes, say, for example, rather than select count star we would say select the whole tuple, right, select star, then we would have to reproduce, we would have to rejoin, these individual subsets, and then it becomes this kind of trade-off. If we do a full table scan in a row store (and so they are also called row stores and column stores, depending on how they store the data on disk or in memory), then this means one tuple is next to the other.
Starting point is 00:22:21 Again, thinking about everything in memory, we would have our virtual memory. Everything is nicely laid out as one long array, basically, where we have one tuple after the other, neatly packed. However, the tuples have different kinds of attributes, with different lengths, maybe of fixed size, maybe of variable size. But then we have this ship date, which is actually what we're looking for, somewhere in there.
Starting point is 00:22:52 If we're lucky, it's a fixed position. So it's somewhere in there in the middle of our tuple here. Now, the problem is that this is basically what we're going to read if we're doing our read. This is the cache block boundaries, as we know. So we read individual, like if we're talking everything is in memory, we're going to read individual cache lines into the caches and we're going to work on these cache lines. And here now we see we don't know exactly.
Starting point is 00:23:24 Maybe we have something where we know where these tuples start or not. But in general, if we're just scanning through, we don't know exactly which part we are interested in. So this means we would actually have to read a large amount of irrelevant data to find the individual ship date values that we're looking for. But everything else from the tuple is irrelevant, but we still need to read it to find this ship date and work with that. Even if we're just looking at the individual cache lines
Starting point is 00:23:59 that we're interested in, there's still a lot of information that's completely irrelevant to us. So that means the CPU will wait a lot just reading all this data just to find out, well, this is not relevant for me. This is not relevant for me. Maybe this piece of information is relevant for me, but even the rest of the cache line is not relevant. If we have a column store layout, so we're basically
Starting point is 00:24:26 decomposing the individual tuples into their attributes, then we have pages in memory that look like this. So we have one page that just is full with ship dates. And this means if here then we have to read the individual cache blocks or individual cache lines, then all of the information that we're reading is actually relevant for us. So this means in this case, we're going to be much faster in processing the data. The cache will be much more efficient just because of the layout.
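Here is a minimal sketch of the two layouts for this kind of scan (field names and types are invented, and a real lineitem tuple is much wider): in the row layout every cache line we read is mostly attributes we don't care about, in the column layout every byte we read is a ship date.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row layout (NSM): all attributes of a tuple stored next to each other.
struct LineitemRow {
    int32_t orderkey;
    int32_t quantity;
    double  price;
    int32_t shipdate;   // e.g., days since some epoch
    // ... a real lineitem tuple has many more attributes
};

size_t count_row(const std::vector<LineitemRow>& rows, int32_t date) {
    size_t count = 0;
    for (const auto& r : rows)          // drags whole tuples through the cache
        if (r.shipdate == date) ++count;
    return count;
}

// Column layout (DSM): each attribute stored in its own contiguous array.
struct LineitemColumns {
    std::vector<int32_t> orderkey;
    std::vector<int32_t> quantity;
    std::vector<double>  price;
    std::vector<int32_t> shipdate;
};

size_t count_column(const LineitemColumns& cols, int32_t date) {
    size_t count = 0;
    for (int32_t d : cols.shipdate)     // only ship dates enter the cache
        if (d == date) ++count;
    return count;
}
```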
Starting point is 00:25:03 However, of course, if later we have to reconstruct the tuple, this will be, again, more costly, because we'd have to read all of the individual parts. We have to join this back, which is something that in the original layout, in the row store layout, we still have the full tuple. So if my final result, if say, for example, all of my tuples will match and we have to reproduce the full tuple, then this will be cheaper again because I have all of the information there. I can output this directly.
Starting point is 00:25:36 Well, in this setup, I first have to find the relevant tuples quickly, but then I have to reproduce everything else. I have to read everything, like all of the other attributes, and I have to recombine this. However, as I said, it's a trade-off, but typically, I mean, we have some selectivity, and then this makes a lot of sense, right? And if we're not looking at the complete tuple, but say in our count star, we can do the count star just on the ship date, right?
Starting point is 00:26:13 So all that we need for this query, for example, would be just looking at these pages, and we can completely calculate the result. And also consider the cost of reading this. We're still looking at each individual tuple here, each individual attribute, only once. This means we don't get good caching behavior per tuple; however, we're reading complete cache lines, so we're reading multiple tuples at the same time, so the same cache line will be used multiple times in our processing. So we have much better caching behavior than we had earlier,
Starting point is 00:27:01 meaning not all of our accesses will be cache misses. Because in the first case, basically, each individual access, like every cache line, contains at most a single relevant attribute. This means I won't have it in the cache unless I'm using some kind of prefetching. But if I don't use prefetching, I won't have it in the cache, which means I will have to read it when I actually require it, and that means every single access will be a cache miss. In this case, at least for the number of attributes in the cache line, we will
Starting point is 00:27:52 have cache hits and better caching effects here. Yeah, and if we're lucky, the full data might even fit into the caches, and then it gets actually much better. We're going to have very nice performance. And for some smaller tables that might actually work. Okay, but the trade-off, and I already hinted at that, is that we have this recombination cost, right? So we have to recombine the tuples in the end. We need to perform all these joins
Starting point is 00:28:30 to basically get the final tuples again. And this is really dependent, the trade-off here is really dependent, on the workload. If we're just counting, for example, there is no cost at all, because we can optimize this away; we can just do this on the ship date. If we need the full tuple in the end, then we do pay that recombination cost. And a lot of this will again be application-specific, right? So some of this you can decide on the application side: do I really need to count in the application, or do I send the count to the database? Then the database could be much faster, for example.
Starting point is 00:29:06 However, there are also ways to kind of get a hybrid approach. And one way is PAX, the Partition Attributes Across layout. So rather than having the individual attributes in separate pages, we can do mini-pages inside a memory page and have all of the attributes of a single tuple within these sub-pages, and basically group the attributes in a way so that we still have this cache performance. But we also have locality in the memory pages, so that we don't
Starting point is 00:29:47 basically run into the problem that we have pages maybe swapped out somewhere for parts of the tuples, which would give us additional problems. Right, and then of course, rather than saying, okay, I want this hybrid model just within a single page, with column and whatnot layout, I can also say, well, why not just decompose the data set into two parts? I have a column store layout and I have a row store layout for different parts of the data. And typically, say for example, for new incoming data that arrives fast, it's much faster to write it in a row layout. Or if I have to iterate in an OLTP fashion over this,
Starting point is 00:30:35 so I have to use the whole tuple, I have to update stuff in there, then the row store layout actually makes sense. Or if I delete tuples, right, then I just have to touch this one page rather than everything. So then I can basically say I store the new data, new incoming data in a row layout, and all the data that's not going to change much anymore,
Starting point is 00:30:59 I'm storing in a column store layout. And this is what typical column stores today actually do. So SAP HANA, for example, or C-Store, whose commercial version is Vertica, right, I think. So other systems that use this column layout, they basically have this kind of setup where they
Starting point is 00:31:27 have the new data in a row store and older data then in the column store. So with that, there's more we can do. And one thing that we also can, to some degree, at least think about is alignment. So basically, our data access, any kind of data access, can be aligned or unaligned. And that basically means, do we read the data basically
Starting point is 00:32:02 in cache lines, or, I mean, aligned for the CPU, basically, so that it can easily be put into the registers, or not? Or will the CPU have to move this around? So in the past, basically in older architectures, everything needed to be aligned. So basically, if I'm having an integer, which is like four bytes, right, it needs to start at address zero or at a multiple of four, so anything where the address modulo four is zero. It cannot be shifted somehow,
Starting point is 00:32:42 because otherwise the CPU would have to somehow shift it back to have it in the right position, and old CPUs wouldn't be able to do this. So if you think about memory in this way, basically, if we're having 4-byte values, for example, then anything that starts at address 0, or at an address that is 0 modulo 4, would be aligned.
Starting point is 00:33:08 Everything in between there would not be aligned. And anything not aligned basically needs some extra work for the CPU to move it back into an aligned position, to then have it easily copied into the registers, etc. This basically depends on word size and cache sizes, or cache line sizes. The CPU, and also the compiler, will do automatic padding or automatic alignment to fix this. However, it takes some time. Modern CPUs put a lot of work into making data
Starting point is 00:33:52 or into mitigating these problems. However, if you know that this happens, you can actually get some additional performance. So let's look at the typical structure representation. So here we have basically a very simple translation of something like a table. Say, for example, we have four integers, a long, and then a next pointer.
Starting point is 00:34:18 So this could be something like an index structure that we're using. And this is just then a block of memory. And the compiler will just translate this directly into this kind of structure. And all of the fields, the compiler must basically translate them in the order that we've described them in our struct.
Starting point is 00:34:46 So it cannot reorder this. So this is basically what the compiler just adheres to, so we're not getting into weird effects somewhere. And now the compiler might need to do some padding. And so the compiler basically determines the overall size and determines the positions of fields. And the machine-level program then has no understanding of the structures that you actually put.
Starting point is 00:35:16 So it will just see this area of memory. And the CPU basically expects this to somehow be aligned. If it's not aligned, meaning it doesn't start at address zero, for example; I mean, in this case here basically everything would be nicely aligned, so everything is fine. But if something is shifted, so say, for example, we have one byte here and then our integer starts, the compiler or the CPU will have to realign this. And the compiler will try to optimize this as well. So on x86, Intel, for example, recommends
Starting point is 00:35:57 the data to be aligned. So it's basically everything should be aligned. So meaning that we are, as I said, in the word size, so in four bytes, for example, we want to make sure that every integer starts at these addresses. And the CPU still, even if it's not the case, the CPU still will work, right?
Starting point is 00:36:27 But it needs to do some reordering. So let's look at another example, maybe. So here, we start with a similar data structure as earlier. So we start with a character. Then we have an integer, an array with two elements, and then we have a double. And you can see in this case, basically, we're starting at, again, at 0, for example, or wherever we're starting.
Starting point is 00:36:54 Doesn't really matter. In this case, typically, we're not aligned. So first, we have this character, and then we have the integers. So these will not be aligned because we're not at address 0 but at address 1, for example. And that means, ideally, we want to get some alignment again. And what the compiler typically will do
Starting point is 00:37:23 is it will put some alignment there. So it will put some additional space here. If it doesn't, then the CPU will have to do some alignment in memory, will have to move it. So the CPU might need to read multiple cache lines, or might need to read multiple words, to get the correct word and then move it, shift it back into the right alignment, so that in the end it can actually
Starting point is 00:37:50 work with these individual integers in the right alignment. And doubles, for example, need even stricter alignment to have them put into the right place. And if the compiler does this padding, this will actually have better performance than the unaligned data, because there the CPU would have to realign.
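Assuming the struct on the slide is roughly a char, an int array with two elements, and a double (my guess at the slide contents), a tiny program with offsetof and sizeof makes the inserted padding visible:

```cpp
#include <cstddef>
#include <cstdio>

struct AsDeclared {   // declared order as discussed above
    char   c;         // offset 0
    int    i[2];      // typically padded to offset 4 (4-byte alignment)
    double d;         // typically padded to offset 16 (8-byte alignment)
};

int main() {
    std::printf("c at %zu, i at %zu, d at %zu, sizeof = %zu\n",
                offsetof(AsDeclared, c), offsetof(AsDeclared, i),
                offsetof(AsDeclared, d), sizeof(AsDeclared));
    // On a typical x86-64 compiler this prints: c at 0, i at 4, d at 16,
    // sizeof = 24, i.e., 7 of the 24 bytes are padding.
    return 0;
}
```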
Starting point is 00:38:15 OK, so typically, what the compiler does is the compiler maintains the declared ordering. So as we said, it won't do any reordering. You can do reordering. You can just change your struct if you want, but the compiler cannot. And so what the compiler does, it can insert some padding. So it can insert this kind of empty space here
Starting point is 00:38:42 in order to make sure that all of the data or the elements in the struct are properly aligned. Then the overall struct should be aligned or must be aligned according to the largest field and then the total struct size also must be a multiple of its alignment. And again, the compiler will insert some padding here. And for strings and other variable length data, it will be split into the length, which will be fixed, and then the variable size tail. And the header will contain a pointer to the tail, and the variable data will be placed at the end of the struct. So the fixed part basically will be positioned in the same position
Starting point is 00:39:35 as the order of the struct, but then the variable part will be put towards the end. So now, if we want, we can basically make sure that we align our data, or we reorder our data, so that it's better aligned in memory and the compiler doesn't have to do as much padding. So a simple example would be here: instead of starting with the character,
Starting point is 00:40:02 we can start with the integer and then add the characters later on, which would give us better alignment, a tighter representation in memory, and less padding. So rather than using 12 bytes, because we would want to have the integer aligned at a multiple of 4 again, so at position 4 in that case, we now have the integer at the beginning, and the characters, since they're single bytes, can be tightly packed; but the overall struct size should still be a multiple of its alignment again.
Starting point is 00:40:41 And, well, you can test this. For this, I actually don't have an example on the slides here. But say you have a simple struct with different kinds of alignments, with padding and reordering. And the difference, accordingly (again, this I didn't test myself, it's from Andy Pavlo's course), can be significant in throughput
Starting point is 00:41:07 that you get. So basically with no alignment, you're below like half a megabyte throughput. With padding, you can get 11 megabytes of throughput in memory speed. And with the reordering, even more than that. So making this efficient can have a significant impact here. Again, this one I didn't test. I just used it from the lecture, actually. But something you can try. So write a small program and test it.
Starting point is 00:41:38 Yes? If reordering has such a big impact, why isn't the compiler able to do so? Is it too hard for the compiler to reason about it? I'm assuming this has, well, it's a good question. I'm assuming this has something to do with compatibility. So if you're basically not sure anymore how is the data laid out in memory, then you cannot work with the data on different areas in your code
Starting point is 00:42:07 or something like that. So that would be my assumption. Because that would basically mean you cannot reason at all anymore about how the data is laid out in memory. So you're writing some code, and whatever comes out, how the data is in memory, you don't know anymore. So that's my assumption. That's why there's basically a clear way,
Starting point is 00:42:36 a clear translation of a struct to how it's laid out in memory so that you're sure this is, if you want to access this again, you compile something else, it will also be able to access this, my assumption. OK, other questions? And a lot of this you can actually also test. I'll show you in a bit how you can play around with this as well.
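A quick way to play around with this is the simple reordering example from before, reconstructed here as a small sketch (two single-byte characters and one integer, which is my reading of the slide): printing sizeof for both orderings shows the saved padding, and looping over large arrays of the two variants is an easy way to check the throughput effect yourself.

```cpp
#include <cstdio>

// Character first: 3 bytes of padding before the int and 3 bytes of
// tail padding, because the struct alignment is 4.
struct CharFirst {
    char a;   // offset 0
    int  i;   // offset 4
    char b;   // offset 8
};            // typically 12 bytes

// Integer first: the single-byte chars pack tightly behind it.
struct IntFirst {
    int  i;   // offset 0
    char a;   // offset 4
    char b;   // offset 5
};            // typically 8 bytes

int main() {
    std::printf("char-first: %zu bytes, int-first: %zu bytes\n",
                sizeof(CharFirst), sizeof(IntFirst));
    return 0;
}
```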
Starting point is 00:43:08 Okay, so this basically sums up the computer architecture and gives us a nice point to do, let's say, a three-minute break. Okay, I guess we can get started again. So, instruction execution. So, we're still in the CPU architecture. Now this is more about how does the CPU actually execute instructions internally. So I want to talk about micro operations and pipelining and then pipelining hazards. Basically, when does this not work well?
Starting point is 00:43:41 What's bad in or what works well with modern CPU architectures, what does not work well with modern CPU architectures, at least to some degree. And some of the current, I'm going to show you some of the current pipeline architectures if we get that far. If not, we're going to see it next time. So well, let me see. Yeah. So let's get started. So basically, if you think about assembly code that you could write, so later on we'll see in the SIMD task, for example, there's these intrinsics, which appear to be like these direct instructions for the CPU.
Starting point is 00:44:28 But the CPU actually breaks them down into even smaller instructions that will be executed. And any kind of instruction that you can think of will be broken down into micro-operations and executed in pipelines. So it's not just, I'm doing an addition now, for example; this will be broken down into smaller operations. And even for an addition, for example, you first need to prepare the CPU for this, like the execution pipeline, say, this will be an addition now, and then you need to load the registers.
Starting point is 00:45:04 You have to perform the execution of the instruction, and then you have to write the result back. And these are basically the phases. And this is simplified, so modern CPU architectures might have longer pipelines, like five-step, six-step pipelines, and some additional connections in there. But this is, let's say, the basic theoretical concept. So we're starting with an instruction fetch cycle. This is where we're getting the next instruction, basically. So, what will be executed next?
Starting point is 00:45:41 And we're updating the program counter. So the program counter basically tells us where are we in our program right now and basically loads the correct instruction. Then we have the instruction decode and register fetch cycle, which could also be two individual steps. But here, for simplicity, and some architectures do it like this, we're basically saying, well, we're getting the instruction,
Starting point is 00:46:10 now we have to translate it to something that the CPU understands. And we're reading the registers. This is typically done in parallel. So we're basically trying to find what we need, basically the data that will be used. And this is done from the registers. And this also might evaluate the branch conditions. So if we have something like an if statement in there or a for loop, then the instruction
Starting point is 00:46:39 decode and register fetch cycle will basically say, well, this branch, we're going here or we're going there right now. Then we have the execution and effective address cycle. So here, say, for example, we have all of the data in our registers that we want to work on right now. Then we're telling the ALU, so the Arithmetic Logical Unit, please add up these two values, or whatever is in the registers,
Starting point is 00:47:14 and write it back to that register, for example. Or, if we have a load or a store, then we would do a read or write from memory in this phase. So this would basically also mean, well, I don't have the data in the registers yet. I need to get it from memory. If it's in the cache, in this case, we'll just read it from the cache.
Starting point is 00:47:38 If it's not in the cache, then this means, well, we're going to go to the memory unit and ask the memory unit, please give us this data. And that's basically where we are right now. So then, in the last step, we have the write-back cycle. This basically means writing the result into the register file. So we're writing, say, the result of our addition into the correct register, or we read the data that we have from memory
Starting point is 00:48:12 into the register that we want, or we take the result that we have from our ALU and write it back there. And these two steps are also called, whoops, this is too fast. These two, or rather these four parts, are also split up into frontend and backend. This is also what you see
Starting point is 00:48:36 if you do some profiling. You might see frontend and backend stalls. And this basically means, well, the frontend is basically all about the code. So if your code is not structured properly, or you have some self-modifying code or something like that, then you might see problems in the frontend. Or you have bad code locality.
Starting point is 00:49:04 That might be a problem. Or in the backend: if nothing makes progress, then you have some contention there, or you have bad cache locality, everything needs to come from memory all the time, you have stalls there; these would be backend problems. And ideally, if you don't have any problems, you see that all of the
Starting point is 00:49:25 time instructions are retired, meaning you execute the instruction, the instruction is done, it will be retired, and everything worked out nicely. So now we have this pipeline, right? As I said, four, maybe more, steps. And if we have non-pipelined execution, like in a very old kind of CPU, this would mean we execute one instruction after the other. Each of these steps would basically be one clock cycle, right? So this means we use basically four clock cycles to execute one instruction, and then, if the next instruction comes, we have again four clock cycles for the next instruction. That means we're going to execute basically two instructions per eight clock cycles, or a quarter of an instruction per clock cycle. And this is called the instructions per cycle, so a quarter, 0.25 instructions per cycle,
Starting point is 00:50:38 or four cycles per instruction. So the CPI would be four, or the IPC would be 0.25. And instructions per cycle, of course, you want to have as high as possible, and this can go above 1 in typical modern CPUs, even single-threaded. And you want as few cycles per instruction as possible, and this can go below 1 in modern CPUs. And we'll see why in a bit.
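The numbers spelled out once more, as a tiny sketch (a simplified four-stage model, ignoring stalls; my own illustration):

```cpp
#include <cstdio>

int main() {
    // Non-pipelined, 4-stage execution: the next instruction can only
    // start once all 4 stages of the previous one are done.
    double cpi_nonpipelined = 4.0;
    double ipc_nonpipelined = 1.0 / cpi_nonpipelined;   // 0.25

    // Perfectly pipelined single pipeline: once the pipeline is full,
    // one instruction retires every cycle.
    double cpi_pipelined = 1.0;
    double ipc_pipelined = 1.0 / cpi_pipelined;         // 1.0

    std::printf("non-pipelined: CPI = %.0f, IPC = %.2f\n",
                cpi_nonpipelined, ipc_nonpipelined);
    std::printf("pipelined:     CPI = %.0f, IPC = %.2f\n",
                cpi_pipelined, ipc_pipelined);
    return 0;
}
```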
Starting point is 00:51:15 of course we're going to have a bad IPC or CPI, because basically we're just doing one thing at a time. But these individual steps in the pipeline, they are different parts of the CPU actually. So loading and storing, writing to registers, etc. This is something that we can do in parallel, also fetching instructions and decoding them. We've seen all this additional stuff on the CPU or the structures. At least to some degree, we've seen this.
Starting point is 00:51:49 So this can be done separately, also working with the caches, loading, et cetera. So we can actually do multiple things in parallel. And in this case, with these micro-instructions, we're using so-called pipelining, or pipeline parallelism. Meaning, we can maybe do only one instruction fetch at a time, but we can do one every cycle. In every clock cycle, we can fetch another instruction. So while one instruction is being decoded, we can already fetch the next one.
Starting point is 00:52:29 So this basically goes on and on. And that means we have a nice, fully loaded pipeline. And we get an instructions per cycle count of one, or equivalently a cycles per instruction count of one, right? They're basically the same then. This is perfectly pipelined, and with a single pipeline this is as good as we can get. Of course, this won't necessarily always work, but ideally we get this as often as possible. And in an ideal world, basically, if we break down our pipeline into smaller stages, say k stages, and this is for general pipelining, right, even if we do stream processing and put multiple operators on different nodes or stuff like that, if we do pipeline parallelism, ideally the hope is we get a throughput improvement of a factor of the number of pipeline stages, right?
Starting point is 00:53:34 However, of course, the slowest sub-instruction in this case determines the overall clock frequency. So basically, whatever the slowest step here is determines the fastest we can clock our CPU. So we try to break down the instructions into these equal-length parts, and then, of course, try to reduce the number of cycles it takes to execute an instruction. We always have to go through all the steps, but the question is, do we have some stalls in there?
Starting point is 00:54:06 Does something need more time? And say, for example, if we go to memory, this will stall everything. There we will need to wait. But this, for example, this is actually truly parallel in here already. And this is also why we get this very high throughput already. And we can execute multiple instructions at the same time.
Starting point is 00:54:29 At the same time, it also means that each instruction will not just take a single clock cycle to execute. This is also you have to think about. It's not going to be one clock cycle and this is done. It's going to be one clock cycle and one stage is done. We'll need a couple of clock cycles to actually in any way perform a single instruction. OK.
Starting point is 00:54:56 Now this is nice, right? We have this pipeline, but there's some problems with the pipeline. So there's some different kind of hazards for the pipeline, different kind of things that can happen that break our pipeline. So there's some different kind of hazards for the pipeline, different kind of things that can happen that break our pipeline. And there's structural hazards. So say, for example, we have different pipeline stages
Starting point is 00:55:15 that need the same functional unit on the CPU. So say, for example, we would have multiple stages that need the memory unit or something like that, or need the decode stage or something like that. Then this means, well, we have a contention there. They have to wait. This will break our pipeline. And this can happen, right?
Starting point is 00:55:40 We will have multiple pipelines also on a single CPU. And if all of them need the same functional units, we'll have a problem. They basically have to wait. Even if, say, for example, we have multiple pipelines, we don't have enough ALUs on the CPU on the single core, well, they have to wait. It doesn't work other way.
Starting point is 00:56:02 But then there are also data hazards, meaning that we have two operations, basically, or two instructions, that work on the same data. And one instruction is not ready with its result when the other instruction needs it. So we might be in this register fetch part, where we actually want the result of the previous instruction,
Starting point is 00:56:27 but the result is not there. We're not in the write back phase of the first instruction yet. So then we basically have a problem here as well. We might have control hazards, meaning we have different kind of branches, and all of a sudden the branch goes in the wrong way, or we have to wait for the branch condition to be executed to be found out.
Starting point is 00:56:50 This might be a complicated calculation as well. Well, then we're losing our pipelining again. And then, of course, data access. If we have to go to memory, then, well, we cannot. Memory will take a couple of cycles, 200 cycles, something like that. Within these cycles, if we really have to wait for the data to become available,
Starting point is 00:57:18 the CPU won't be able to do anything. So then all of these will basically lead to pipeline stalls. And a pipeline stall is basically, well, the CPU can basically not do anything while waiting for something. So say, for example, we have one memory access unit. This is an example that I just gave you. So our instruction 1 or instruction i wants to do a memory fetch. And then another instruction, i plus 2, for example, is basically issued.
Starting point is 00:57:59 And they want to basically do the same thing. So say, for example, instruction fetch and memory access use the same unit, and so whenever we're doing a memory access, we cannot do any instruction fetch. Then, as soon as instruction i plus 2 wants to use the memory unit for its instruction fetch, we have a problem and we have to stall. And stalling means, well, the CPU in this stage doesn't do anything here in this unit, or there is no unit to do something, so this pipeline step won't be executed, it will just be stalled for a cycle, and we're basically decreasing our instructions per cycle counter.
Starting point is 00:58:51 up one instruction. So you can see here, this cycle, say cycle four or something, this instruction is ready. Cycle five, this instruction is ready. Cycle six, no instruction is ready. So we have to go all the way to cycle 7 to get the next instruction. And here, this is all about CPU architecture. And it depends on the workloads, of course.
Starting point is 00:59:16 But do you have a CPU that fits your workload? I mean, you might even be able to code in a way. But you probably have to really try that you only work with small parts of the CPU. And then you get into these structural hazards. So typically, the CPU designers will try to make sure that the CPU is well balanced and has proper hardware provisioning.
Starting point is 00:59:48 And then the compiler will try to do some reordering to some degree or the CPU also some reordering of the instructions in a way that it doesn't come to these stalls. But this might happen if you have a very specific kind of workload. Then we have the data hazards. And this is basically whenever two instructions or multiple instructions try to access the same data. And one thing is we have the read after write. This is basically if a register
Starting point is 01:00:28 is read by an instruction before it's actually written by a previous instruction, right? So we have our pipeline all the way at the end of the pipeline, we're writing the result. And now another instruction will try to read this. So say this next instruction will want to work on this result. Then we might issue the read, so the register fetch, before the register was actually written. And then we have a problem.
Starting point is 01:01:01 So then our results, like the result is not in the register. We cannot read it in this cycle. So then we need to put a stall in there. Basically, the CPU somehow has to wait for the result in order then to fetch the register and give it to the next instruction. Then there's also the possibility to have a write after a read.
Starting point is 01:01:28 So basically, the write of a register by a later instruction happens before an earlier instruction has actually done its read. And these kinds of problems can basically only happen if the instructions are reordered by the CPU. So modern CPUs will reorder instructions in order to get better performance, in order to use all the functional units properly.
Starting point is 01:02:01 So if some instructions take longer time, need to read something from memory, the CPU will try to do some other instructions in the meantime. But in this case, we might actually write something after it's been reordering. And an earlier instruction writes basically a value that should have been read by another instruction.
Starting point is 01:02:28 If this happens, of course, then the read must be stalled until the write has been done. And write after write is the same thing. So we're basically reordering two writes, and then we might end up with the wrong result unless we're basically stopping and changing the pipeline here. So, let's look at an example. So, here's some basic,
Starting point is 01:02:56 So, we have an addition, a subtraction, an and, and an or on different registers. And real assembly looks something like this, right? Usually, you have the different registers, and the first one would be the result here. And here, say, for example, in the add, we're writing to register 1.
Starting point is 01:03:26 And then in the subtraction, we're reading from register 1. And now, if we're basically looking at this here, then in the instruction decode and register fetch cycle here, we're basically reading what's been written here. So this would result in a problem, because this has not been written yet; we don't have the result yet. And in hardware, this is solved to some degree with some bypass and forwarding hardware.
Starting point is 01:03:55 Or say, for example, you can basically, in the instruction decode, you can basically say, OK, the result will be in this register and I'm directly using this result in my instruction here. So, then I don't have to stall the pipeline, but I need something in here or in this case also, right? So, I want to access this value. I know which register it will be in, so I can basically, the CPU can pass through the value, can say, okay, this will be ready in the cycle when you actually need it, we will have it there for you. So then in this execution, the data will already be here, and in this case as well.
Starting point is 01:04:38 However, it needs special hardware and special support. If not, we would have to stall the pipeline here. So if not, we would basically do the instruction decode at this stage when the writeback has already been performed. Because then we know the data is in the register already. Another example here, we're basically writing back the register results in the load. And then another instruction basically reads what has been written from the load instruction before it was written. And this one actually cannot really be mitigated because we don't know how long this will be taking.
Starting point is 01:05:28 So if we're reading with a load, then the hardware needs to basically be stalled. So then until the load is performed, then the instruction or the pipeline will be stalled, and only then it will continue in this case. So here in the load, while this has been executed, as within the writeback phase, we know where it's going to be. We can actually continue here.
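One way this shows up in ordinary code, and a common software-side trick around it, is sketched below (my own toy example, not from the lecture; an optimizing compiler may already apply such a transformation for you): the first loop is one long read-after-write chain, the second uses several independent accumulators so the core can overlap the additions and hide load latency.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition depends on the previous one
// (a read-after-write chain), so the adds cannot overlap.
int64_t sum_single(const std::vector<int64_t>& v) {
    int64_t s = 0;
    for (size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Four independent accumulators: the four chains do not depend on each
// other, so several additions (and their loads) can be in flight at once.
int64_t sum_unrolled(const std::vector<int64_t>& v) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];   // handle the leftover elements
    return s0 + s1 + s2 + s3;
}
```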
Starting point is 01:06:04 By scheduling the tasks in a different way, by reordering the instructions, you can mitigate some of this by basically not depending on the loads all the time right away. Then we have the control hazards and this is basically from the branches. As soon as we have the control hazards and this is basically from the branches. As soon as we have branches, we have to know how these branches will be evaluated. A simple way of doing this is basically as soon as we're in the branch, we're executing the branch, flushing the pipeline and start from there when we know what the branch result will be.
Starting point is 01:06:49 But of course this is costly because we basically have to completely flush the pipeline. So here we have our branch instruction and we need to evaluate this instruction in order to know where to continue. And this basically means at this point, we're basically waiting and then continue with the target instruction here. If we don't know the result yet, then we might even have to completely flush and continue from there. But this means in this case, we're paying actually a lot.
Starting point is 01:07:27 So now the question is, how can we mitigate this in software? So how can we deal with this? A lot of this hardware will try to mitigate this. But there's also some things that we can do in order to mitigate this. And some stuff, well, first let's look at some of the things that are actually in hardware and that we can utilize in software. And some of it directly utilized by the CPU and the compiler.
Starting point is 01:07:56 So after the fetch, one thing is that there's superscalar hardware. So there is multiple pipelines at the same time. And this is even true for a single core today. So we have the different instructions and we're fetching them, but then we might have separate pipeline stages or multiple units to do the same the same kind of instruction stages, but maybe with different functionality. And multiple functional units would be something, okay, we have, say, for example, the arithmetical logical unit or multiple of those.
Starting point is 01:08:34 We have loading and storing units, and we have floating-point units. And this exists in every core today, right? So you have this instruction fetch system, but it might actually do multiple instructions at a time, decode them, have kind of a buffer there. You have the decode stages, and then you have multiple units, or you even have multiple decode stages or units, and then you have multiple units that execute different parts,
Starting point is 01:09:04 and some of them even replicated multiple times, even in a single core. So this already will give you some mitigation of especially these structural hazards, because basically we have multiple units at the same time. We can use multiple of them at the same time. And also we cannot use all of them typically at the same time, we can use multiple of them at the same time. And also, we cannot use all of them typically at the same time. So basically, there is some flexibility what we're using at which time.
Starting point is 01:09:34 So the CPU, or rather the core, will have a certain number of ports where we can execute multiple instructions, or multiple pipelines, in parallel. Out of the functional units, we can then choose which ones are used or needed right now. With this we actually get more than one instruction per cycle, because we have parallelism: not only pipeline parallelism, but real parallelism across these different kinds of units.
Starting point is 01:10:15 Pipeline parallelism is also real parallelism, but here we get additional superscalar parallelism on top. One thing the CPU does, because branches are so bad for us (we're flushing the pipeline, we have to wait for the result), is to try to guess which branch we will take: it predicts and then just speculatively executes that branch.
Starting point is 01:10:44 And this is branch prediction, something we already saw in the very first sort example. This is good if it works and bad if it doesn't, because then we have to go back, all of the speculative work was done in vain, and the CPU has to execute the other path instead. And the prediction needs to be made very early.
Starting point is 01:11:07 So there is usually some very simple buffer in the CPU, branch target buffers or branch target caches, that basically says: at this program counter, in this part of the program, what was the last branch that we took, or what was the prediction that we made?
Starting point is 01:11:37 And was that prediction correct or was it incorrect? This lookup is really done in parallel with the instruction fetch: while the branch instruction is being fetched, the CPU will look into the branch target buffer or branch target cache, make its prediction, and based on that already schedule the next instructions,
Starting point is 01:12:03 meaning the instructions for the branch that it predicts. If there's no entry for a program counter, it will just use an essentially arbitrary prediction; if there's an entry, it will follow that prediction. The inner workings are actually brand secrets, so the CPU manufacturers each have their own techniques. But we can see that we can easily trick this by using uniform distributions and randomness: then it won't work well, and it will basically be wrong half of the time, because it cannot do anything better there.
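Just to make the idea concrete, here is a toy, purely illustrative model of such a last-outcome predictor; the structure, names, and table size are assumptions, and real predictors are far more sophisticated (and, as said, not public):

```cpp
#include <array>
#include <cstdint>

// A toy 1-bit predictor: a small table indexed by the low bits of the
// program counter that simply remembers the last outcome of the branch
// that mapped to this slot.
struct OneBitPredictor {
    std::array<bool, 1024> lastTaken{};  // last observed outcome per slot

    bool predict(std::uint64_t pc) const {
        return lastTaken[pc % lastTaken.size()];      // guess: same as last time
    }
    void update(std::uint64_t pc, bool wasTaken) {
        lastTaken[pc % lastTaken.size()] = wasTaken;  // remember the new outcome
    }
};
// For a branch whose outcome is uniformly random 50/50, this scheme is wrong
// about half of the time, which is exactly the "easy to trick" case above.
```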
Starting point is 01:12:54 But we can do something about it, and the compiler can too. Rather than branching, we can say we're not going to branch at all: we turn this control flow, meaning "if the data is this, then do this; if the data is that, then do that", into a data flow where we can say: with this data, do this.
Starting point is 01:13:21 With that type of data, do that. So we're changing this: we're always going to operate, and then there's no branching. The CPU doesn't have to select which piece of code to use; it always knows, okay, this is the next thing I have to do. Say, for example, we have a selection query, similar to before: on lineitem, we're doing a count(*).
Starting point is 01:13:46 If we write a basic C++ or C program with a branch, then we would say: if the quantity is smaller than (or here, equal to) n, then we increase our counter. If we translate the query directly, this is a branching condition, and it would look something like the branched loop below.
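A minimal sketch of such a branched count; the names (quantity, threshold) are made up, and the actual code on the slide may use a different predicate:

```cpp
#include <cstddef>
#include <vector>

// Branched version: the if is compiled into a conditional jump that the
// branch predictor has to guess for every tuple.
int count_branched(const std::vector<int>& quantity, int threshold) {
    int count = 0;
    for (std::size_t i = 0; i < quantity.size(); ++i) {
        if (quantity[i] < threshold) {   // predicate; the slide may use == instead
            ++count;
        }
    }
    return count;
}
```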
Starting point is 01:14:19 The CPU will speculatively always try to find the right branch. If we have 0% selectivity, the branch prediction will be good; if we have 100% selectivity, it will be very good as well, because we're always taking the same branch. But if we have 50% selectivity and a uniform distribution, of course,
Starting point is 01:14:44 then the branch prediction will be bad, because about 50% of the time we're taking the wrong branch, since we're just guessing. So the performance will suffer. Another way of doing this is in a data-flow style: rather than writing an if statement, we can just say we're always doing an addition,
Starting point is 01:15:12 but we're adding the result of the logical comparison. This comparison just says: is this greater or smaller? It is not a branch or anything, it's just a logic operation, and the result will be true or false, i.e., 0 or 1 in C++. We can simply add this to our count, as sketched below, and we get the same result. But in this case, we'll always do the same thing.
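Continuing the same hedged sketch (same made-up names), the predicated, branch-free variant would look roughly like this:

```cpp
#include <cstddef>
#include <vector>

// Predicated version: the comparison result (0 or 1) is added directly,
// so there is no data-dependent branch for the predictor to guess.
int count_predicated(const std::vector<int>& quantity, int threshold) {
    int count = 0;
    for (std::size_t i = 0; i < quantity.size(); ++i) {
        count += (quantity[i] < threshold);   // bool converts to 0 or 1
    }
    return count;
}
```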
Starting point is 01:15:43 It doesn't matter what the selectivity is, because we'll always do this addition; we'll even do it when it's not necessary, because then we're just adding zero. And this means the performance looks something like this: in the cases where the branch prediction would be better, say for example where we have
Starting point is 01:16:06 very little selectivity, the branched code is faster: with very little selectivity we almost never have to execute the increment, and the branch prediction will almost always be correct. In the cases where we would get a lot of wrong branch predictions, the predicated code is faster, because it never suffers any mispredictions; we're always just doing our simple addition, which is cheaper than a wrong branch prediction. A wrong branch prediction costs multiple cycles.
Starting point is 01:16:44 One addition is basically a single extra cycle. And of course, the hardware tries to help here as well; the CPU will also try to figure this out while it's running. We can apply the same idea to conjunctive predicates.
Starting point is 01:17:18 say we have multiple predicates in a conjunctive query and we would do this with this double end. This will translate into many branches typically. So if we're doing this kind of code, this will be translated into this. Alternatively, we can do a bitwise end, and that will be translated into, well, would be less branches basically would just be one branching conditions but we could also do this in this again
Starting point is 01:17:56 predicated range by basically just adding or doing the same kind of selectivity or doing the same kind of resultivity or doing the same kind of result by just doing the operation on each individual tuple rather than doing an if statement or a branching statement here. And well, if we're using a query compilation, so if we're generating code while we're executing the queries or when we already have some statistics
Starting point is 01:18:28 And if we're using query compilation, so if we're generating code while we're executing the queries, or when we already have some statistics, then the generated code can actually use a cost model to find out which variant makes sense. We can see this in results from a paper by Ross: if we use the branch-free version, as we would expect, we get a very clean flat line, because we always do the same amount of work; we look at each individual tuple and perform the operation. If we have a very low selectivity,
Starting point is 01:19:01 then this bitwise AND will basically be better. And, let me see, the regular logical AND will basically be lower, because then the branch prediction will be good.
Starting point is 01:19:22 We're doing less work; we don't have to touch every tuple, so the CPU will just do less work. The bitwise AND might be a bit better in some cases, especially where the regular logical AND gets slower, but it will also have more overhead than the predicated version at higher selectivities. And let me see, I actually
Starting point is 01:19:54 have an example for this somewhere don't remember where I wanted to put this. I have it later. So let me show this later. Okay. Also, then the CPU, it can do some rescheduling and can do some out of order. And this is, it does so by having some reservation stations, right? So the instruction stream is basically read from the instruction cache. So it will be loaded into the instruction cache. And all of the instructions that are going to be executed one by one later on will then go into reservation stations for the proper units. So all of them that need to be doing some memory access
Starting point is 01:20:47 will go to the memory units that go to the floating point units or that do some floating point operations go to the floating point units, et cetera. And they will then be executed when the hazards are cleared. And as soon as we have data hazards, then basically in this kind of count operation, then we basically also have a limit of parallel execution. So this is a simple example.
Starting point is 01:21:28 So here again, we're basically going through an array and we're basically going through we're doing the predicated version of our account. Right. So we're going to execute this on every step, but there's not that much way, or we cannot use this very parallel. However, what we could do is we can add additional parallelism by splitting the area up into, or the cursor up into two cursors, and then running through the area on two different parts in parallel.
Starting point is 01:22:10 And that gives us some additional parallelism here because we directly say that this is independent independent parts of the data. It won't be basically touched by anything else. So we'll have basically half the amount of loop steps with twice the instructions. And these instructions can basically then be reordered and can be put into different kind of functional units as well. So the CPU, by giving the CPU this,
Starting point is 01:22:43 the CPU might use a bit more parallelism. In practice, I might show you later, or let's see, either next time or this time, I will show you, in practice, the CPU or the compiler does this automatically for you, this kind of optimization. But if the compiler doesn't do it, this will actually give you some performance benefit. And then the CPU will basically reorder the instruction in the reservation stations and will give you a bit better performance. So say, for example, we had our earlier example where we have the initial branched version, then we have the predicated version that's somewhere here,
Starting point is 01:23:28 and this might give us a bit better performance because of this additional reordering. And this is something we'll briefly try, and then we're gonna stop for today. So I'll show you, it's basically the same code here. So I'm starting by filling an array with a certain number of values with random data and recounting basically the number of instances that only have less than half of the int size. So a random generator gives us from zero to max int size and we're counting
Starting point is 01:24:07 the ones that are less than that. And initially we have this simple branch here and I'm doing an initial count just to warm up the cache but this doesn't actually help that much but theoretically you could then we're doing this regular branched version then we're doing a predicated version here and finally we're doing a final version with these two individual counters or where we then basically sum up this afterwards and now I mean you can already see the result here. We can see that the branched version is actually much slower than the predicated version. And the two predicated versions, they're almost the same. So there's not that much difference here. And, but this is only, and this is basically just, I mean, you can see the effect however the compiler does this for you as
Starting point is 01:25:06 well if you're using optimization parameters so here basically have the same code i'm going to compile this again with optimization flags and i'm going to run the same code here and you can see well results are identical in this case and And this is because the compiler actually, rather than leaving this part as a single, so it will actually do some predication here. And it will also unroll the loop. So it will basically make multiple instances of this operation.
Starting point is 01:25:45 And then, just as in the case where we ourselves create multiple instances of this statement, the CPU can reorder them as it pleases and use this additional parallelism. How you can see this, I will show you next time. For today, I would say we're done here. Do we have any questions? Yes.
Starting point is 01:26:19 So, about the misprediction from the branch prediction: does the CPU maybe need to keep a copy of all the registers, or of the data, at the same time, in case it went with the wrong prediction? So that it has a copy of all the registers? If it makes the wrong prediction, yes, I think so. It will basically restore whatever state it had before and then start with the other branch. And this is why this takes so long, right? On the one hand, it has already done more execution up to that point, and that
Starting point is 01:27:03 takes time, but then it sees the branch condition is wrong. So then it needs to go back and start basically from scratch where it was. It needs the same state again. Yes. OK. Other questions? No other questions?
Starting point is 01:27:20 OK. Next time, we're briefly going to talk about prefetching and then finally start with vectorized execution with SIMD. Thank you very much.
