Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Instruction Execution

Episode Date: April 24, 2024

...

Transcript
Starting point is 00:00:00 Okay, so let's get started. We're going to start with the instruction execution today, but I'm just using this for two announcements. One that's important for you is there's still this survey on who wants to participate in the programming exercises, and it closes today, because we need your information in order to set up all of the environment. So if you have not done so and you want to participate in the course, please do this now. Maybe we should also do an announcement in Moodle, so people who are not here today, and there's probably a couple, will also see this online. And then we have our joint seminar with TU Darmstadt this week again. So today at 4.30, if you're curious, Julian Eisenschloss from Google DeepMind will talk about visual language,
Starting point is 00:01:02 how generating drives understanding. So there's a Zoom link you can join. This is online. The Zoom link is also in the slides that you can find in Moodle. With this, I'm going to switch to the other slides and go back to where we left off. So we saw this as a teaser. We saw that database workloads, on average, or example database workloads, perform much worse than average programs on typical CPUs.
Starting point is 00:01:44 So compared to these benchmark programs, they don't necessarily perform worse overall, but they have worse cache misses, both instruction and data cache misses. So let's check or let's see why that is, right? So how does this happen? I mean, technically a database knows very well how the data is accessed. So caching should actually work quite well because it's all about data movement, etc.
Starting point is 00:02:19 So if we know exactly how we're accessing this, we should also be able to optimize for this. So let's see why this is actually a problem. So why do database systems show such poor cache behavior? And there are two things. On the one hand, you have poor code locality. Traditionally, and I've talked about this quite a bit in the first sessions, database systems were built for the bottleneck of disk.
Starting point is 00:02:50 Because this was the major bottleneck, caches, et cetera, were not really a problem, so you wouldn't have to be efficient there. So it was somewhat over-engineered in the sense that you have polymorphic functions. Say, for example, you're resolving the data types or attribute types for each processed tuple individually, every time. And that means you have lots and lots of code flying around in your active working set. And then you have this volcano iterator model,
Starting point is 00:03:27 which allows for pipelining. And that means you have the whole query plan, and each individual tuple will be passed through the whole query plan, essentially. That means for each individual tuple, you will go through a lot of code. I mean, queries can potentially be large, and each of these individual operators is a polymorphic function, so you will walk through
Starting point is 00:03:52 quite a bit of code for each individual tuple. That means the code will basically not fit into your instruction cache, and we will get cache misses for the code. And the same is true for data, right? So database systems have to walk through large amounts of data quickly, which in general is not great for caches, right? So if you're just scanning through large amounts of data, well, that's not cache friendly in the first place. But on top of that, often we're not even using the data that we're loading efficiently.
Starting point is 00:04:35 So if we're just interested in small parts of the data, but we're loading all of the data into the cache, well, then we're polluting our cache with data that's not really relevant. And we're going to be slow again, right? We're going to get a lot of cache misses. And at the same time, walking through something like a tree means randomly jumping around. I've already mentioned this, right? So this is also how we can actually check cache performance
Starting point is 00:05:05 by just doing random jumps in memory. Like, we will always go out of the cache with many of our accesses. So we have to think about how we can maybe do this in a more cache-efficient way if we know that most of our data will be in memory somehow. Okay, so how can you write cache-friendly code? We'll talk more about the data in a bit,
Starting point is 00:05:30 but first let's think about the code. And even for your code, you can optimize, right? So you can make your code more cache-friendly. I mean, you can optimize how data structures are organized. We will see this in a couple of sessions where we have a specific lecture on data structures and not only how they're organized in terms of cache lines, et cetera,
Starting point is 00:05:57 but also how they are accessed, right? So are you gonna access them in a way that you have code locality and cache locality for the data? And then in general, I mean, all systems will favor cache-friendly code. Of course, if the cache is not your bottleneck, maybe you don't want to go all in and optimize for cache-friendliness, because it's not going to help you much if, anyway, each individual interaction will go through the network or something like this,
Starting point is 00:06:32 but optimizing to some degree, or as soon as the caches are your bottleneck, then it actually makes sense to optimize there or check a bit more. But the tricky part about this is that the caches are very different on different systems. So these kinds of optimizations are very platform dependent. So we've already seen that the processor that I have in my machine is very different in terms of caches compared to an x86 server. And even within the different generations of servers, the caches will look different. So you basically also have to think about, if you want to write cache-friendly code,
Starting point is 00:07:19 will this just run on one system? Or, if you're benchmarking this, will this just run on one system? In a production environment you probably want to be able to also port this to something else, to a newer version of your chip, maybe to a different architecture, right? Right now we see a huge shift from x86 to ARM. Then you will have to deal with this somehow. And well, that's basically the third point here. If you can make this generic, then you can still work with performant code,
Starting point is 00:07:57 but it's still portable. So, generic code that is cache friendly means that you have a reasonably small working set, so you're not working with very large amounts of data in a small loop or in small code. Use small strides, we already said that: if you have an array, you're not jumping around in the array everywhere, but you stay local within your data structure, then you get cache friendliness. And you focus on the inner loops, right? So whatever,
Starting point is 00:08:35 if you have nested loops, the inner loop is the one that will be executed most frequently, or is temporally local, and that needs to be cache friendly, right? So this is basically where the time will go. If this inner loop is not cache friendly, then you will have constant cache misses, essentially. Of course, at the same time, I mean, in our programming exercises that's maybe not true,
Starting point is 00:09:02 but in general, you don't want to optimize too prematurely because, I mean, you also have to check what is actually the hotspot. I mean, you can optimize your code away in terms of cache performance, but if you have another bottleneck, it's not going to help you much, right? So you always should check what is actually your bottleneck.
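To make the stride and inner-loop point from a moment ago a bit more concrete, here is a minimal sketch (the matrix size is made up, and the matrix is assumed to be stored row-major in one contiguous vector): both functions compute the same sum, they only differ in the stride of the inner loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: a kRows x kCols row-major matrix in one contiguous vector.
constexpr std::size_t kRows = 4096;
constexpr std::size_t kCols = 4096;

// Cache-friendly: the inner loop walks consecutive elements (stride 1),
// so neighboring accesses share the same cache line.
uint64_t sum_row_major(const std::vector<uint32_t>& m) {
    uint64_t sum = 0;
    for (std::size_t r = 0; r < kRows; ++r)
        for (std::size_t c = 0; c < kCols; ++c)
            sum += m[r * kCols + c];
    return sum;
}

// Cache-unfriendly: the inner loop jumps kCols elements ahead every time,
// so almost every access touches a different cache line.
uint64_t sum_column_major(const std::vector<uint32_t>& m) {
    uint64_t sum = 0;
    for (std::size_t c = 0; c < kCols; ++c)
        for (std::size_t r = 0; r < kRows; ++r)
            sum += m[r * kCols + c];
    return sum;
}
```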
Starting point is 00:09:23 So, am I already utilizing my other system resources properly? Then it makes sense to actually go there. And we'll have a profiling session later in this course, where you will get some insights into how to look for these hotspots and how to then optimize for them, with some examples. Okay, so with this, let's look a bit more into this data layout idea, right? This is not only true for cache friendliness,
Starting point is 00:09:57 this is also true in general, if I'm accessing data across large amounts of disk space, etc. Whenever I have some kind of blocked access, or my random access is more expensive than my sequential access somehow, so the different accesses have, let's say, different performance, then it makes sense to think about data layout. How are we storing the data?
Starting point is 00:10:26 How are we accessing the data, et cetera? And well, so the question is, how can we improve data cache usage? But again, as I said, this is also true for how we can improve disk usage, essentially. So you can just do the same thing; it's just a larger scale. It depends on whether your data will end up on disk or not. And so essentially we have to think about different data storage models and query execution models, and we have to think about temporal and spatial locality. So can we somehow get temporal and spatial locality
Starting point is 00:11:06 in our data accesses? And then based on that, or using different kinds of execution models, how we actually execute our queries, we can also improve this to some extent. So let's look at a simple example. So we're counting the number of line items. So this is line item. If you ever see line item in an academic setting,
Starting point is 00:11:33 you know this is the TPC-H. So this is basically one of the benchmarks. You probably heard briefly about this. So one of the standard benchmarks for OLAP processing or OLAP databases. So we are counting the line items that were shipped on a certain date. And typically this requires a full table scan, unless we have some kind of partitioning already
Starting point is 00:12:02 or we have some kind of index already, but let's leave that aside. So no helper structures. Let's go through the full table. And now the question is, how much data will we have to access? So the line item table will be the largest. So if we have scale factor one, this will be 700 megabytes. And I mean, this is of course not like a scale
Starting point is 00:12:25 that we would typically be using; it might be much larger. And now, just for storing the data, we have different options to do so, right? So the first thing that we think about, and that we would naturally always store the data in, is the row layout. So basically we have these line item rows or tuples and we're just going to store them one
Starting point is 00:12:52 by one, one after the other. And that's how most databases did it for a long time. And that's actually quite efficient if you want to write a lot of data because you get the tuples in a row layout and then you just write them. An alternative is the column layout. So rather than storing row by row, so each individual tuple, we can also store attribute by attribute. So that's a different way of storing. And that basically means that similar attributes
Starting point is 00:13:22 will be closely located or will have spatial locality. So, line item is an OLAP table, meaning it has a lot of attributes. I don't remember exactly, it's like 20 or so attributes, so it's not super large, but it's already large; in practice you will see tables that have a hundred attributes, a thousand attributes, in larger scale systems, right? And that actually makes a huge difference in terms of how data is laid out, right? So, either I have all of the attributes of the same type next to each other in memory, or I have the individual rows next to each other.
Starting point is 00:14:03 So, if I need the full row for my data processing, of course it makes sense to have this row layout. If I only need a single attribute, of course it makes sense to have this column layout. Okay, so now if we have like a full table scan and we have a row store and we're doing what we saw just earlier, right? So this means we have our tuples in memory laid out,
Starting point is 00:14:28 and the ship date is somewhere in the middle in this tuple. So, it's a very tiny fraction of the tuple, and this is all we're actually caring about, right? So, this is like our cache block boundaries, and we can see, well, we're going to load not all of the cache blocks, so we're not completely inefficient by loading the full rows, but we have to load a lot of data that we actually don't need. With every access to the ship date field, we load a lot of data that's not relevant.
Starting point is 00:15:04 Let me see if I get this here. So basically, if we load this ship date, for example, we have to load the full cache line, and everything else in it is irrelevant for us. If we have a column store or a columnar layout, that means basically all of the ship dates will be close to each other. I mean, here I'm putting them at the beginning of the table. Of course, this will not be the first attribute, because we already saw there are other attributes before, but with a certain offset this will look something like this. So they're tightly packed, and then
Starting point is 00:15:40 if we're loading all these ship dates, so it's the same amount as on the previous slide, then all the data that we're loading into our cache is actually relevant for the query, right? So this means we're much more efficient in loading the data. We're not gonna pollute our cache with additional data. We have a lot fewer accesses to memory. Right? So this means we also load less data, so this will probably be much more efficient.
Starting point is 00:16:13 And on top of that, right, so we know the memory access, that's actually the slow part. And this is where our CPU will wait until something is happening, because of the comparisons, et cetera; I mean, the comparisons might be wrong branches, et cetera, but still, this part is actually where we spend most of the time, just going to memory and bringing it back. But here, we're loading a lot of relevant data at the same time, and we can do a couple of loops
Starting point is 00:16:44 on these cache lines because they are all relevant, right? So this means we'll hopefully amortize the cost of the fetch over more tuples, right? And if we're lucky, I mean, if we're very lucky, the whole table, or at least the whole attribute, might fit into the cache, right? And then everything will be very fast.
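As a hedged sketch of the row-versus-column idea in code (the field names and sizes are simplified stand-ins, not the actual TPC-H lineitem schema): in the row layout every ship-date access drags a whole wide tuple through the cache, while in the column layout the ship dates are densely packed, roughly 16 four-byte values per 64-byte cache line.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row layout (array of structs): each tuple is wide, the ship date is only a
// small part of it, but the scan still pulls whole cache lines of tuple data.
struct LineitemRow {
    int64_t orderkey;
    int64_t partkey;
    double  quantity;
    double  extendedprice;
    int32_t shipdate;       // the only field the count query needs
    char    comment[44];    // illustrative payload, not the real schema
};

std::size_t count_row_store(const std::vector<LineitemRow>& rows, int32_t date) {
    std::size_t count = 0;
    for (const auto& r : rows)
        if (r.shipdate == date)   // 4 useful bytes out of an ~80-byte tuple
            ++count;
    return count;
}

// Column layout (struct of arrays): each attribute is stored contiguously,
// so the scan over shipdate only reads ship dates.
struct LineitemColumns {
    std::vector<int64_t> orderkey;
    std::vector<double>  extendedprice;
    std::vector<int32_t> shipdate;
    // ... remaining attributes
};

std::size_t count_column_store(const LineitemColumns& cols, int32_t date) {
    std::size_t count = 0;
    for (int32_t d : cols.shipdate)
        if (d == date)
            ++count;
    return count;
}
```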
Starting point is 00:17:08 So of course, this is great. So now you're saying, oh, why don't we do column store all the time? So why would anybody ever do row store anymore? And well, there's a trade-off of course, right? So it's not the end of the story because many queries don't just use one single attribute. But in the end, they might still use the whole tuple,
Starting point is 00:17:31 or at least parts of the whole tuple. And that's additional cost. It's not only just fetching this, but we have to recombine this somehow later again. That means we have to perform these kinds of joins. And this means it's a workload-dependent trade-off. So whenever I'm just writing the data, the column store will be much more expensive because I have to decompose it. Whenever I'm writing the data and I'm
Starting point is 00:17:58 always going to retrieve the full tuple, always going to work with the full tuple, then the column store will just kill me, because I basically just do overhead work, right? So I'm decomposing the data first, re-layouting it in memory, and then I have to recombine everything later on again. So it's a trade-off. And because of that, there are also hybrid approaches. I mean, well, who would have guessed, right? And one idea is PAX, the Partition Attributes Across layout.
Starting point is 00:18:33 And that basically means we're doing this column store idea, but we're just doing it on individual memory pages, right? So we're not doing it for the whole table, but we're basically splitting this up into mini pages and doing this for a partition of the data. So that means we're again a bit more flexible in how we lay out the data. We don't have to manage the complete column at once. All our changes will be more local, and the recombination cost will still go down a bit.
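Just as an illustration of the PAX idea, here is a rough sketch of what such a page could look like: one fixed-size page as the unit of storage, with one column-wise minipage per attribute inside it. The two-attribute schema and the capacity are invented for the example; real PAX pages also carry offsets, null handling, and variable-length data.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical PAX-style page for a two-attribute table (id, shipdate).
constexpr std::size_t kTuplesPerPage = 1024;  // invented capacity

struct PaxPage {
    uint16_t tuple_count = 0;
    // One minipage per attribute, each densely packed inside the same page.
    int64_t  id_minipage[kTuplesPerPage];
    int32_t  shipdate_minipage[kTuplesPerPage];

    // A scan over shipdate only touches the shipdate minipage of each page.
    std::size_t count_shipdate(int32_t date) const {
        std::size_t count = 0;
        for (uint16_t i = 0; i < tuple_count; ++i)
            if (shipdate_minipage[i] == date)
                ++count;
        return count;
    }

    // Reconstructing a full tuple stays page-local: same index in every minipage.
    void get_tuple(uint16_t i, int64_t& id, int32_t& shipdate) const {
        id       = id_minipage[i];
        shipdate = shipdate_minipage[i];
    }
};
```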
Starting point is 00:19:13 And of course, there are also hybrid storage models in a temporal sense, in terms of how we insert data. So rather than saying we're always gonna use the column store, we could just say, well, for the writing part, if we get new data, that typically comes in a row fashion. So let's keep it at that, right? So let's keep it in a row fashion.
Starting point is 00:19:38 Let's keep the working set there reasonably small, so we don't get into this trouble of having large amounts of data in a row-oriented fashion although we might want to access it in a column fashion. And if we still have some updates on this data, that's also faster than having to somehow realign the columns individually. And as soon as the data is old enough,
Starting point is 00:20:04 or our buffer for this row-oriented storage is full, then we will move it to the column store. So this is basically what many in-memory database systems do these days. Systems like HANA, for example, have a delta store, which holds basically whatever comes in new, and then they have a cold store, or a column store, for the older data that's still accessed frequently, but where we have mostly these analytical workloads, where then we can basically do faster scans
Starting point is 00:20:41 through the individual columns. Okay, questions so far? Right, so makes sense. I mean the tricky part is basically when to do what, right? So this is basically, it's really dependent on your workload. So if you're just reading through the data, if you're just scanning parts of the tuples, like individual attributes, then for sure the column store makes a lot of sense. If you're just updating, like you're getting new tuples, you're updating these tuples,
Starting point is 00:21:19 then the row store makes more sense, because you don't have as much additional maintenance work within these individual tuples, right? In the hybrid layout, if your updates stay local, you basically don't have the same amount of reorganization. So this PAX approach, for example, makes sense in these cases where you have some updates,
Starting point is 00:21:44 some data stays as it is, or most of the accesses are still column oriented, then that basically makes sense. Is this a question? Yes. Yeah, that's maybe not a smart question, but why do we even bother putting data in the cache in this case if we're just doing the count operation,
Starting point is 00:22:03 because I don't expect to re-access the table values. Right, so the question is why we put the data into the cache in the first place, and it's not a bad question at all. So the thing is, basically all data will always go into the cache, right? We cannot access the data otherwise. So, let's go back here. I cannot access this individual value, let's say it's four bytes. These individual four bytes, I cannot access on their own. I'm always going to read 64 bytes into my caches. This is my cache line, and this is how accessing the memory always works. And so basically I will always have this overhead; this will always somehow pollute my cache. And with this layout I don't have this problem anymore.
Starting point is 00:22:58 Just to reiterate, technically in this case we could just overwrite the same cache line? Yes. So the question is, can we overwrite the same cache line? Yes, in this case we can. I mean, if we're not going to reuse the data, there are also some commands or some hints to the processor and to the compiler where we say, well, we need this data, so maybe you also want to prefetch this,
Starting point is 00:23:26 but we're only going to use it once. So then basically, you don't need to cache it anymore; it can be flushed out of the cache quickly. In general, even if we're just doing scans, we're still going to hope that eventually we might have some data that we will reuse. And in the instruction cache, of course, we're going to reuse something, and maybe some other values that we're going to reuse in the cache, like some variables that we will use again later on. So things like that. So that basically means we will always use the cache somehow.
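To make the "hint to the processor and to the compiler" remark a bit more concrete: GCC and Clang expose `__builtin_prefetch`, whose third argument is a temporal-locality hint, where 0 means the data doesn't need to stay in the cache after the access. A minimal, hedged sketch of using it in a scan follows; the prefetch distance of 16 elements is a guess, not a tuned value, and whether this helps at all depends on the hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Scan with a prefetch hint: locality hint 0 tells the hardware we will use
// the data once and it need not be kept in the cache afterwards.
std::size_t count_matches(const int32_t* shipdate, std::size_t n, int32_t date) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&shipdate[i + 16], /*rw=*/0, /*locality=*/0);
        if (shipdate[i] == date)
            ++count;
    }
    return count;
}
```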
Starting point is 00:24:07 For the scans, the cache doesn't help us that much, but, let's say, the access granularity makes a difference here. Make sense? OK. If there are more questions, feel free to speak up. Okay. So, we saw this trade-off.
Starting point is 00:24:38 Now, as a last topic, sort of along the same lines, right? So, basically, we can, with the data layout, we can make stuff more cache efficient. Now, there's an additional thing where we can make our layout more cache efficient. But it's not so much data layout. It's more data structure layout. And that's alignment. So alignment or memory alignment is basically the idea that your variables or your data starts at an address that is aligned,
Starting point is 00:25:16 which basically means that if you're starting at address 0, basically everything is always aligned. And then you could say, I'm starting at address 1, for example. Then this is only aligned if your data has size 1. So if your data is 1 byte, then starting at address 1 would basically make sense. If we're starting at address two, this would be aligned for values that have size two.
Starting point is 00:25:50 If we're starting at four, so from zero to four, then this would basically be okay for values that have size four. So that's basically the definition of alignment. You can think of it like this: if I have a data structure, it's going to start at a cache line somewhere. If I'm starting at some odd offset in there, then my access will not be aligned, and the CPU basically has to remap this somehow so it fits into the registers.
Starting point is 00:26:28 And that's costly. So you basically need word- and cache-aligned attributes for cache efficiency, otherwise the CPU needs to do extra work. In general, today the CPU will cover this up for you. The compiler and the CPU will basically make sure that your code still runs. In the past, your data, like your variables, et cetera, would have to be aligned.
Starting point is 00:26:58 Otherwise, freaky things would happen, because the way the stuff is laid out in the registers, etc., would somehow go wild. Now, of course, this is somehow padded, but if it's not aligned, there is extra work that needs to be done. And the compiler typically will pad this, because then the alignment is better and it fits the way the CPU works better. So let's look at this, right? So we have an integer array, we have a long, and we have some kind of structure or struct,
Starting point is 00:27:37 which is essentially a pointer. And each of those will be, like the pointer will be 8-bit... I'm not saying anything wrong, help me here. It's 8 bytes. So we're talking about bytes here, sorry about this. And the int array, of course, will be 16.
Starting point is 00:28:00 So the four-integer array will be 16 bytes, 4 bytes each, essentially, and the structure in the end will end up in memory, right? And the compiler cannot change the order. So the compiler basically needs to represent this in the same way, and this is nicely aligned here, right? So we can see this starts at 0, and then each individual item basically ends at nice boundaries, such that the individual items are aligned and that
Starting point is 00:28:38 the whole thing basically fits nicely into the memory. The way the struct is laid out in memory will always be done in the way you define it, right? So if you define it this way, in this order, it will also be laid out in this order, even if another order would be better, right? The compiler will not change this, just for safety reasons, etc. The compiler will also determine the overall size
Starting point is 00:29:13 and the position of fields such that it is aligned. And so that's basically what it does. And for good system performance, Intel and other processor vendors say that the data should always be aligned. Memory is accessed in these word chunks, and so whenever you're somehow overlapping in between words, or weirdly offset within there, then your accesses will be slower. There needs to be extra work for the CPU.
Starting point is 00:29:57 So essentially also around the cache line boundaries, if it's not well aligned, then we need to load extra cache lines, we need to remap this somehow so that it fits in the registers, etc. The hardware will work correctly, but it will be slower. It will basically do some extra management. And the alignment, I basically tried to somehow explain this already. I always find it a bit hard to explain. So essentially, if you have a single byte, there's no restrictions. If you have two bytes, then your address, basically, like the lowest bit, needs to be zero.
Starting point is 00:30:38 So it's basically like you cannot start at an odd byte address, so you cannot be shifted by one. If you have four bytes, you cannot be shifted by one, two, or three; it needs to be at zero or at four, right, and so on. So basically, the larger your data, the more space you need in between. I mean, if you only have elements of the same data type, then alignment is easy, right? It's just gonna be aligned naturally.
Starting point is 00:31:14 But if you're mixing this in a structure or in a struct, then you have these problems, then you can think about this, right? So, and here we have an example. So we saw this struct earlier or something similar. And here we're starting with the character. Then we have an integer array and then a double. And here you can see we're starting with the single byte.
Starting point is 00:31:42 Then we have these two integers. And there, all of a sudden, they're not aligned anymore, because they're not starting at 0 or at 2 or at 4, but at 1, which is not aligned. So the compiler, in order to make this efficient, will actually shift this. The compiler will insert some padding in order to make this faster,
Starting point is 00:32:06 if you give it optimization flags. You can still pack it, but then, if it's not padded, the CPU will have to do extra work in order to have this alignment later on, in order to be able to properly work with the data. And so essentially, this padded version here will have better performance than this one. Even though we're using more space, like we'll have to load more data or our caches are filled up more quickly, this will be faster, just because otherwise there is much more overhead that the system has to do. And additionally, the other thing that you can do is basically figure out what the proper
Starting point is 00:32:59 alignment is and just flip the data around, or change the struct in a way that it actually is aligned, so that there is no padding needed. If it's possible, right? So if your struct makes sense in that way, then you will have the best performance, essentially, because you don't have the padding, you don't have to realign something, or the CPU doesn't
Starting point is 00:33:22 or the CPU doesn't Have to do any alignment. Okay. So just basically as like what this is from the compiler point Of view. But first there's a question. Yes? Sorry, say again? So that's basically a question. Why doesn't the compiler reorder the fields in the struct automatically? And the rule of thumb, does it work to always start with the biggest variable, start with the floats, and the doubles, and the integers, and the characters, and last? So the question is, does it always help to work, start with the largest variables?
Starting point is 00:34:22 I mean, most likely it will mostly work, but in the end it's something like a packing problem, right? I'm not saying bin packing, but it might even be a bin packing problem if you have very large structs. So in general I would assume that this probably helps, but you can also know how large your individual data items will be and then basically make sure that you have the alignment. What makes sure that my struct itself is not already starting at an offset? So the question is, how do I make sure that the struct is not starting at an offset?
Starting point is 00:35:13 I think this will always be handled. The compiler will make sure that this will start at a proper address, like a zero address or something aligned. Okay. More questions? Okay, so let's look at what the compiler basically does. So the compiler always maintains the ordering that you defined, so it cannot change that.
Starting point is 00:35:41 And each field should be aligned within the struct. So, I mean, that's basically what the compiler also does. The CPU will also work without that, but it will be much slower. And you can use offsetof in order to get the actual field offset. So, you can basically check what the offset of an individual item in the struct is. The overall struct will be aligned according to the largest field.
Starting point is 00:36:15 So basically making sure that we have alignment there. And the total struct size must then be a multiple of its alignment, so there we also might see some additional padding in order to have an aligned struct. So this is basically your question: the compiler will add some padding in order to align the struct internally. And then variable-length data will be split and has a fixed-size header, which is basically
Starting point is 00:36:54 the pointer to the variable-length data, and then a variable tail that the pointer refers to. And any kind of variable-length data will be placed at the end of the struct, so there we have some flexibility. And in order to save space, since, as we said, the compiler needs to respect the order of the elements as
Starting point is 00:37:19 they are declared. So we can make sure that we're ordering the data in a way that it actually fits the alignment, or that the alignment works better, right? So that the alignment is more efficient. Say we're starting with a character and then have an int; then we will have some padding in there, so the compiler will add some padding, and then at the end the other structure, and some additional padding. If we're reorganizing this, so in essence putting the largest element first, then we get better padding.
Starting point is 00:37:59 We get a better alignment and less padding. So all in all, this will be more cache efficient. And this is an experiment that I didn't try. So I'm somewhat surprised about the numbers and eventually maybe we'll try it out, right? So then I'll update you on this. But apparently if you don't use alignment, you get actually really bad throughput.
Starting point is 00:38:22 So this is, I took this out of a colleague's slide deck. They made this experiment. And so reordering and padding can actually make a significant impact on how fast you can access the data. Again, I'm somewhat surprised about these numbers. We might want to check this eventually. And then we'll give you some small code to try it out yourself. Okay, good. Questions so far? No questions? Then let me switch gears into instruction execution.
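Before moving on, here is a minimal sketch of the padding and field-ordering effect discussed above. The struct is invented (it is not the exact one from the slides), and the sizes assume a typical 64-bit ABI where int is 4 bytes and double is 8 bytes; on your platform the printed numbers may differ.

```cpp
#include <cstddef>   // offsetof
#include <cstdio>

// Bad ordering: the char forces 7 bytes of padding so the double can start
// at an 8-byte boundary, plus tail padding to make the size a multiple of 8.
struct BadOrder {
    char   c;   // offset 0, then 7 bytes padding
    double d;   // offset 8
    int    i;   // offset 16, then 4 bytes tail padding
};              // sizeof == 24, alignof == 8 on common 64-bit ABIs

// Largest field first: no internal padding, only 3 bytes of tail padding.
struct GoodOrder {
    double d;   // offset 0
    int    i;   // offset 8
    char   c;   // offset 12, then 3 bytes tail padding
};              // sizeof == 16, alignof == 8 on common 64-bit ABIs

int main() {
    std::printf("BadOrder : size=%zu align=%zu d@%zu i@%zu\n",
                sizeof(BadOrder), alignof(BadOrder),
                offsetof(BadOrder, d), offsetof(BadOrder, i));
    std::printf("GoodOrder: size=%zu align=%zu i@%zu c@%zu\n",
                sizeof(GoodOrder), alignof(GoodOrder),
                offsetof(GoodOrder, i), offsetof(GoodOrder, c));
    return 0;
}
```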
Starting point is 00:39:08 Okay, so I'm going to start this quickly. Let me see how long. Start this quickly before we do a break. And, yeah. Okay, so in the rest of the lecture, we'll see how far I get. But we're going to talk about instruction execution and different kinds of hazards for instruction execution,
Starting point is 00:39:36 for instruction pipelines. And so in general, if you think about instructions, so if you think about your assembly, et cetera, this assembly is not what is actually executed. So at least that's not the end of the story. So the CPU does not just get an individual assembly instruction and then execute this magically and continue with its life. Right, so but actually this is further broken down. This is broken down, each operation
Starting point is 00:40:11 is broken down into micro-operations, and these are then executed in pipelines. And these micro-operations, so in this example it's four micro-operations. In practice, these pipelines might be six or seven or eight steps long. So, individual micro-operations that contribute to a single higher-level, but still kind of assembly-level, instruction. And from an abstract point of view, this is basically all you need to know. Some of these might be further broken down. You might have different steps depending on the type of instruction you see.
Starting point is 00:40:53 But essentially, initially, we're starting with fetching the instruction. So whatever needs to be done next, we're getting the instruction, and we're updating the program counter. The program counter basically tells us where we are in the program, so let's get the next instruction in the program and decode it, or, first, actually get this instruction, right? So this will be in a buffer, we will read it
Starting point is 00:41:21 and increase the program counter. And then we decode the instructions, so we check what this actually is, what we need to do, and we're fetching the registers. So we read the registers in order to do the actual operation. This will also evaluate any kind of branch conditions, right? And this is also where branch prediction comes in, right? And then we have the actual execution cycle. So this is basically where the calculation is done. So if we have a simple math operation, then the ALU, the arithmetic logic unit, will start doing its work: do some addition, do some subtraction, do some logical operation like AND and OR and stuff like this.
Starting point is 00:42:17 Or, if this is a load or store operation, which might also be the case, then it will basically do the read and write from memory. And this, of course, will take a bit longer again. And then we have the write-back cycle. So it's not the write-through, it's the write-back cycle. So this is where basically the data, the result, will be put into the register. This could also be from the memory subsystem. So we basically have the memory load
Starting point is 00:42:50 and here then in the memory load, like in the write back phase, we'll write whatever we loaded from memory into the register. Or it's the result that comes from the ALU, which will then end up in the register as well. And if you ever see front-end stalls, back-end stalls, etc., this is basically where this comes from. So the instruction fetch and instruction decode and register fetch cycle,
Starting point is 00:43:20 that's front-end, and the execution and write-back cycle, that's back-end, right? So if we have a stall in the execution, then we don't have enough operational units; if we're front-end bound, we're not loading or getting the instructions fast enough. And now this is a small pipeline, right? This is multiple instructions, or multiple micro-instructions, that we're doing one after the other. And this is done with the CPU cycles, right? So basically one CPU cycle, one micro-instruction. So, first cycle, we're getting the instruction fetch, then second cycle, we're decoding the instruction,
Starting point is 00:44:11 we're executing the instruction, we're writing back the results. So basically, in my example here, four cycles to do an individual instruction. Now, if we're just doing this non-pipelined, or one after the other, then this means the next instruction will basically start at time four if we're starting at zero, and we're actually going to be slow, right? So we're basically waiting, and different kinds of
Starting point is 00:44:39 units on the core won't have anything to do while we're waiting. And we need multiple clock cycles for a single instruction. We always need multiple clock cycles for a single instruction, but we can parallelize here. We can use pipeline parallelism. But still, I mean, if we do it like this, then what we can say is we have basically 0.25 instructions per cycle, or IPC, or four cycles per instruction, CPI. So I mean, these are two measures of the same thing.
Starting point is 00:45:13 One is basically one divided by the other. And of course, we want something that's higher than 0.25. So how many instructions can we do per individual cycle, right? So typically, you would see something that's one or even above one. And this is done by actually basically doing these stages, the individual pipeline stages in parallel, right? So we have the instruction fetch, and this is different pieces of hardware
Starting point is 00:45:47 that are running here or that are working here. And this is also how the pipeline is set up. So it's not set up that the same piece of hardware has to do multiple things one after the other. It's really different parts of the hardware, and we're just clocking this in order to break it down further. So now what we can do is basically we can do pipeline parallelism here, right? So while we're decoding
Starting point is 00:46:10 the first instruction, we can actually fetch the next instruction. And while we're fetching or decoding the second instruction and executing the first instruction, we can fetch the third instruction. And this really depends on the length or the depth of the pipeline, how parallel we can do this. And ideally, we can basically just do, like, continuously just fill up the pipeline, such that every step is always filled, like every functional unit always has something to do. And with this, we get this instructions per cycle of one or cycles per instruction of one. So we basically fully utilize our instruction or our system all the time. And ideally, I mean, this is basic pipeline parallelism
Starting point is 00:47:01 that you've hopefully already heard. If we have a K stage pipeline then our throughput will be improved or the performance will be improved by a factor of K. Of course the slowest part in there will determine the clock frequency or will determine also the throughput of the pipeline. So let's say either we're clocking the CPU in a way that each of these things can actually be executed on a single cycle, then probably the execution is most expensive, right?
Starting point is 00:47:37 So then whatever the slowest execution is will basically determine how fast our clock cycle can run. Or if we're using multiple cycles in one of these stages, then basically the other stages will have to wait. So that's basically how we determine the throughput in here. And this also, but the more stages we have, and if we have a single cycle per individual stage, this also means the latency goes up, right? So, for each individual operation, we'll actually need multiple cycles to finish them,
Starting point is 00:48:13 but we can do more operations in parallel. So, that basically means a single operation won't be done in a single clock cycle, but we will need a couple of clock cycles to actually run a single operation. But this part here will basically execute completely in parallel, if all goes well. Okay, questions so far? As I said, a typical CPU will have longer pipelines, and we'll get to this in a bit more detail. And now we're going to do a quick 4-minute break. And then I'm going to talk to you about the pipeline hazards.
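Just to put rough numbers on the pipelining argument (a back-of-the-envelope calculation, not from the slides): with a k-stage pipeline, N instructions need about k + (N - 1) cycles once the pipeline is kept full, instead of N * k cycles when executed strictly one after the other. For the 4-stage example and, say, 1,000 instructions, that is 4 * 1,000 = 4,000 cycles unpipelined versus 4 + 999 = 1,003 cycles pipelined, so the speedup approaches the factor k = 4 and the IPC approaches 1, as long as no hazard forces a stall.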
Starting point is 00:48:54 So, when does the pipelining not work well? Okay, so let's get started again. And talk about pipelining hazards, right? So, pipelines are these. And now the question is, well, ideally, this always works nicely. But in practice, we might have some problems with this. And this is what we call pipeline hazards.
Starting point is 00:49:16 So whenever the pipeline doesn't really fully work out, so the pipeline parallelism is somehow broken, this happens due to some kind of hazards. And there's structural hazards, so meaning we have something or some kind of operations that need the same functional unit, right? So we have some kind of resource conflict on the CPU. And this is, for example, so the instruction fetch and the memory access. So they both need to get something from the caches. They both need to get data in form
Starting point is 00:49:52 of instructions or actual data. So that's the same kind of unit. If there are not enough of these units on the CPU, we're basically constrained by this unit on the CPU. Then all of a sudden, we have a structural hazard. We will not get the perfect pipelining. Then we have a data hazard.
Starting point is 00:50:18 This is basically whenever we have instructions that somehow need the same kind of data or data from previous instructions, right? So then all of a sudden we're in a conflict because of the pipelining, right? Especially if the pipelines are very long. If the next instruction works with the same data, then it might be already in the execution whenever the previous pipeline hasn't, or the previous
Starting point is 00:50:46 instruction hasn't written the data yet. So, because of this overlapping, we might get into a problem. Then control hazards: whenever we're branching, without any branch prediction we will have to wait for the branch to actually be evaluated to continue with our instructions. So that's why there is branch prediction, because otherwise, for every branch, we first have to completely evaluate the branching condition in order to then continue with whatever happens next. Data access is a problem too, so whenever we have to go all the way to memory,
Starting point is 00:51:31 then the CPU will have to wait, the pipeline is broken to some degree, or if instructions are not loaded, et cetera, that might also be stalling things. And these hazards are all things that can happen, and if they happen, they will stall the pipeline; that's what we call a pipeline stall. So all of a sudden, our pipeline is in trouble, right? We cannot continue, we cannot further fill our pipeline. So let's look at a structural hazard.
Starting point is 00:52:06 And this is basically whenever the CPU cannot support all possible combinations of instructions. So if the CPU only has a single memory access unit, and the instruction fetch and memory access are scheduled in the same cycle, then, well, we have two memory accesses at the same time. One has to wait. So then basically, we see this here.
Starting point is 00:52:32 We have an instruction, like our memory access here in the first instruction. The second instruction is fine, because we do this while we're doing the instruction decode. But here, we want to fetch something while the memory unit is blocked by this actual execution, or by this instruction, so then we need some kind of stall, and we're basically just losing one cycle before the next instruction's pipeline gets started.
Starting point is 00:53:07 And this can also happen because individual functional units are not fully pipelined. So this is like some functional units are actually quite complex. They need to do a lot of operation, and they need multiple cycles to execute. So that basically means we basically have to wait until we send something new into this functional unit. So this is especially true for what we're going to see next time.
Starting point is 00:53:38 So, this vector processing. These operations are quite complex. They might use multiple cycles to execute. Then a new instruction basically won't be able to use the same unit, because we just have to wait until this unit is done, until the current operation is done. And this is basically where the CPU manufacturers need to make trade-offs, right? So basically, more functional units means more parallelism; more complex pipelines, breaking things down further, means longer pipelines, means better parallelism there maybe,
Starting point is 00:54:17 but also means more cost, because it's more complex to maintain this and more circuitry is needed to actually do this properly. So, well, in the end, they often need to make a trade-off and say, well, in this case, if you're doing this, it will take a couple of cycles, it won't be a single execution, and your pipeline will suffer to some extent. And for this, I mean, it's basically something that the hardware needs to handle, right? So we might come up with something where we specifically target these things, but typically, in a regular workload, if that happens, let's hope that the CPU is properly designed such that it can somehow mitigate these problems and that the scheduling
Starting point is 00:55:10 works well, so that some reordering, etc., happens and it works well. Yeah, so then we have these data hazards, and they can basically happen if the pipelining changes the relative timing of read and write accesses. In general, from a single-threaded point of view, we have a single core, we're talking about a single core still, so there's no multi-threading or anything like that in there. I'm expecting one operation to be executed after the other, logically. Practically, now all of a sudden there's some overlap in here. We're still doing some parts of the previous operation while we're already doing the next operation.
Starting point is 00:56:02 And this is where things can somehow get awkward. If I'm doing, for example, a read after a write. So I'm reading in the next instruction, I'm reading the register that is written by the previous instruction. If this is done at the very end of the pipeline, then my next step in the pipe or my next pipeline might want to access this
Starting point is 00:56:25 before it's actually written back. So then we have this data hazard. Or we have a write after read. So all of a sudden what could happen is that we're writing a register before it's actually read by a previous operation, right? And this happens only if we're reordering instructions. So typically this should not be happening
Starting point is 00:57:00 if we're still in the same order, because the writing will happen all the way at the end of the pipeline. But since CPUs can reorder instructions to some degree, this might all of a sudden be a problem for us, that we have an instruction that writes something that the previous instruction hasn't read yet. So then we need to somehow stall this and wait until the read has occurred.
Starting point is 00:57:25 And write after write, of course, same problem, right? So we're writing something that a previous instruction should actually write first, so the previous instruction will overwrite what we're writing in the later instruction if we have some reordering in here. Again, this only happens with reordering, but the read after write actually can happen even if
Starting point is 00:57:46 we don't do any reordering of instructions. So let's look at an example. So here we have an addition, a subtraction, an and, and an or, and basically the way this works in assembly here is that we have two arguments, R2 and R3, and we're writing the result in the register 1. So, we're basically reading R2 and R3, making an addition and writing it back into register 1. And in the second instruction, we're reading register 1 and register 5, and we're writing to register 4, and so on and so forth. And here you can see that basically the instruction, the subtraction basically reads the register R1 before it was written by the addition. This can be mitigated. So there is an option that we actually have
Starting point is 00:58:47 some kind of forwarding or bypassing hardware. So basically, in the hardware, we can say, well, the next instruction will use the same register here. And the computation will actually happen later, right? But the instruction fetch or the register load happens before the actual execution, so then we somehow need to make sure that this works in one way or the other, right? So here we can see that basically the instruction decode step of the subtraction will read the register which should have been written here.
Starting point is 00:59:29 So now there is a bit of hardware in the CPU to make sure that this result here will also be available here. We only actually need it in here, right? So the actual calculation will happen in here. But if we're just reading whatever is in the register right now, then this will be wrong, right? Because at this stage the register still has something which might be from somewhere else; it's not what we've written here.
Starting point is 00:59:58 The same is true in here, right? So basically, here we actually don't know exactly, right, it might be written, might not be written, et cetera, so here there is some additional hardware that will save us from this, but technically this might be a problem. Then here we have a similar setup where we're doing a load, right? So we're loading something from memory. So we have the address R2 and we're loading it into R1.
Starting point is 01:00:30 And then we're reading, or we're subtracting, whatever was in there and writing the result somewhere else. So here, technically, this execution is a load, so this will take time, right? This is not a single cycle, essentially. If we're doing the instruction decode here and we're reading from the register, even forwarding hardware won't help us, because the value is not available in time. So here we actually need to stall, right? So we basically need to wait until this is available
Starting point is 01:01:08 and nothing can happen until this is available. So the pipeline will basically just be stalled until we have the data available. Then we can basically start our decode and only then the next instructions can be fetched and so on and so forth. So here, scheduling might help us. And this is why the CPU all of a sudden then starts to reorder the instructions, because basically getting the data from memory will take time, so maybe I can do something else in between that's useful. And all of a sudden we get into this trouble
Starting point is 01:01:50 that we do something like a write after write, which also might be what the CPU needs to be aware of, right? If we're storing or if we're reordering here. Okay. But anyway, at some point, I mean, once we're always going into memory with our execution, this will basically empty our pipeline or stall our pipeline and slow things down. So there we have to think if we can do something. And then finally, we have the control hazards and these come from branches and these are actually
Starting point is 01:02:25 more severe than data hazards, because essentially we don't know what's going to happen next. So in a branch we can predict what will happen, but technically we have to wait and see what happens. And the very simple way of dealing with this kind of control hazard, like if we make a wrong prediction, is that we're flushing the pipeline, meaning we're just emptying the pipeline, starting from scratch and redoing the instruction fetch. And this basically costs in the order of the length of the pipeline. So even while we're executing the branch, we don't know if this is the right next instruction. I mean, we're just reading the program counter; this is the next instruction in the program counter. But until we know exactly what's in here, or what the correct thing is, or whenever we know this,
Starting point is 01:03:33 we can only then continue to compute. So we'll basically wait and then go with the right target instruction, or we can basically only start once we know what the target instruction is. So this might really just completely empty our pipeline until we know what exactly is going on next.
Starting point is 01:04:01 So, another question: I mean, some of these we know we cannot really do anything about; some of these we might be able to do something about. So, how can we mitigate some of these pipeline hazards? And some stuff is actually done in hardware, right? So some things the hardware vendors try to improve in order to fix these problems. And one thing is that we actually have superscalar hardware, meaning it's not just a single pipeline that we're constantly filling.
Starting point is 01:04:45 So, I mean, even if I draw multiple pipelines here, it's actually a single pipeline with these stages, where we're adding new things. So, we have instruction fetch, instruction decode, execute, and write-back pipeline steps, and there are just these four steps, basically. We're just doing different things here. So it's not four individual pipelines, right? But what we can do, we can actually have multiple pipelines, right? So if we know
Starting point is 01:05:18 we wanna do multiple things in parallel or things take longer time, well, let's just replicate some of these things. So we can have multiple pipelines and these actually can have different functionality. So we use them for different parts or we can have multiple functional units that we're using and we're using them separately
Starting point is 01:05:39 or we're using them in different pipelines, just in order to make sure that we do the right things. And these, for example, you will always have, right? So we will always have an arithmetic logical unit, store and load units, and floating point units. And now if we have multiple pipelines, we can also use them separately with different kinds of instructions in parallel. And with this, if we do it right, we actually can get instructions per cycle larger than one. So all of a sudden, we can do more parallelism even in a single core, and we can execute more operations
Starting point is 01:06:18 or more instructions than we would usually do within a single clock cycle. I mean, one thing that we saw, these control hazards, this comes from branching. And here, the CPUs try to predict these branches. And the CPUs, they have some hardware just to do this. There's a question. Sorry, just when we have the superscalar architecture
Starting point is 01:06:46 and we have four instructions at once, why doesn't this affect the instruction decoding? Would that become more complex? Do we then use four different hardware units? So the question is, why don't we have more or different instruction decoding if we have different hardware units or different functional units? Technically, we just need to know on which unit the instruction needs to be executed. So to the instruction decode, the micro-operation looks the same; it just needs to be routed to the right functional unit.
Starting point is 01:07:35 And I mean, let me also rephrase: this is just a schematic. In how this is actually done in hardware, you might actually have separate pipelines. So where these pipelines start, which parts are replicated, this really depends on the core architecture, so on how an individual core is then built. I have some schematics for some different cores later on.
Starting point is 01:08:02 So there you can at least guess to some degree how this is done. You're welcome. We'll probably not be able to get there today, so then we'll do it next week. OK, so branch prediction, we already heard, is kind of tricky because it might flush our pipelines
Starting point is 01:08:23 or at least stall everything. So let's do something smarter: the CPU tries to speculate which branch is the right one, and after guessing, it just starts execution. The prediction needs to be done early; doing it in the instruction decode is too late,
Starting point is 01:08:47 because we already need to know where to go; this needs to happen in the instruction fetch. So this, again, needs to be done in hardware and fairly fast. For this there's a branch target buffer, or branch target cache. It's basically a simple lookup table that says, for this program counter, so for wherever we are in the program, what the predicted branch is and whether it has been taken or not. Initially we might say we always take the true branch, or we take a random branch, and then we update whether it was actually taken or not.
Starting point is 01:09:27 So it's really just a binary decision. If the branch has been taken, then the prediction was good and we keep it. If it has not been taken, then the prediction was bad, so we update the prediction and use the other direction. Maybe there's a bit more magic to it, but that's basically how it works. The branch target buffer will be consulted
Starting point is 01:09:51 during the instruction fetch, and then we'll see where we need to go. If there's an entry, we follow the prediction. If not, we do some kind of static prediction, as I said, always true, always false, or random, and then check after the branch whether that was correct or not. And, like many of these things, how exactly it's done in the hardware is kept secret.
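To make this concrete, here is a minimal software sketch of such a one-bit, program-counter-indexed predictor; the table size and the example branch address are made up. It is purely illustrative: real branch target buffers are hardware structures that also store predicted target addresses and typically use multi-bit or history-based schemes.

```cpp
#include <array>
#include <cstdint>
#include <iostream>

// Toy one-bit branch predictor, indexed by (a few bits of) the program counter.
// A real BTB also stores the predicted target address and is far more elaborate.
struct OneBitPredictor {
    std::array<bool, 1024> taken{};  // last observed outcome per table slot

    bool predict(uint64_t pc) const { return taken[pc % taken.size()]; }

    void update(uint64_t pc, bool was_taken) { taken[pc % taken.size()] = was_taken; }
};

int main() {
    OneBitPredictor predictor;
    const uint64_t branch_pc = 0x400123;  // hypothetical address of one branch

    int correct = 0, total = 0;
    // A branch that is taken 9 out of 10 times is predicted well after warm-up.
    for (int i = 0; i < 1000; ++i) {
        bool actual = (i % 10) != 0;
        correct += (predictor.predict(branch_pc) == actual);
        predictor.update(branch_pc, actual);
        ++total;
    }
    std::cout << "correct predictions: " << correct << " / " << total << "\n";
}
```

After a short warm-up, a branch that almost always goes the same way is predicted correctly almost every time, while a truly random branch is right only about half the time.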
Starting point is 01:10:24 But you can guess it works roughly along these lines. And we can try it out and see that if our branch outcomes are random, the prediction is bad. There's no magic behind it, right? It cannot predict the future properly; it can only use whatever it has learned already. So if a branch has gone many times in one direction,
Starting point is 01:10:57 or the last outcome was a certain way, then it will just try that again. So this is branch prediction. What we can also do from our point of view: the branch itself is basically the problem, right? If we use a branch, then we have different code paths. The CPU will do a prediction. If it's right, it's good.
Starting point is 01:11:26 If it's wrong, it's bad, because then we have a control hazard and we flush our pipeline, go back, do a new instruction fetch, et cetera, and we get a bad instructions-per-cycle count. What we can still do in our program is mitigate or avoid branching, and in data processing this is essentially fairly easy: we can turn the control flow into data flow. So again, let's look at our lineitem query. It's basically the same query that we had before, just with a slightly different predicate.
Starting point is 01:12:08 So we're doing the count with a different predicate. It could be the same predicate, it doesn't really matter. If we directly translate this into C or C++ code, then this is a branch: if our predicate matches, then we increase the count. But we don't have to do the branch. Anybody know how this works? You could just add the comparison? Exactly.
Starting point is 01:12:36 We can add the comparison. Somebody did a nice experiment executing this with a selectivity of about 50% on random data: if half of the time the branch prediction is wrong, then this takes much longer than if the branch prediction works well. In these cases,
Starting point is 01:13:11 if the branch always goes the same way, then the branch prediction is good: we never run into these control hazards and never have to flush the pipeline, so this actually runs quite quickly. As soon as we have 50% selectivity, it doesn't work well anymore, because we pay a lot of overhead for the branch misses. And now what we can do, as you correctly said, is basically just add the comparison,
Starting point is 01:13:45 right? So this is no branch; this is basically just checking whether it matches and adding the result of that check to our count. Here we don't have a branch; we always execute this. In the upper code, the addition is only executed if the predicate matches; in this version, the addition is always executed. So we always have an addition, but the addition is with one if it matches and with zero if it doesn't.
Starting point is 01:14:18 So we do a bit more work in the cases where the predicate doesn't match, because we also do the addition, which we don't do in the upper version, but we never have to branch. And this is what we'll see in performance: in the cases where the branch prediction works well, we do a bit more work, because we always have to calculate. But in the cases where the branch prediction doesn't work, because half of the branch predictions are wrong, we actually waste a lot less work.
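As a rough illustration, here is a minimal sketch of the two count variants discussed here. This is not the exact code from the slides; the column layout and the threshold parameter are made up for the example.

```cpp
#include <cstdint>
#include <vector>

// Branched version: the if decides whether the increment runs at all,
// so a mispredicted branch flushes the pipeline.
int64_t count_branched(const std::vector<int32_t>& column, int32_t threshold) {
    int64_t count = 0;
    for (int32_t value : column) {
        if (value < threshold) {
            ++count;
        }
    }
    return count;
}

// Predicated version: the 0/1 result of the comparison is always added,
// so there is no data-dependent branch in the loop body.
int64_t count_predicated(const std::vector<int32_t>& column, int32_t threshold) {
    int64_t count = 0;
    for (int32_t value : column) {
        count += (value < threshold);
    }
    return count;
}
```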
Starting point is 01:14:56 So there we're actually quite a bit faster. There is also hardware predication, where both branches are executed and one result is discarded; the hardware can do this for us to some degree. But, to be honest, the compiler will do this for you. In simple cases, if you compile with optimizations,
Starting point is 01:15:30 the compiler already knows these kinds of tricks, so you don't have to be smart about it. If you write the one version or the other, the assembly is the same; the compiler will already have seen this, and you get the same performance. Yes? Question about the predicated version: I do the comparison and get a Boolean, which in most cases is a 0 or 1. But I think the C standard doesn't demand that the true case be exactly 1.
Starting point is 01:16:06 So am I at risk that those comparisons could, for example, return 1,000? So the question is whether, in the C standard, this might also evaluate to something else. There, I have to be honest, I would have to look it up. But this is basically one of the standard examples.
Starting point is 01:16:31 Whatever the compiler does is probably correct, so you could check what the compiler makes out of this; it probably puts some safeguards around it to make sure that the result is always correct. Typically, you also have multiple predicates, and then you can do the same thing in different ways.
Starting point is 01:17:01 So you can also do predication with multiple predicates. If you use a classical logical AND, that will be translated into nested if statements. But you can also use a bitwise AND, which will be evaluated as a single if statement, or again write a predicated version without any branches. If we do this, then we'll see,
Starting point is 01:17:43 again, somewhat of a trade-off. The branched version is good if the branching always works well: if the branch prediction is perfect and our predicates basically always evaluate to 0, meaning we don't have to do additional work, then this is faster, because we don't have to do the addition.
Starting point is 01:18:14 If the branch prediction is mostly wrong, or wrong half of the time, then the branching will be bad and the predicated version will be faster. The bitwise AND is basically as fast as the predicated version in the cases where we mostly evaluate to false; but if we evaluate to true a lot, then it will be just as slow as the branched version again.
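Here is a minimal sketch of the three variants for two predicates, assuming two hypothetical columns of equal length and made-up constants; the exact predicates don't matter for the argument.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Logical AND (&&): short-circuits, so this typically compiles to two nested,
// data-dependent branches. Column names and constants are made up.
int64_t count_logical_and(const std::vector<int32_t>& quantity,
                          const std::vector<int32_t>& price) {
    int64_t count = 0;
    for (std::size_t i = 0; i < quantity.size(); ++i) {
        if (quantity[i] < 24 && price[i] < 1000) {
            ++count;
        }
    }
    return count;
}

// Bitwise AND (&): both comparisons are always evaluated, leaving a single branch.
int64_t count_bitwise_and(const std::vector<int32_t>& quantity,
                          const std::vector<int32_t>& price) {
    int64_t count = 0;
    for (std::size_t i = 0; i < quantity.size(); ++i) {
        if ((quantity[i] < 24) & (price[i] < 1000)) {
            ++count;
        }
    }
    return count;
}

// Fully predicated: no data-dependent branch at all, the 0/1 result is always added.
int64_t count_predicated(const std::vector<int32_t>& quantity,
                         const std::vector<int32_t>& price) {
    int64_t count = 0;
    for (std::size_t i = 0; i < quantity.size(); ++i) {
        count += (quantity[i] < 24) & (price[i] < 1000);
    }
    return count;
}
```

With &&, the second comparison is only evaluated when the first one is true, which is exactly what introduces the extra data-dependent branch.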
Starting point is 01:19:07 The predicated version, however, always needs to do this memory write, and that write is where its overhead comes from: if we don't actually have to do the work, the predicated version carries a little bit of overhead. OK, let me see, I'm going to cover two more things. The other thing that the CPU can do is out-of-order processing.
Starting point is 01:19:39 So instructions can be dynamically scheduled. There are reservation stations where the instructions go, and they will be executed whenever all hazards are cleared. Registers can also be renamed in order to reduce hazards, so that some work can run in parallel
Starting point is 01:20:03 by not overwriting the same registers. And we have different reservation stations for the different kinds of units. Within these, to get some speedup and fill the pipelines, the CPU might reorder operations. To some degree we can actually exploit this ourselves, because here in our count we have a data hazard.
Starting point is 01:20:40 In our count, we're always writing to what is in the end the same register, so each new addition needs the result of the previous addition in order to continue. We can change this ourselves by using two cursors. All of a sudden, having two cursors means we have two operations, essentially two registers, two additions that can be reordered
Starting point is 01:21:13 because there's no dependency between the calculations. We're going through the data from two different positions, doing two additions, and this can be done in parallel; whatever the CPU wants to do here, it can do. In the end, we just have to add up the two partial counts.
Starting point is 01:21:37 So that means there are some additional gains for the CPU here, and hypothetically we could see a bit of additional performance improvement from this, because the CPU doesn't have the kind of data stalls that we had with the previous version. But as a last example, I will actually show this to you.
Starting point is 01:22:10 So let's look at some code before we end today. I'm doing basically the very same thing. Let me see if I can find my cursor here. I'm using a large data array, filling it with random data, as I usually do, and computing a sum over the array in the end. And here I'm checking each value against a certain condition, let me see.
Starting point is 01:22:48 So I'm adding up the numbers, and I'm checking against half: the values are random numbers up to the integer maximum, and I'm using half of the integer maximum as my branching condition. If a value is lower, I add it; if it's larger, I don't. So 50% of the time, most likely, if everything is well randomized,
Starting point is 01:23:14 this branch will not be hit, and 50% of the time it will be. So this is basically what I'm doing here: I check, if it matches I count, and then I output the result. I just go through the array and check; this is very similar to what we did earlier. And then a predicated version of this looks just like this: I'm doing the comparison
Starting point is 01:23:35 and adding the comparison result to my count, and that should, of course, give us the same result. And then finally we have this out-of-order version where I'm going through the array with two different cursors, basically doing steps of two, which removes this data hazard on the counter to some extent. So there's some additional reordering that the compiler can do. And then I'm just outputting everything and I'm
Starting point is 01:24:10 outputting the timings; as always, this is just taking some timings here. I'm going to show you this on the command line. I'm first compiling it without optimization flags, and that's important. Running it without optimization flags, we see that the branched version takes 91 or 92 milliseconds, the predicated version takes 15 milliseconds, and the out-of-order version, so where the reordering works, takes 13 milliseconds. So we can see some difference here. What if you compile with O3? I'm going to do this next.
Starting point is 01:25:03 That's exactly what I wanted to show you. So let me just do this with O3, and we'll do the same thing. And all of the nice benefits are gone, because the compiler actually does a lot of work for us here. Now we see roughly 800 microseconds for each of these executions; they're all basically the same. There's a slight difference, but it's very slight. And the reason for this, you can check it yourself by analyzing the generated code, is that the compiler sees that there's an opportunity for predication here, and the compiler will vectorize this code.
Starting point is 01:25:50 So the compiler will use additional (vector) functional units, because the loop is so simple: these individual calculations don't all need to be done one by one, so it can get some parallelism in there even within a single functional unit, using the vectorized operations that I'll show you next time. Okay.
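For reference, here is a small self-contained sketch along the lines of the demo just described. It is not the original code, and the array size, random seed, and output format are assumptions; the concrete timings mentioned above are from the original demo. Compiling it once without and once with optimizations should reproduce the qualitative effect.

```cpp
// Rough sketch along the lines of the lecture demo (not the original code).
// Compile once without and once with optimizations, e.g.:
//   g++ -std=c++17 -O0 branch_demo.cpp -o demo_O0
//   g++ -std=c++17 -O3 branch_demo.cpp -o demo_O3
#include <chrono>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

int main() {
    const size_t n = 50'000'000;  // assumed array size
    const int32_t threshold = std::numeric_limits<int32_t>::max() / 2;

    // Random data: roughly 50% of the values fall below the threshold,
    // which is the worst case for the branch predictor.
    std::mt19937 gen(42);
    std::uniform_int_distribution<int32_t> dist(0, std::numeric_limits<int32_t>::max());
    std::vector<int32_t> data(n);
    for (auto& v : data) v = dist(gen);

    auto time_it = [](const char* name, auto&& fn) {
        auto start = std::chrono::steady_clock::now();
        int64_t result = fn();
        auto end = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << name << ": result=" << result << ", " << ms << " ms\n";
    };

    // Branched version: the if is hard to predict at ~50% selectivity.
    time_it("branched", [&] {
        int64_t count = 0;
        for (size_t i = 0; i < n; ++i)
            if (data[i] < threshold) ++count;
        return count;
    });

    // Predicated version: always add the 0/1 comparison result, no branch.
    time_it("predicated", [&] {
        int64_t count = 0;
        for (size_t i = 0; i < n; ++i) count += (data[i] < threshold);
        return count;
    });

    // Two cursors: two independent accumulators break the single dependency chain,
    // so the CPU can overlap the additions.
    time_it("two cursors", [&] {
        int64_t c0 = 0, c1 = 0;
        for (size_t i = 0; i + 1 < n; i += 2) {
            c0 += (data[i] < threshold);
            c1 += (data[i + 1] < threshold);
        }
        if (n % 2 != 0) c0 += (data[n - 1] < threshold);  // leftover element
        return c0 + c1;
    });
}
```

With -O3, current compilers typically auto-vectorize all three loops, which is why the differences largely disappear, as shown in the demo.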
Starting point is 01:26:16 Further questions regarding this? Okay. So this is also one of the examples where premature optimizations don't really help you that much, because the compiler is already quite smart, so you can complicate your code a lot without achieving much benefit. Yes? How many cursors do you need, or how many should you not use? Because in an extreme case, you could just access them all one by one, right? So would you iterate?
Starting point is 01:26:52 You mean you don't iterate at all, and just create so many cursors that you cover everything and can access it all at once? Yes. It wouldn't help much, right? Yes. So the question is, how many cursors would you want?
Starting point is 01:27:07 It's really just a question of whether you can mitigate some of these data hazards, so whether we get some additional benefit from reordering. We see slight benefits. I haven't tried it with four cursors, whether that would give us an additional benefit. As soon as you do this, it's sort of already somehow
Starting point is 01:27:28 vectorizing this, right? We're working in blocks, and the CPU or the compiler will see this and optimize there already. But then we suddenly need to deal with corner cases at the borders of the arrays. So I would say we'll probably not see any further improvements after four or so, because then,
Starting point is 01:28:02 think about the pipeline: it's not endlessly long, so at some point reordering won't help you anymore. With four cursors you probably don't have any data hazards left, because it's only a few micro-instructions that would actually result in these data hazards. But to be honest, I haven't tried it.
Starting point is 01:28:28 I don't think I have this code online yet; I can put it online. You can try changing it to four cursors and see, without compiler optimizations, whether you get any further improvements here. OK, no further questions. We're already a bit over time.
Starting point is 01:28:49 Then thank you very much. I'll see you next week. We'll talk about prefetching and the core architecture, and then fully dive into vectorization, where we can get these kinds of performance improvements. Thank you.
