Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching & Instruction Execution
Episode Date: May 3, 2023...
 Transcript
                                         So today, later today, we're going to talk about instruction execution.
                                         
                                         Before that, we want to come back to the CPU and caches, and especially data layout and
                                         
                                         alignment today.
                                         
But before we do so, well, I've already talked about the code examples, and I've not updated them yet, but there will be more.
                                         
                                         But then there was a question in the forum.
                                         
                                         And because we discussed it just right now,
                                         
                                         I want to also answer that right here.
                                         
    
                                         And that's regarding recording the tasks.
                                         
                                         So next week, we're going to start the SIMD task.
                                         
                                         And the part, like the setup and everything, we will record.
                                         
Meaning everything about how to set up, how the tasks are run, etc., all of that, and also the task details for the first task, we will record.
                                         
                                         However, all of the solutions, that part, we're not going to record.
                                         
                                         So wherever we show you how the task would have been solved
                                         
                                         and our code, et cetera, that part won't be recorded.
                                         
    
                                         So other people and future students can also enjoy the task
                                         
                                         just like you did.
                                         
                                         So that's the idea.
                                         
Then another thing, since we're on the topic of recordings: I managed to not turn the mic back on for the second half last time, yesterday. That means we missed half of the recording from yesterday. We're just going to
                                         
                                         post the link to last year's recording in Moodle for those of you who enjoyed this asynchronously.
                                         
    
                                         You can see whatever I told you yesterday from last year, basically.
                                         
                                         Exactly. So, that's basically it. Just as a reminder, so we're still in the CPU realm.
                                         
                                         And as I said, we already moved stuff a bit. So, next week, we're going to start with SIMD on Tuesday and then have the task on Wednesday.
                                         
                                         But, as I said, we'll also record this.
                                         
                                         And otherwise, well, everything is as is.
                                         
                                         We're in the instructions right now.
                                         
                                         But, as I said, we're going to move back to the CPU cache first.
                                         
                                         And this is where we left off.
                                         
    
                                         So we talked about the different kind of caches
                                         
                                         that we have and some of the cache performance,
                                         
                                         and also how to calculate cache performance,
                                         
and then also the translation look-aside buffer, because that's all stuff that you need to be aware of if you're writing really high-performance code.
                                         
                                         Because I mean, you don't really have to write for that, but you will see the effects depending
                                         
                                         on how you write your code.
                                         
    
                                         And well, here kind of closing up this cache stuff, not completely,
                                         
                                         but as one kind of overview example,
                                         
here's an example of the SPECint performance numbers.
                                         
So this has been done by, well, I don't actually remember who performed these experiments,
                                         
                                         but you can see instruction cache misses and L2 cache
                                         
                                         misses on some CPU, some server, probably x86 server CPU.
                                         
                                         And just in terms of misses per 1,000 instructions,
                                         
    
And this would be, as we said: at 10 misses per 1,000 instructions, our cache efficiency here would be 99%, just as we calculated last time, if you remember.
                                         
And we can see that for most programs, the miss rate is actually much lower than that. So we're in the less-than-1% realm, more in the 5-per-mille realm, or even less than that, for many programs.
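Just as a back-of-the-envelope check (assuming, as in last time's calculation, one cache access per instruction): 10 misses per 1,000 instructions is a miss rate of 10/1000 = 1%, hence 99% cache efficiency; 5 misses per 1,000 instructions would be 0.5%, or 99.5% efficiency.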
                                         
And SPECint is one of these benchmarks. It's basically a suite of different kinds of small, or not even really small, programs that are run and then profiled for performance.
                                         
    
So say stuff like GCC, the compiler, or gzip, so some standard programs, let's say, that you would run.
                                         
                                         And this is used to test the performance of CPUs.
                                         
And well, we see that most of these tasks do quite well, except for this MCF, which is typically bad. I don't exactly remember what it does, but it's really bad in terms of caching. Other than that, well, most programs
                                         
    
                                         have quite good caching behavior.
                                         
And in comparison to that, if we now look at something like TPC-C, and we run this on a standard database... So TPC-C is one of these OLTP benchmarks, which I told you about, which just does small transactions.
                                         
                                         So you have warehouses, and you have things that are bought and sold
                                         
                                         and stocks that need to be managed. So different kinds of transactions which just touch small
                                         
parts of the data. And you can see that, on the one hand, the L2 cache misses are quite high in comparison to most of the other programs. But even worse, the instruction cache misses are really bad as well.
                                         
                                         So here we'll have a poor performance in terms of instruction cache.
                                         
                                         And the question is, of course, why?
                                         
                                         So why does this happen?
                                         
                                         And so why do database systems have such poor cache behavior
                                         
                                         in general?
                                         
                                         And I told you all the way in the beginning
                                         
                                         that classical database systems, MySQL, Postgres, DB2, et
                                         
    
                                         cetera, in their original form were all built
                                         
                                         around this bottleneck of disk.
                                         
                                         It's all about disk.
                                         
                                         Because of that, because that's so dominant, it doesn't really matter what happens inside
                                         
                                         the caches.
                                         
So you can execute a lot of instructions, you can do a lot, before you see a problem in the CPU, if you're waiting for disk all the time. And because of that, you basically
                                         
                                         can use, well, let's say, very fancy ways of programming,
                                         
    
                                         lots of libraries, et cetera, and make your code quite,
                                         
                                         let's say, convoluted before you notice any problems.
                                         
However, if all of a sudden your data fits into memory, or even if it doesn't, you will see that your code behaves poorly in terms of cache performance. And this is because of poor code locality; that's basically the whole problem with the instruction cache, right? So for example, you're resolving attribute types
                                         
    
                                         for each tuple individually using
                                         
                                         some polymorphic function, meaning basically
                                         
                                         the code base jumps around all the time
                                         
                                         for each individual attribute.
                                         
                                         You're going to completely different code paths.
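As a minimal sketch of what such per-tuple, per-attribute dispatch might look like, assuming a hypothetical reader-class hierarchy (not any specific system's code):

    // Interpreted, tuple-at-a-time attribute access: every value read
    // goes through a virtual call, so execution keeps jumping between
    // type-specific code paths instead of staying in one tight loop.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct AttributeReader {                 // one subclass per SQL type
        virtual ~AttributeReader() = default;
        virtual std::int64_t readAsInt(const std::byte* tuple) const = 0;
    };

    struct Int32Reader : AttributeReader {
        std::size_t offset;                  // byte offset within the tuple
        explicit Int32Reader(std::size_t o) : offset(o) {}
        std::int64_t readAsInt(const std::byte* tuple) const override {
            std::int32_t v;
            std::memcpy(&v, tuple + offset, sizeof v);
            return v;
        }
    };

A scan that goes through readAsInt for every value pays a virtual call, and likely a jump to a different type-specific code path, per attribute per tuple.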
                                         
                                         And if your code is large, and think about something
                                         
like Oracle, this is in the gigabytes of code base. Meaning whatever you install, this is not small. I don't know about HANA; you might actually know what SAP HANA is. Have you ever touched that? You don't remember? Anyway, its code base is also not going to be in the kilobytes. It's going to be larger.
                                         
                                         And of course, this is optimized to some degree.
                                         
                                         But say, classical SAP systems, also DB2, et cetera,
                                         
                                         these are huge systems.
                                         
                                         And the code base is not just, well, whatever.
                                         
                                         But even the core code base for just operating
                                         
                                         on the individual attributes will already be large.
                                         
    
And so basically the system has to go down a lot of different code paths while executing.
                                         
And not just because of different attributes, but also because of the Volcano iterator model.
                                         
                                         So this is the traditional way of processing data in a database, tuple by tuple, in an iterator fashion.
                                         
So you all know the software engineering pattern of an iterator, where you basically say: get next, get next.
                                         
And this is, and you will see this later and also separately in the lecture, how you execute traditional database operators.
                                         
                                         Basically, you stack them on top of each other, each having an iterator interface,
                                         
                                         and then each of them calling each other.
                                         
And this means each individual tuple will basically pass through this query plan individually, going through a lot of large code base. Which also means a lot of functions need to be called and a lot of code needs to be loaded, and that won't stay cached in the end if it's too much, or it won't fit into the cache at all, at least not into the L1 instruction cache.
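A minimal sketch of this get-next interface (simplified; real systems pass tuples, schemas, and resources quite differently):

    // Volcano-style operators: each next() call produces one tuple and
    // calls down into its child, so a query executes as many virtual
    // calls per tuple through a potentially large code base.
    #include <cstddef>
    #include <optional>
    #include <vector>

    using Tuple = std::vector<int>;

    struct Operator {
        virtual ~Operator() = default;
        virtual std::optional<Tuple> next() = 0;   // one tuple per call
    };

    struct Scan : Operator {
        const std::vector<Tuple>& table;
        std::size_t pos = 0;
        explicit Scan(const std::vector<Tuple>& t) : table(t) {}
        std::optional<Tuple> next() override {
            if (pos == table.size()) return std::nullopt;
            return table[pos++];
        }
    };

    struct Filter : Operator {
        Operator& child;
        explicit Filter(Operator& c) : child(c) {}
        std::optional<Tuple> next() override {
            while (auto t = child.next())          // virtual call per tuple
                if ((*t)[0] > 0) return t;         // some example predicate
            return std::nullopt;
        }
    };
    // Consuming the plan: while (auto t = root.next()) { /* ... */ }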
                                         
    
                                         So that's the one problem.
                                         
                                         And then the other problem, and that's always going to be a problem to some degree,
                                         
                                         is we have to work with large amounts of data.
                                         
                                         If we're talking about an analytical database,
                                         
                                         well, we have to go through the whole table.
                                         
                                         If we're trying to find something in a large table,
                                         
                                         it's not indexed, or we just want an average over the year
                                         
                                         or something, we have to go through the whole table.
                                         
    
                                         It's always going to be a lot of data.
                                         
                                         And so because of that, we'll also always
                                         
                                         see some poor caching behavior, or much poorer caching behavior
                                         
                                         than something where we just have a small working set
                                         
                                         that we're just going to iterate over and over again.
                                         
                                         So I don't know.
                                         
                                         Think about a machine learning model
                                         
                                         that is constrained in size.
                                         
    
                                         If you're working on that, that might actually
                                         
                                         fit into your memory or even into your caches
                                         
                                         if it's a small model.
                                         
                                         You do a lot of operations on that,
                                         
                                         so that's going to be very local if the data sets are not too large, etc. In the database,
                                         
we're assuming we're going to have a lot of data. We're going to go through it over and over again, mostly reading through a lot of it, and that's going to be slow.
                                         
                                         If we're using an index, we don't have to read as much data,
                                         
    
                                         but it means we're going to randomly access data.
                                         
                                         And again, this will have poor caching behavior
                                         
                                         because we basically jump somewhere randomly
                                         
                                         in the large amount of data.
                                         
                                         And basically, the cache won't help us that much.
                                         
However, we can improve things, right? So it's not as if there's nothing we can do just because it's large amounts of data we have to go through. We can still do something through proper data layout and proper code management. I mean, optimizing your code is always an option, basically.
                                         
    
                                         And you can and you will learn to some degree how you can optimize
                                         
                                         for cache performance in terms of data structures, for example.
                                         
                                         So you can build your data structures in a way that they have
                                         
better caching behavior or worse caching behavior. And a lot of it is all around: how much can I fit into my cache lines? Will I read or use all of the data that's within the cache line or not? How much of the data fits into my cache line anyway? Or, if I have to use multiple cache lines, does it fit into the cache at all?
                                         
    
                                         So navigating this, that makes a lot of sense.
                                         
                                         Also how to access the data in the end.
                                         
                                         So do I just do like random jumps,
                                         
                                         which means I'm just using small parts of the cache line maybe?
                                         
                                         Or can I use everything that I have in a cache line?
                                         
So that's something we can work on.
                                         
And all systems, not only database systems, favor cache-friendly code. And one particular thing, which is maybe something that we don't teach that much in this course because you're going to optimize a lot for certain types of CPUs, or for a specific type of CPU: you'll see this is highly platform-specific and highly CPU-specific.
                                         
                                         So cache sizes, etc.
                                         
                                         To get the best performance, you have to be really, really specific with your code.
                                         
                                         In practice, you don't always want that.
                                         
                                         In practice, you actually want to make sure that your code is reusable and can be used
                                         
                                         on other systems as well.
                                         
                                         You always need to do this kind of trade-off.
                                         
                                         How much do I optimize for a certain platform or for a type of platform?
                                         
    
                                         How much do I, by doing so, limit my potential for reusing the code on a different platform?
                                         
                                         That's just something to keep in mind. If you're optimizing just for one cache size, just
                                         
                                         for one block size, for one associativity, et cetera,
                                         
then your code is highly specific and will for sure not perform as well on another platform, maybe even really poorly.
                                         
And this is especially the case if you're setting these hard thresholds.
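For example, a tuned constant like the following bakes one machine's cache size into the code (an illustrative sketch; the 32 KiB figure is an assumption about one specific L1 data cache):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hard-coded for a 32 KiB L1 data cache -- fast on the machine it
    // was tuned for, potentially much slower on any other platform.
    constexpr std::size_t kTileBytes = 32 * 1024;
    constexpr std::size_t kTileInts  = kTileBytes / sizeof(std::int32_t);

    void process(const std::vector<std::int32_t>& data) {
        for (std::size_t base = 0; base < data.size(); base += kTileInts) {
            std::size_t end = std::min(base + kTileInts, data.size());
            // ... repeatedly work on data[base, end) while it is hot ...
            (void)end;
        }
    }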
                                         
                                         There's hope.
                                         
    
                                         I mean, there's some ways to mitigate these problems
                                         
                                         and we'll also talk to some degree about this.
                                         
                                         But there's something to keep in mind, right?
                                         
                                         If you're building systems,
                                         
                                         don't over-optimize for a single platform.
                                         
                                         And ideally, you try to write generic code.
                                         
                                         You try to make your code somehow adaptable.
                                         
But there are basically some rules of thumb for how you can always get good performance.
                                         
One is: work on a small working set, so you have this temporal locality. And use small strides, so you keep spatial locality, working on data that's spatially close.
                                         
    
So that will always give you better performance than if you have no temporal locality and no spatial locality.
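To make the stride point concrete, here's a small sketch (assuming 64-byte cache lines and 4-byte ints):

    // Summing a matrix two ways: same work, very different locality.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Matrix = std::vector<std::vector<std::int32_t>>;

    std::int64_t sum_row_major(const Matrix& m) {
        std::int64_t s = 0;
        for (const auto& row : m)        // stride 1: each 64-byte line
            for (std::int32_t v : row)   // yields 16 useful values
                s += v;
        return s;
    }

    std::int64_t sum_col_major(const Matrix& m) {
        std::int64_t s = 0;
        const std::size_t cols = m.empty() ? 0 : m[0].size();
        for (std::size_t c = 0; c < cols; ++c)   // stride = one whole row:
            for (const auto& row : m)            // a new cache line for
                s += row[c];                     // every single element
        return s;
    }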
                                         
And then if you have nested loops, basically try to focus on the inner loop; that will also give you the most performance. But in general, and that's something that can happen quite easily: don't optimize prematurely, right? So
                                         
if you have one loop where you think you can optimize the hell out of it, make sure that this is actually a loop that counts, right? So that it actually has an impact on your overall performance. So before doing micro-level optimizations, check if this is actually a hotspot in your system. If everything is bound by something else, well, you might spend your time better on another interface that you have a problem with, rather than that.
                                         
                                         Or if your application is super slow, super inefficient,
                                         
                                         often you can get a much better performance just
                                         
                                         by changing the application code a little bit.
                                         
                                         Or say, for example, you get really crappy queries
                                         
                                         and you don't have a query optimizer,
                                         
    
then it doesn't really make sense to improve your system's execution if you're just working on crappy queries, right? So again, you always need to make a trade-off and make sure that you're focusing on the right bottleneck. And that's something where you should always do this kind of benchmarking or profiling.
                                         
                                         We'll have a session on profiling
                                         
                                         where we'll talk a bit more about this, how you can actually
                                         
                                         do this yourself with the code.
                                         
    
And on the other hand, generally, it's also always good to do some back-of-the-envelope calculations to check whether this could actually be the performance bottleneck or not.
                                         
So with that, let's talk a bit about data layout, which is also something where we can improve caching behavior quite easily by doing some tweaking.
                                         
    
                                         And there's some trade-offs.
                                         
                                         And this is basically also how to directly bring some of the stuff
                                         
                                         that we did so far to database systems, right?
                                         
So far, I mean, this was about how the CPU caches work, etc., which is more generic and works the same for all kinds of systems. But data layout, this is very database-specific.
                                         
                                         OK, so let's talk about caches for data processing
                                         
                                         or for database systems.
                                         
    
                                         And the question is, how can we improve data cache usage?
                                         
And for this, we basically have to look at different data storage models, data layouts, and query execution models. The query execution models, as I said, we'll look into more deeply in a separate session, but data layout is something that's directly related to what we've seen so far.
                                         
                                         And say we have a simple query, this would be on the TPC-H dataset,
                                         
                                         so line item is the big fact table, one of the two big fact tables in TPC-H, and the biggest one.
                                         
And here we're just going to count all of the line items, so individual rows, basically, in the table, that were shipped on a certain date, where the date would be the 26th of September, 2009.
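Written out, the query would look roughly like this (a hedged reconstruction; the exact column name follows TPC-H conventions):

    SELECT COUNT(*)
    FROM lineitem
    WHERE l_shipdate = DATE '2009-09-26';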
                                         
                                         And if we don't have any index, if we don't know anything else about the table, then this
                                         
                                         means we have to do a full scan.
                                         
                                         So I mean, potentially, we have this sorted by date or something, then we could reduce
                                         
                                         the amount of data that we have to look through.
                                         
                                         But in general, there's no guarantee about this.
                                         
                                         If we don't know anything else, we have to do a full table scan.
                                         
    
                                         And there's basically two approaches how to lay out this data,
                                         
                                         in storage, in memory as well.
                                         
                                         And then of course there's some hybrid models as well.
                                         
                                         We'll also talk about this.
                                         
And this is more or less the same as we had earlier with the array layout. So we have a row-based layout and we have a column layout; this is the same thing that we talked about earlier. So I can store an array, or a table, either row by row, or I can store it column
                                         
    
by column. Row by row is good basically if we just want to write the data, right? Or if we need the whole row, or if, like in the array example, we need to go through the complete rows one by one, then a row-based layout makes total sense.
                                         
                                         And this would mean, this is also
                                         
                                         called an N-ary storage model, or NSM.
                                         
                                         And this basically means we're storing
                                         
                                         all of the attributes of a single tuple one by one
                                         
                                         within our pages, and then the complete tuples
                                         
    
                                         actually next to each other.
                                         
                                         The other way of storing this, just basically column
                                         
                                         by column or attribute by attribute,
                                         
                                         also called decomposition storage model,
                                         
                                         would mean we're actually decomposing,
                                         
                                         and that's why it's called like that,
                                         
                                         we're decomposing the tuples
                                         
                                         and just store individual attributes.
                                         
    
                                         And well, this is basically good
                                         
                                         if we're just looking at a single attribute, right?
                                         
                                         Makes total sense because then we have
                                         
                                         all of the information there,
                                         
                                         but it's more costly if we have to reconstruct the tuple.
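In C++ terms, the two layouts correspond to array-of-structs versus struct-of-arrays; a minimal sketch (the field subset is made up, not the full TPC-H schema):

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // NSM / row store: one struct per tuple, tuples stored one by one.
    struct LineItemRow {
        std::int64_t orderkey;
        std::int32_t quantity;
        std::int32_t shipdate;   // e.g. days since some epoch
        std::string  comment;
    };
    using RowTable = std::vector<LineItemRow>;

    // DSM / column store: one contiguous array per attribute.
    struct LineItemColumns {
        std::vector<std::int64_t> orderkey;
        std::vector<std::int32_t> quantity;
        std::vector<std::int32_t> shipdate;
        std::vector<std::string>  comment;
    };

    // The COUNT(*) scan only touches the ship date column: every cache
    // line it reads is full of useful values.
    std::size_t count_shipdate(const LineItemColumns& t, std::int32_t date) {
        std::size_t n = 0;
        for (std::int32_t d : t.shipdate)
            if (d == date) ++n;
        return n;
    }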
                                         
                                         So say for example, looking at our example here,
                                         
                                         just looking at an individual attribute,
                                         
                                         the decomposition model would be faster.
                                         
    
If we need all of the attributes, say, for example, then rather than SELECT COUNT(*), we would say SELECT *, selecting the whole tuple. Then we would have to reproduce, we would have to rejoin, these individual subsets, and then it becomes this kind of trade-off.
                                         
And so these are also called row stores and column stores, depending on how they store the data on disk or in memory. If we do a full table scan in a row store, then this means one tuple is next to the other.
                                         
    
Again, thinking about everything being in memory, we would have our virtual memory. Everything is nicely laid out as one long array, basically, where we have one tuple after the other, neatly packed. However, the tuples have different kinds of attributes, with different lengths; some attributes may be fixed-size, some variable-size. And then we have this ship date, which is actually what we're looking for, somewhere in there. If we're lucky, it's at a fixed position, somewhere in the middle of our tuple here.
                                         
Now, the problem is that this is basically what we're going to read when we're doing our read. These are the cache block boundaries, as we know. So if everything is in memory, we're going to read individual cache lines into the caches, and we're going to work on these cache lines. And here now we see: we don't know exactly.
                                         
    
                                         Maybe we have something where we know where these tuples start or not.
                                         
                                         But in general, if we're just scanning through, we don't know exactly which part we are interested in.
                                         
                                         So this means we would actually have to read a large amount of irrelevant data to find the individual ship date values
                                         
                                         that we're looking for.
                                         
                                         But everything else from the tuple is irrelevant,
                                         
                                         but we still need to read it to find this ship date
                                         
                                         and work with that.
                                         
                                         Even if we're just looking at the individual cache lines
                                         
    
                                         that we're interested in, there's still
                                         
                                         a lot of information that's completely irrelevant to us.
                                         
                                         So that means the CPU will wait a lot just reading all this
                                         
                                         data just to find out, well, this is not relevant for me.
                                         
                                         This is not relevant for me.
                                         
                                         Maybe this piece of information is relevant for me,
                                         
                                         but even the rest of the cache line is not relevant.
                                         
                                         If we have a column store layout, so we're basically
                                         
    
                                         decomposing the individual tuples into their attributes,
                                         
                                         then we have pages in memory that look like this.
                                         
                                         So we have one page that just is full with ship dates.
                                         
                                         And this means if here then we have
                                         
                                         to read the individual cache blocks or individual cache lines,
                                         
                                         then all of the information that we're reading is actually relevant for us.
                                         
                                         So this means in this case, we're going to be much faster in processing the data.
                                         
                                         The cache will be much more efficient just because of the layout.
                                         
    
                                         However, of course, if later we have to reconstruct the tuple,
                                         
                                         this will be, again, more costly, because we'd have to read all of the individual parts. We have
                                         
                                         to join this back, which is something that in the original layout, in the row store layout,
                                         
                                         we still have the full tuple. So if my final result, if say, for example, all of my tuples
                                         
                                         will match and we have to reproduce the full tuple,
                                         
                                         then this will be cheaper again because I
                                         
                                         have all of the information there.
                                         
                                         I can output this directly.
                                         
    
                                         Well, in this setup, I first have to find the relevant tuples
                                         
                                         quickly, but then I have to reproduce everything else.
                                         
                                         I have to read everything, like all of the other attributes, and I have to
                                         
                                         recombine this.
                                         
                                         However, as I said, it's a trade-off, but typically,
                                         
I mean, we have some selectivity, and then this makes a lot of sense, right? And if we're not looking at the complete tuple, but say, as in our COUNT(*), we can do the count just on the ship date, right?
                                         
    
                                         So all that we need for this query, for example, would be just looking at these pages and we
                                         
                                         can completely calculate the result.
                                         
And consider also the cost of reading this.
                                         
                                         We're still looking at each individual tuple here,
                                         
                                         each individual attribute only once.
                                         
This means we don't get good caching behavior per tuple. However, we're reading complete cache lines, so we're reading multiple tuples at the same time, and the same cache line will be used multiple times in our processing. So we have much better caching behavior than we had earlier.
                                         
    
So not all of our accesses will still be cache misses. Because in the first case, every cache line basically contains at most a single attribute value of interest. This means I won't have it in the cache, unless I'm using some kind of prefetching. But if I don't use prefetching, I won't have it in the cache. That means I will have to read it when I require it, and so every single access will be a cache miss. Here, in this case, at least for the number of attribute values per cache line, we will have cache hits and better caching effects. Yeah, and if we're lucky, the full data might even fit into the caches, and then we actually get much better behavior.
                                         
                                         We're going to have very nice performance.
                                         
And for some smaller tables, that might actually work.
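A quick back-of-the-envelope, assuming 64-byte cache lines and 4-byte date values (both assumptions): 64 / 4 = 16 ship dates per cache line, so the sequential column scan pays roughly one miss per 16 values read, while in the row layout every single ship date access can be a miss.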
                                         
Okay, but the trade-off, and I already hinted at this: we have this recombination cost, right?
                                         
                                         So we have to recombine the tuples in the end.
                                         
                                         We need to perform all these joins
                                         
    
                                         to basically get the final tuples again.
                                         
And the trade-off here is really dependent on the workload. If we're just counting, for example, there's no cost at all, because we can optimize this away and do it just on the ship date. If we need the full tuple in the end, it's different. And a lot of this will again be application-specific, right? So some of this you can decide in the application: do I really need to count in the application, or do I send the count to the database? Then the database could be much faster, for example.
                                         
    
                                         However, there's also ways to kind of get
                                         
                                         this hybrid approach.
                                         
And one way is PAX, the Partition Attributes Across layout.
                                         
                                         So rather than having the individual attributes
                                         
                                         in separate pages, we can do mini pages inside a memory page and have all of the
                                         
                                         attributes of a single tuple within these sub pages and basically group the attributes in a
                                         
                                         way so that we still have this cache performance. But we also have locality in the memory pages so that we don't
                                         
    
                                         basically might go into the problem that we have like pages maybe swapped out somewhere
                                         
                                         for parts of the tuples which would give us additional problems.
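A simplified sketch of the idea in C, leaving out the page header and offset bookkeeping a real PAX page would carry; field names and sizes are just illustrative:

    #include <stdint.h>

    #define PAGE_TUPLES 256

    /* One 4 KB page holding 256 tuples, decomposed into mini-pages:
     * a scan of attribute a stays contiguous, yet all attributes of
     * tuple i still live on the same page. */
    struct pax_page {
        int32_t a[PAGE_TUPLES];  /* mini-page for attribute a (1 KB) */
        int32_t b[PAGE_TUPLES];  /* mini-page for attribute b (1 KB) */
        double  c[PAGE_TUPLES];  /* mini-page for attribute c (2 KB) */
    };                           /* total: 4096 bytes, one page      */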
                                         
Right, and then of course, rather than saying I want this hybrid model just within a single page, I can also say: why not decompose the data set into two parts? I have a column-store layout and a row-store layout for different parts of the data. Typically, for example, new incoming data, if it arrives fast, is much faster to write in a row layout. Or if I have to iterate over it in an OLTP fashion, so I use the whole tuple and update things in it, then the row-store layout actually makes sense. Or if I delete tuples, then I just have to touch this one page rather than everything.
                                         
So then I can basically say: I store the new incoming data in a row layout, and all the data that's not going to change much anymore, I store in a column-store layout. And this is what typical column stores today actually do. SAP HANA, for example, or C-Store, whose commercial version is Vertica, I think. So systems that use this column layout basically have this kind of setup, where they keep the new data in a row store and the older data in the column store.
                                         
So with that, there's more we can do. And one thing that we can also, to some degree at least, think about is alignment. Basically, any kind of data access can be aligned or unaligned. And that means: do we read the data in a way that is aligned for the CPU, so that it can easily be put into the registers, or will the CPU have to shift it around first?
                                         
In the past, in older architectures, everything needed to be aligned. Say I have an integer, which is four bytes. Then it needs to start at address zero or at a multiple of four, so any address that is zero modulo four. It cannot be shifted somehow, because otherwise the CPU would have to shift it back into the right position, and old CPUs weren't able to do this.

So if you think about memory this way, with values that are 4 bytes, for example, then anything that starts at address 0, or at an address that is 0 modulo 4, would be aligned. Everything in between would not be aligned. And anything not aligned needs some extra work from the CPU to move it back into an aligned position, to then have it easily copied into the registers, etc.
                                         
This depends on the word size and the cache line sizes. The CPU, and also the compiler, will do automatic padding or automatic alignment to fix this. However, it takes some time. Modern CPUs put a lot of work into mitigating these problems. But if you know that this happens, you can actually get some additional performance.
                                         
So let's look at the typical structure representation. Here we have a very simple translation of something like a table. Say, for example, we have four integers, a long, and then a next pointer. This could be something like an index structure that we're using. And this is then just a block of memory; the compiler will translate it directly into this kind of structure. All of the fields the compiler must lay out in the order that we've declared them in our struct. It cannot reorder them. This is something the compiler adheres to, so we don't get weird effects somewhere.
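A plausible reconstruction of that struct in C — my guess at the slide's example, so the names are made up:

    #include <stdint.h>

    /* Four integers, a long, and a next pointer, e.g. a node
     * of a linked index structure. */
    struct node {
        int32_t key[4];      /* 4 x 4 bytes, offsets 0..15  */
        int64_t value;       /* 8 bytes, offset 16          */
        struct node *next;   /* 8 bytes, offset 24 (x86-64) */
    };
    /* sizeof(struct node) == 32 on x86-64: every field already sits
     * on a multiple of its own alignment, so no padding is needed. */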
                                         
And now the compiler might need to do some padding. So the compiler determines the overall size and the positions of the fields. The machine-level program then has no understanding of the structures that you defined; it will just see this area of memory. And the CPU expects this to somehow be aligned. In this case, everything would be nicely aligned, so everything is fine. But if something is shifted, say, for example, there is one byte first and then our integer starts, the CPU will have to realign it. And the compiler will try to optimize this as well.
                                         
On x86, Intel, for example, recommends that data be aligned. So basically everything should be aligned, meaning that, as I said, within the word size, so within four bytes, for example, we want to make sure that every integer starts at one of these addresses. And even if that's not the case, the CPU will still work, right? But it needs to do some extra realignment.
                                         
So let's look at another example, maybe. Here we start with a similar data structure as earlier: we start with a character, then we have an integer array with two elements, and then we have a double. And you can see that in this case, starting again at 0, for example, or wherever we're starting, it doesn't really matter, we're typically not aligned. First we have this character, and then come the integers. These will not be aligned, because they're not at an address that is 0 modulo 4 but at address 1, for example.
                                         
That means, ideally, we want to get some alignment again. And what the compiler typically will do is insert some padding, some additional space here. If it doesn't, then the CPU will have to do the alignment in memory: it might need to read multiple cache lines, or multiple words, to get the correct word, and then shift it back into the right alignment, so that in the end it can actually work with these individual integers properly. And doubles, for example, need to be aligned even more strictly, on 8-byte boundaries, to be put into the right place. So if the compiler inserts the padding, this will actually give better performance than the unaligned data, because otherwise the CPU has to realign.
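Here is what that looks like in C; the padding shown is what a typical x86-64 compiler inserts, and you can check it with sizeof:

    #include <stdio.h>

    struct example {
        char   c;     /* 1 byte at offset 0                        */
                      /* 3 bytes padding so i[] starts at offset 4 */
        int    i[2];  /* 8 bytes at offsets 4..11                  */
                      /* 4 bytes padding so d starts at offset 16  */
        double d;     /* 8 bytes at offsets 16..23                 */
    };

    int main(void) {
        /* Prints 24 on typical x86-64: 17 bytes of data, 7 of padding. */
        printf("%zu\n", sizeof(struct example));
        return 0;
    }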
                                         
    
OK, so typically, the compiler maintains the declared ordering. As we said, it won't do any reordering. You can do the reordering yourself, you can just change your struct if you want, but the compiler cannot. What the compiler can do is insert some padding, this kind of empty space, in order to make sure that all of the elements in the struct are properly aligned. Then the overall struct must be aligned according to its largest field, and the total struct size must be a multiple of its alignment. And again, the compiler will insert some padding for this.
                                         
For strings and other variable-length data, the data will be split into a fixed part, which holds the length and a pointer to the tail, and the variable-size tail itself. The fixed part is positioned according to the field order of the struct, but the variable part is placed towards the end of the struct.
                                         
So now, if we want, we can make sure that we reorder our data so that it's better aligned in memory and the compiler doesn't have to do as much padding. A simple example: instead of starting with the character, we can start with the integer and add the characters later on, which gives us better alignment, a tighter representation in memory, and less padding. Rather than using 12 bytes, because we want the integer aligned at modulo 4 again, so at position 4 in this case, we put the integer at the beginning, and the characters, since they're single bytes, can be tightly packed, but the struct overall should still be aligned to its total size again.
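In C, the two orderings look like this; the sizes are what a typical x86-64 compiler produces:

    struct padded {    /* char first: 12 bytes                        */
        char c1;       /* offset 0, then 3 bytes padding              */
        int  i;        /* offsets 4..7                                */
        char c2;       /* offset 8, then 3 bytes tail padding so an
                          array of these keeps i aligned              */
    };

    struct reordered { /* int first: 8 bytes                          */
        int  i;        /* offsets 0..3                                */
        char c1;       /* offset 4                                    */
        char c2;       /* offset 5, then 2 bytes tail padding         */
    };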
                                         
    
And, well, you can test this. For this I actually don't have an example here, but you can write a simple struct with different kinds of alignment, with padding and reordering, and measure the differences. Accordingly, and this I didn't test, it's from Andy Pavlo's course, the difference in throughput can be significant. With no alignment, you're below half a megabyte of throughput; with padding, you can get 11 megabytes of throughput at memory speed; and with the reordering, even more than that. So making this efficient can have a significant impact here. Again, this one I didn't test, I just took it from the lecture. But it's something you can try: write a small program and test it.
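A rough sketch of such a micro-benchmark, just one way to do it, not the setup from the course; swap in the padded or reordered struct from above and compare:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct reordered { int i; char c1; char c2; };  /* 8 bytes */

    #define N (1 << 22)  /* arbitrary: ~4M elements */

    int main(void) {
        struct reordered *arr = calloc(N, sizeof *arr);
        if (!arr) return 1;
        long long sum = 0;
        clock_t t0 = clock();
        for (int k = 0; k < N; k++) sum += arr[k].i;  /* scan one field */
        clock_t t1 = clock();
        printf("sum=%lld time=%.3fs\n", sum,
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        free(arr);
        return 0;
    }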
                                         
    
                                         Yes?
                                         
                                         If reordering has such a big impact,
                                         
                                         why isn't the compiler able to do so?
                                         
                                         Is it too hard for the compiler to reason about it?
                                         
Well, it's a good question. I'm assuming this has something to do with compatibility. If you can't be sure anymore how the data is laid out in memory, then you cannot work with the data in different areas of your code, or something like that. That would be my assumption. Because it would basically mean you cannot reason at all anymore about how the data is laid out in memory: you're writing some code, and how the data ends up in memory, you don't know anymore. So that's my assumption. That's why there's a clear translation of a struct to how it's laid out in memory, so that you can be sure: if you want to access this again and you compile something else, it will also be able to access it. My assumption.
                                         
                                         OK, other questions?
                                         
                                         And a lot of this you can actually also test.
                                         
                                         I'll show you in a bit how you can play around with this
                                         
                                         as well.
                                         
    
Okay, so this basically sums up the computer architecture part and gives us a nice point for, let's say, a three-minute break.
                                         
                                         Okay, I guess we can get started again.
                                         
                                         So, instruction execution.
                                         
                                         So, we're still in the CPU architecture. Now this is more about how does the CPU actually execute
                                         
                                         instructions internally.
                                         
                                         So I want to talk about micro operations and pipelining
                                         
                                         and then pipelining hazards.
                                         
Basically, when does this not work well? What works well with modern CPU architectures and what does not, at least to some degree. And I'm going to show you some of the current pipeline architectures, if we get that far.
                                         
                                         If not, we're going to see it next time.
                                         
                                         So well, let me see. Yeah. So let's get started.
                                         
So basically, if you think about assembly code that you could write, later on in the SIMD task, for example, we'll see these intrinsics, which appear to be direct instructions for the CPU. But the CPU actually breaks them down into even smaller instructions. Any kind of instruction that you can think of will be broken down into micro-operations and executed in pipelines. So it's not just "I'm doing an addition now"; even an addition will be broken down into smaller operations. First you need to prepare the CPU, the execution pipeline: this will be an addition now. Then you need to load the registers, then perform the execution of the instruction, and then write the result back. These are basically the phases. And this is simplified; modern CPU architectures might have longer pipelines, five or six steps, and some additional connections in there. But this is, let's say, the basic theoretical concept.
                                         
We start with the instruction fetch cycle. This is where we get the next instruction: what will be executed next? And we update the program counter, which tells us where we are in our program right now, and load the correct instruction.
                                         
Then we have the instruction decode and register fetch cycle, which could also be two individual steps, but here, for simplicity, and some architectures do it like this, it's one. We get the instruction and have to translate it into something the CPU understands, and we read the registers. This is typically done in parallel: we try to find the data that will be used, and we read it from the registers. This stage might also evaluate the branch conditions. So if we have something like an if statement or a for loop in there, then the instruction decode and register fetch cycle will decide: for this branch, we're going here or we're going there.
                                         
Then we have the execution and effective address cycle. Here, say, we have all of the data that we want to work on in our registers. Then we tell the ALU, the Arithmetic Logic Unit: please add up these two values, whatever is in the registers, and write the result back to that register, for example. Or, if we have a load or a store, then we do the read or write from memory in this phase. This might also mean: I don't have the data in the registers yet, I need to get it from memory. If it's in the cache, we just read it from the cache. If it's not in the cache, then we go to the memory unit and ask it: please give us this data.
                                         
Then, in the last step, we have the write-back cycle. This means writing the result into the register file. So we write, say, the result of our addition into the correct register, or we write the data that we read from memory into the register that we want, or we write the result coming from the ALU back there.
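Just to tie the four phases together, here is how a single C statement would, very roughly, move through this simplified pipeline; illustrative only, real CPUs split these stages much further:

    /*  a = b + c;
     *
     *  IF  - fetch the add instruction, advance the program counter
     *  ID  - decode it, read b and c from the register file
     *  EX  - the ALU computes b + c
     *  WB  - write the sum into the register holding a
     */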
                                         
These four parts are also split up into frontend and backend. This is also what you see if you do some profiling: you might see frontend and backend stalls. The frontend is basically everything about the code. So if your code is not properly working, or you have some self-modifying code or something like that, or you have bad code locality, then you might see problems in the frontend. The backend is where the execution happens: if nothing makes progress because you have contention there, or you have bad cache locality and everything needs to come from memory all the time, you have stalls there; these would be backend problems. And ideally, if you don't have any problems, you see that instructions are retired all the time, meaning you execute an instruction, the instruction is done, it will be retired, and everything worked out nicely.
                                         
So now we have this pipeline, right? As I said, four, maybe more, steps. If we have non-pipelined execution, like in a very old kind of CPU, this would mean we execute one instruction after the other, and each step takes basically one clock cycle. So we use four clock cycles to execute one instruction, and when the next instruction comes, we again need four clock cycles for it. That means we execute two instructions per eight clock cycles, or a quarter of an instruction per clock cycle. This is called instructions per cycle: 0.25 instructions per cycle here, or four cycles per instruction. So the CPI would be four, and the IPC would be 0.25. The instructions per cycle you want as high as possible, and this can go above 1 in typical modern CPUs, even single-threaded. And you want as few cycles per instruction as possible, and this can go below 1 in modern CPUs. And we'll see why in a bit.
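To make the arithmetic concrete, assuming the simplified four-stage pipeline from above: non-pipelined, every instruction occupies the machine for all four cycles, so 2 instructions take 8 cycles, giving CPI = 8 / 2 = 4 and IPC = 2 / 8 = 0.25. Once the pipeline is kept full, as described next, one instruction completes every cycle, so CPI = IPC = 1.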
                                         
So if we do this in a non-pipelined fashion, of course we're going to have a bad IPC and CPI, because we're just doing one thing at a time. But these individual steps of the pipeline are handled by different parts of the CPU. Loading and storing, writing to registers, fetching instructions and decoding them: these can happen in parallel. We've seen all these additional structures on the CPU, at least to some degree. So these steps can be done separately, also working with the caches, loading, et cetera. We can actually do multiple things in parallel. And for these micro-instructions, we use so-called pipelining, or pipeline parallelism.
                                         
Meaning, in one slot we can do only one instruction fetch at a time, but we can do one every cycle. In every clock cycle, we can fetch another instruction, and while one instruction is being decoded, we can already fetch the next one. This goes on and on, and that means we have a nicely, fully loaded pipeline. And we get an instructions-per-cycle count of one, or equivalently a cycles-per-instruction count of one; they will be the same. This is perfectly pipelined, and with a single pipeline this is as good as we can get. Of course, this won't always work, but ideally we get there as often as possible.

And in an ideal world, if we break our pipeline down into smaller stages, say k stages, and this holds for pipelining in general, even if we do stream processing and put multiple operators on different nodes, the hope is that we get a throughput improvement by a factor of the number of pipeline stages. However, the slowest sub-instruction determines the overall clock frequency: whatever the slowest step is determines the fastest we can clock our CPU.
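Written out, and this is the idealized textbook bound, not something specific to the lecture: if a full instruction takes time t and stage i takes time t_i, the clock period is bounded below by max_i t_i, so the speedup over non-pipelined execution is

    speedup = t / max_i t_i <= k,

with equality exactly when all k stages have equal length t / k.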
                                         
So we try to break down the instructions into these equal-length parts and then, of course, try to reduce the number of cycles it takes to execute an instruction. We always have to go through all the steps; the question is, do we have some stalls in there? Does something need more time? Say, for example, if we go to memory, this will stall everything; there we will need to wait. But the stages here actually run truly in parallel, and this is also why we get this very high throughput and can execute multiple instructions at the same time. At the same time, it also means that each instruction will not take just a single clock cycle to execute. This you also have to keep in mind: it's not one clock cycle and the instruction is done; it's one clock cycle and one stage is done. We'll need a couple of clock cycles to actually perform a single instruction.
                                         
                                         OK.
                                         
    
Now this is nice, right? We have this pipeline, but there are some problems with it. There are different kinds of hazards for the pipeline, different kinds of things that can happen that break our pipeline.

First, there are structural hazards. Say, for example, we have different pipeline stages that need the same functional unit on the CPU, multiple stages that need the memory unit, say, or the decode unit. Then we have contention there; they have to wait, and this breaks our pipeline. And this can happen, right? We will have multiple pipelines on a single CPU, and if all of them need the same functional units, we have a problem; they basically have to wait. Even if, say, we have multiple pipelines but not enough ALUs on the single core, well, they have to wait. There's no way around it.
                                         
    
But then there are also data hazards, meaning we have two operations, two instructions, that work on the same data, and one instruction is not ready with its result when the other instruction needs it. So we might be in the register fetch part of one instruction, where we actually want the result of the previous instruction, but the result is not there yet; we're not in the write-back phase of the first instruction. So then we basically have a problem here as well.
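A tiny C illustration of such a read-after-write dependency, nothing more than a sketch:

    /* The multiply needs x while the add has not reached its
     * write-back stage yet, so a simple in-order pipeline must stall
     * (modern CPUs forward the result between stages instead). */
    int dependent(int a, int b) {
        int x = a + b;   /* x written back in the add's WB stage   */
        return x * 2;    /* x needed already in the multiply's ID  */
    }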
                                         
We might have control hazards, meaning we have different kinds of branches, and all of a sudden a branch goes the wrong way, or we have to wait for the branch condition to be evaluated, which might be a complicated calculation as well. Well, then we're losing our pipelining again.
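Again just a sketch in C of where such a branch shows up:

    /* Until v[k] > 0 is evaluated, the CPU does not know which
     * instructions to fetch next, so it has to predict the branch;
     * a data-dependent condition like this predicts badly on
     * random data and costs pipeline flushes on mispredictions. */
    long sum_positive(const int *v, long n) {
        long s = 0;
        for (long k = 0; k < n; k++)
            if (v[k] > 0)
                s += v[k];
        return s;
    }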
                                         
And then, of course, data access. If we have to go to memory, then, well, memory will take a couple of cycles, 200 cycles, something like that. Within these cycles, if we really have to wait for the data to become available, the CPU won't be able to do anything. So all of these will basically lead to pipeline stalls. And a pipeline stall means, well, the CPU cannot do anything while waiting for something.
                                         
                                         So say, for example, we have one memory access unit. This is an example that I just gave you. So our instruction 1 or instruction i
                                         
                                         wants to do a memory fetch.
                                         
                                         And then another instruction, i plus 2, for example,
                                         
                                         is basically issued.
                                         
    
                                         And they want to basically do the same thing.
                                         
So say, for example, instruction fetch and memory access use the same unit, so whenever we're doing a memory access, we cannot do any instruction fetch. Then, as soon as instruction i plus 2 wants to use the memory unit for its instruction fetch, we're in a problem and we have to stall. Stalling means, well, the CPU in this stage doesn't do anything, or there is no unit to do something, so this pipeline will be stalled for a cycle, and we're basically decreasing our instructions per cycle counter.
                                         
So we're going to have fewer instructions per cycle, because in one cycle, we're not finishing up one instruction.
                                         
                                         So you can see here, this cycle, say cycle four or something, this instruction is ready.
                                         
                                         Cycle five, this instruction is ready.
                                         
                                         Cycle six, no instruction is ready.
                                         
                                         So we have to go all the way to cycle 7
                                         
                                         to get the next instruction.
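As a hedged illustration (generic stages and cycle numbers, not the exact slide), a four-stage pipeline where instruction i+2 is stalled for one cycle looks like this:

Cycle:   1    2    3    4    5    6    7
i:       IF   ID   EX   WB
i+1:          IF   ID   EX   WB
i+2:               IF   --   ID   EX   WB

Instructions i and i+1 complete in cycles 4 and 5, the one-cycle bubble means nothing completes in cycle 6, and i+2 only completes in cycle 7.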
                                         
                                         And here, this is all about CPU architecture.
                                         
                                         And it depends on the workloads, of course.
                                         
    
                                         But do you have a CPU that fits your workload?
                                         
I mean, you might even be able to code in a way that you only work with small parts of the CPU, but you'd probably have to really try. And then you get into these structural hazards.
                                         
                                         So typically, the CPU designers will
                                         
                                         try to make sure that the CPU is well balanced
                                         
                                         and has proper hardware provisioning.
                                         
    
And then the compiler, and to some degree also the CPU, will try to do some reordering of the instructions in a way that it doesn't come to these stalls.
                                         
                                         But this might happen if you have a very specific
                                         
                                         kind of workload.
                                         
                                         Then we have the data hazards.
                                         
                                         And this is basically whenever two instructions
                                         
                                         or multiple instructions try to access the same data.
                                         
                                         And one thing is we have the read after write. This is basically if a register
                                         
    
is read by an instruction before it's actually written by a previous instruction, right? So in our pipeline, all the way at the end of the pipeline, we're writing the result. And now
                                         
                                         another instruction will try to read this.
                                         
                                         So say this next instruction will
                                         
                                         want to work on this result. Then we might issue the read,
                                         
                                         so the register fetch, before the register was actually
                                         
                                         written.
                                         
                                         And then we have a problem.
                                         
    
So then the result is not in the register.
                                         
                                         We cannot read it in this cycle.
                                         
                                         So then we need to put a stall in there.
                                         
                                         Basically, the CPU somehow has to wait for the result in order
                                         
                                         then to fetch the register and give it
                                         
                                         to the next instruction.
                                         
                                         Then there's also the possibility
                                         
                                         to have a write after a read.
                                         
    
So basically, a register is written by a later instruction before an earlier instruction has actually done its read.
                                         
And this kind of problem can basically only happen if the instructions are reordered by the CPU.
                                         
                                         So the CPU, modern CPUs will reorder instructions
                                         
                                         in order to get better performance,
                                         
                                         in order to use up all the functional units properly.
                                         
    
                                         So if some instructions take longer time,
                                         
                                         need to read something from memory,
                                         
                                         the CPU will try to do some other instructions
                                         
                                         in the meantime.
                                         
But in this case, after the reordering, the write might actually happen too early. The later instruction then overwrites a value that should still have been read by an earlier instruction.
                                         
    
If this happens, of course, then the write must be stalled until the read has been done.
                                         
                                         And write after write is the same thing.
                                         
                                         So we're basically reordering two writes,
                                         
                                         and then we might end up with the wrong result
                                         
                                         unless we're basically stopping and changing the pipeline here.
                                         
                                         So, let's look at an example.
                                         
So, here are some simple instructions, some assembly.
                                         
                                         So, we have an addition, a subtraction, an and and an or
                                         
                                         on the different kind of registers.
                                         
                                         And real assembly looks something like this, right?
                                         
                                         Usually, you have different registers.
                                         
                                         The first one would be the result here.
                                         
And here, say, for example, in the add, we're writing to register 1. And then in the subtraction, we're reading from register 1.
                                         
    
                                         And then now, if we're basically looking at this here,
                                         
                                         then in the instruction decode and register fetch cycle here,
                                         
                                         we're basically reading what's been written here.
                                         
                                         So this would result in a problem.
                                         
                                         Because this has not been written,
                                         
                                         we don't have the result yet.
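In C++ terms, this dependency corresponds to something like the following minimal sketch (the variable names are made up for illustration, not the slide's registers):

// Read-after-write dependency: the subtraction reads t immediately
// after the addition writes it. Without forwarding hardware, the
// pipeline would have to stall until the addition's write-back.
int raw_example(int a, int b, int c) {
    int t = a + b;   // first instruction: writes t (like register 1)
    int u = t - c;   // next instruction: reads t right away
    return u;
}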
                                         
                                         And in hardware, this is solved to some degree
                                         
                                         with some bypass and forwarding hardware.
                                         
    
Or say, for example, in the instruction decode, you can basically say, OK, the result will be in this register, and I'm directly using this result in my instruction here.
                                         
                                         So, then I don't have to stall the pipeline, but I need something in here or in this case also, right?
                                         
                                         So, I want to access this value.
                                         
                                         I know which register it will be in, so I can basically, the CPU can pass through the value,
                                         
                                         can say, okay, this will be ready in the cycle when you actually need it, we will have it there
                                         
                                         for you. So then in this execution, the data will already be here, and in this case as well.
                                         
    
                                         However, it needs special hardware and special support. If not, we would have to stall the pipeline here.
                                         
                                         So if not, we would basically do the instruction decode
                                         
                                         at this stage when the writeback has already been performed.
                                         
                                         Because then we know the data is in the register already.
                                         
                                         Another example here, we're basically writing back the register results in the load.
                                         
And then another instruction basically tries to read what the load instruction writes, before it was actually written.
                                         
                                         And this one actually cannot really be mitigated because we don't know how long this will be taking.
                                         
    
                                         So if we're reading with a load, then the hardware
                                         
                                         needs to basically be stalled.
                                         
                                         So then until the load is performed,
                                         
                                         then the instruction or the pipeline will be stalled,
                                         
                                         and only then it will continue in this case.
                                         
So here, once the load has been executed, as within the writeback phase we know where the value is going to be, we can actually continue here.
                                         
    
                                         By scheduling the tasks in a different way, by reordering
                                         
                                         the instructions, you can mitigate some of this by basically not depending on
                                         
the loads all the time right away. Then we have the control hazards, and this is basically from the branches.
                                         
                                         As soon as we have branches, we have to know how these branches will be evaluated.
                                         
A simple way of doing this is basically: as soon as we're in the branch, we're executing the branch, flushing the pipeline, and starting from there when we know what the branch result will be.
                                         
    
                                         But of course this is costly because we basically have to completely flush the pipeline.
                                         
                                         So here we have our branch instruction and we need to evaluate this instruction in order
                                         
                                         to know where to continue.
                                         
                                         And this basically means at this point, we're basically waiting and then continue with the target
                                         
                                         instruction here.
                                         
                                         If we don't know the result yet, then we
                                         
                                         might even have to completely flush and continue from there.
                                         
                                         But this means in this case, we're paying actually a lot.
                                         
    
                                         So now the question is, how can we mitigate this in software?
                                         
                                         So how can we deal with this?
                                         
                                         A lot of this hardware will try to mitigate this.
                                         
                                         But there's also some things that we
                                         
                                         can do in order to mitigate this.
                                         
                                         And some stuff, well, first let's
                                         
                                         look at some of the things that are actually in hardware and that we can utilize in software.
                                         
And some of it is directly utilized by the CPU and the compiler.
                                         
    
                                         So after the fetch, one thing is that there's superscalar hardware.
                                         
So there are multiple pipelines at the same time.
                                         
                                         And this is even true for a single core today.
                                         
                                         So we have the different instructions and we're fetching them,
                                         
but then we might have separate pipeline stages, or multiple units to do the same kind of instruction stages, but maybe with different functionality.
                                         
And multiple functional units would be something like, OK, we have, say, for example, the arithmetic logic unit, or multiple of those.
                                         
    
                                         We have loading and storing units, and we have floating-point units.
                                         
                                         And this exists in every core today, right?
                                         
                                         So you have this instruction fetch system,
                                         
                                         but it might actually do multiple instructions at a time,
                                         
                                         decode them, have kind of a buffer there.
                                         
                                         You have the decode stages, and then you have multiple units,
                                         
                                         or you even have multiple decode stages or units,
                                         
                                         and then you have multiple units that execute different parts,
                                         
    
                                         and some of them even replicated multiple times, even in a single core.
                                         
                                         So this already will give you some mitigation of especially these structural hazards,
                                         
because basically we have multiple units, and we can use multiple of them at the same time. Typically, though, we cannot use all of them at the same time. So basically, there is some flexibility in what we're using at which time.
                                         
    
                                         So the CPU will have a certain number of ports
                                         
                                         where we can actually execute, or the core,
                                         
                                         where we can execute multiple instructions in parallel or multiple pipelines in parallel.
                                         
                                         And then we can choose out of the functional units which ones are used right now or which
                                         
ones are needed right now. And with this, we actually get instructions per cycle larger than one, because we have parallelism, not only pipeline parallelism, but actual parallelism as well with these different kinds of units.
                                         
    
Pipeline parallelism is also actual parallelism, but here we have additional superscalar parallelism.
                                         
One thing that the CPU does, because the branches are so bad (we're flushing the pipeline, we have to wait for the result): the CPU will try to guess which branch we will take. It will predict and then just speculatively execute this branch.
                                         
    
                                         And this is basically the branch prediction.
                                         
                                         This is something that we saw in this very first sort example.
                                         
So this is good if it works; it's bad if it doesn't work, because then we basically have to go back, all of the work was done in vain, and the CPU has to do the other task.
                                         
                                         And the prediction needs to be done very early.
                                         
    
And so usually there's some very simple buffer in the CPU, so branch target buffers or branch target caches, that basically says: well, at this program counter, in this part of the program, what was the last branch that we took? Or what was the prediction that we made? And then also, was this prediction correct or was it incorrect?
                                         
                                         And this is really done in parallel
                                         
                                         to the instruction fetch.
                                         
                                         So while the branching instruction is basically
                                         
                                         fetched, the CPU will look up in the branch target buffer
                                         
                                         or branch target cache and make its prediction.
                                         
                                         And then basically based on that,
                                         
                                         already schedule the next instructions,
                                         
    
meaning the instructions for the branch that it predicts. And, well, if there's no entry for a program counter, then it will just use a random prediction. If there's an entry, it will follow the prediction. The inner workings are actually brand secrets, so the CPU manufacturers will have their own techniques. But we can easily trick this, right, by using uniform distributions and randomness. Then this won't work well. It will basically be wrong half of the time,
                                         
                                         because it cannot do anything better there.
                                         
    
                                         But we can do something.
                                         
                                         We can change something.
                                         
                                         Also, the compiler can do something.
                                         
                                         So rather than doing this branching,
                                         
                                         we can actually say we're not going to do branching.
                                         
We're just turning this control flow, meaning if the data is this, then do this, and if the data is that, then do that, into a data flow, where we can say: with this data, do this; with that type of data, do that. So we're changing this. We're always going to operate.
                                         
                                         And then there's no branching.
                                         
                                         The CPU doesn't have to select which type of code to use.
                                         
                                         The CPU will always know, okay, this is the next thing that I have to do.
                                         
                                         And say, for example, here we have a selection query.
                                         
                                         Again, similar thing.
                                         
                                         Line item, we're doing a count star.
                                         
    
                                         If we do a basic C++ program or a C program with a branch,
                                         
then we would say if the quantity is smaller than n, or equal to n, I mean, here, it should be equal in this case, then we're increasing our counter.
                                         
                                         And this is a branching condition
                                         
                                         if we directly translate it.
                                         
                                         And this would look something like this.
                                         
                                         So the CPU will speculatively always
                                         
    
                                         try to find the right branch.
                                         
And if we have 0% selectivity, the branch prediction will be good.
                                         
                                         If we have 100% selectivity,
                                         
                                         the branch prediction will be very good
                                         
                                         because we're always taking the right branches.
                                         
                                         If we have 50% selectivity
                                         
                                         and we have a uniform distribution, of course,
                                         
    
then the branch prediction will be bad, because 50% of the time we're taking the wrong branch; we're just guessing. So then the performance will be worse.
                                         
                                         And well, another way of doing this is in a data flow way,
                                         
                                         rather than saying we're doing an if statement here,
                                         
                                         we can just say we're always doing an addition,
                                         
    
                                         but we're adding the result of this logical operation.
                                         
                                         So this basically says, is this greater or smaller?
                                         
                                         And this is not a branch or anything.
                                         
                                         It's just a logic operation.
                                         
                                         And the result will be true or false, or 0 or 1 in C++.
                                         
And we can just add this to our count.
                                         
                                         Then we get the same result, basically.
                                         
                                         But in this case, we'll always do the same.
                                         
    
                                         Doesn't matter what the selectivity is,
                                         
                                         because we'll always basically operate,
                                         
                                         we'll always do this addition.
                                         
                                         We'll also do the addition if it's not necessary,
                                         
                                         because we're just going to add zero.
                                         
                                         And this means we'll have a performance something like this.
                                         
So in the cases where the branch prediction would be better, say, for example, with very little selectivity, we never have to execute the addition and the branch prediction will also almost always be correct, so then the branched code will be faster. In the cases where we have to do a lot of wrong branch predictions, the predicated code will be faster, because we never get any branch mispredictions. We're always just doing our simple addition, which is faster than a wrong branch prediction.
                                         
                                         So wrong branch prediction will be multiple cycles.
                                         
    
                                         One addition is basically a single cycle extra.
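As a hedged sketch of the two variants (the column, the threshold n, and the loop are illustrative placeholders, not the exact lecture code):

#include <cstddef>
#include <vector>

// Branched version: one conditional branch per tuple. Fast when the
// predictor is right (selectivity near 0% or 100%), slow around 50%
// selectivity on uniformly random data.
std::size_t count_branched(const std::vector<int>& quantity, int n) {
    std::size_t count = 0;
    for (int q : quantity) {
        if (q < n) {        // branch the CPU has to predict
            ++count;
        }
    }
    return count;
}

// Predicated version: no branch; we always add the 0-or-1 result of
// the comparison, so the work is constant regardless of selectivity.
std::size_t count_predicated(const std::vector<int>& quantity, int n) {
    std::size_t count = 0;
    for (int q : quantity) {
        count += (q < n);   // comparison yields 0 or 1, always added
    }
    return count;
}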
                                         
And well, this is also done, of course, in hardware. So the CPU will do this with hardware as well; it will try to figure this out while it's running.
                                         
                                         And I mean, we can also do this
                                         
with different kinds of conjunctive predicates.
                                         
So say we have multiple predicates in a conjunctive query, and we would do this with the double ampersand, the logical AND. This will typically translate into many branches. So if we're writing this kind of code, it will be translated into one branch per predicate. Alternatively, we can do a bitwise AND, and that will translate into fewer branches, basically just one branching condition. But we could also do this again in the predicated way, producing the same kind of result by just doing the operation on each individual tuple rather than doing an if statement or a branching statement here; see the sketch below.
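A hedged sketch of the three variants on two hypothetical predicates (the Tuple struct, columns, and thresholds are made up for illustration):

#include <cstddef>
#include <vector>

struct Tuple { int quantity; int discount; };

// Logical &&: short-circuits, typically one branch per predicate.
std::size_t count_logical_and(const std::vector<Tuple>& t, int q, int d) {
    std::size_t count = 0;
    for (const Tuple& x : t)
        if (x.quantity < q && x.discount < d) ++count;
    return count;
}

// Bitwise &: both comparisons are always evaluated, typically
// leaving just one branching condition.
std::size_t count_bitwise_and(const std::vector<Tuple>& t, int q, int d) {
    std::size_t count = 0;
    for (const Tuple& x : t)
        if ((x.quantity < q) & (x.discount < d)) ++count;
    return count;
}

// Fully predicated: no branch at all; the 0-or-1 result is added.
std::size_t count_predicated(const std::vector<Tuple>& t, int q, int d) {
    std::size_t count = 0;
    for (const Tuple& x : t)
        count += (x.quantity < q) & (x.discount < d);
    return count;
}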
                                         
                                         And well, if we're using a query compilation,
                                         
                                         so if we're generating code
                                         
                                         while we're executing the queries or when we already have some statistics
                                         
    
over the data, then the generated code can actually use a cost model to find out which variant makes sense. So then we can basically see, and this plot is basically from a paper by Ross.
                                         
You can see, if we're just using the no-branch version,
                                         
                                         as we would expect, we get this very clean flat line
                                         
                                         because we always do the same kind of work.
                                         
                                         We're looking at each individual tuple,
                                         
                                         we're performing something.
                                         
                                         If we have a very low selectivity,
                                         
    
then this bitwise AND will basically be better.
                                         
                                         
                                         And then the regular AND will basically
                                         
                                         be lower because then the branch prediction will be good.
                                         
    
                                         We're doing less work.
                                         
                                         We don't have to work on every tuple.
                                         
The compiler or the CPU will just do less work. And the bitwise AND might be a bit better in some cases, especially in the cases where the regular AND would be slower. But it will also have more overhead than the predicated version at a higher selectivity. And let me see, I actually have an example for this somewhere; I don't remember where I wanted to put this. I have it later, so let me show this later. Okay. Also, the CPU can do some rescheduling and some out-of-order execution. It does so by having some reservation
                                         
                                         stations, right? So the instruction stream is basically read from the instruction cache.
                                         
                                         So it will be loaded into the instruction cache.
                                         
                                         And all of the instructions that are going to be executed one
                                         
                                         by one later on will then go into reservation stations
                                         
                                         for the proper units.
                                         
So all of them that need to do some memory access will go to the memory units, and the ones that do some floating point operations go to the floating point units, et cetera.
                                         
                                         And they will then be executed when the hazards are cleared.
                                         
And as soon as we have data hazards, as in this kind of count operation, we basically also have a limit on parallel execution.
                                         
                                         So this is a simple example.
                                         
    
So here again, we're basically going through an array and we're doing the predicated version of our count. Right. So we're going to execute this on every step, but there's not that much parallelism; we cannot execute this very parallel, because every addition depends on the previous value of the counter.
                                         
However, what we could do is add additional parallelism by splitting the array up, or the cursor up into two cursors, and then running through the array on two different parts in parallel. And that gives us some additional parallelism here, because we directly say that these are independent parts of the data; they won't basically be touched by anything else.
                                         
                                         So we'll have basically half the amount of loop steps
                                         
                                         with twice the instructions.
                                         
                                         And these instructions can basically then be reordered
                                         
                                         and can be put into different kind of functional units
                                         
                                         as well.
                                         
So by giving the CPU this, it might use a bit more parallelism, as in the sketch below.
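A hedged sketch of this loop splitting (names are illustrative; the lecture's slide code may differ):

#include <cstddef>
#include <vector>

// Two independent counters remove the single loop-carried dependency
// on one count variable, so the CPU can schedule the two additions
// onto different functional units in parallel.
std::size_t count_split(const std::vector<int>& data, int n) {
    std::size_t c0 = 0, c1 = 0;
    const std::size_t half = data.size() / 2;
    for (std::size_t i = 0; i < half; ++i) {
        c0 += (data[i] < n);          // first half of the array
        c1 += (data[half + i] < n);   // second half of the array
    }
    if (data.size() % 2 != 0)         // possible odd element at the end
        c0 += (data.back() < n);
    return c0 + c1;                   // sum up the partial counts
}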
                                         
                                         In practice, I might show you later, or let's see, either next time or this time, I will show you,
                                         
                                         in practice, the CPU or the compiler does this automatically for you, this kind of optimization.
                                         
                                         But if the compiler doesn't do it, this will actually give you some performance benefit.
                                         
                                         And then the CPU will basically reorder the instruction in the reservation stations and
                                         
                                         will give you a bit better performance.
                                         
                                         So say, for example, we had our earlier example where we have the initial branched version,
                                         
                                         then we have the predicated version that's somewhere here,
                                         
    
                                         and this might give us a bit better performance
                                         
                                         because of this additional reordering.
                                         
                                         And this is something we'll briefly try,
                                         
                                         and then we're gonna stop for today.
                                         
                                         So I'll show you, it's basically the same code here.
                                         
So I'm starting by filling an array with a certain number of values with random data, and counting basically the number of instances that are less than half of the max int value. So a random generator gives us values from zero to max int, and we're counting the ones that are less than half of that. And initially we have this simple branch here, and I'm doing an initial count just to warm up the cache. This doesn't actually help that much, but theoretically it could. Then we're doing this regular branched version, then we're doing a predicated version here, and finally we're doing a final version with these two individual counters, where we then basically sum this up afterwards. And now, I
                                         
                                         mean you can already see the result here. We can see that the branched
                                         
                                         version is actually much slower than the predicated version. And the two predicated versions,
                                         
                                         they're almost the same. So there's not that much difference here.
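A hedged reconstruction of what such a demo can look like (array size, seed, and names are guesses, not the lecture's actual code):

#include <chrono>
#include <climits>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // Fill an array with uniform random ints in [0, INT_MAX].
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, INT_MAX);
    std::vector<int> data(10'000'000);
    for (int& v : data) v = dist(gen);

    const int threshold = INT_MAX / 2;   // roughly 50% selectivity

    // Small helper that times one counting variant.
    auto time_it = [&](auto&& fn, const char* name) {
        auto start = std::chrono::steady_clock::now();
        std::size_t result = fn();
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> ms = end - start;
        std::cout << name << ": " << result << " in " << ms.count() << " ms\n";
    };

    time_it([&] {                        // branched version
        std::size_t c = 0;
        for (int v : data) if (v < threshold) ++c;
        return c;
    }, "branched");

    time_it([&] {                        // predicated version
        std::size_t c = 0;
        for (int v : data) c += (v < threshold);
        return c;
    }, "predicated");

    time_it([&] {                        // predicated, two counters
        std::size_t c0 = 0, c1 = 0;
        const std::size_t half = data.size() / 2;
        for (std::size_t i = 0; i < half; ++i) {
            c0 += (data[i] < threshold);
            c1 += (data[half + i] < threshold);
        }
        return c0 + c1;
    }, "predicated, split");

    return 0;
}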
                                         
But this is only, I mean, you can see the effect; however, the compiler does this for you as well if you're using optimization parameters. So here I basically have the same code, I'm going to compile this again with optimization flags, and I'm going to run the same code here. And you can see, well, the results are identical in this case. And this is because the compiler actually,
                                         
rather than leaving this part as a single branch, will actually do some predication here.
                                         
                                         And it will also unroll the loop.
                                         
                                         So it will basically make multiple instances
                                         
                                         of this operation.
                                         
    
                                         And then the CPU can just, as in this case,
                                         
                                         where we ourselves make multiple instances of this instruction
                                         
                                         or this statement, the CPU can also reorder this as it
                                         
                                         pleases and can basically use this additional parallelism
                                         
                                         here. And how you can see this, I will show you next time.
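For reference, the two runs correspond to compiling without and with optimization, for example something like this (the file name is a placeholder):

g++ -O0 count_demo.cpp -o count_demo
g++ -O3 count_demo.cpp -o count_demo_opt

With -O3, compilers such as GCC and Clang will typically predicate and unroll this kind of counting loop on their own, which is why the variants end up performing the same.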
                                         
                                         For this time, I would say we're done here.
                                         
                                         Do we have any questions?
                                         
                                         Yes.
                                         
    
So about the misprediction from the branch prediction: maybe the CPU then needs to have a copy of all the registers, or the data it had at that time? In case it went with the wrong prediction? So that it has a copy of all the registers?
                                         
                                         So if it does the wrong prediction, yes, I think so.
                                         
So it will basically then restore whatever state was there before and then start with the other branch. And this is why this takes so long, right? On the one hand, it does more speculative execution already until then. And so that
                                         
    
                                         takes time, but then it sees the branch condition is wrong.
                                         
                                         So then it needs to go back and start basically from scratch
                                         
                                         where it was.
                                         
                                         It needs the same state again.
                                         
                                         Yes.
                                         
                                         OK.
                                         
                                         Other questions?
                                         
                                         No other questions?
                                         
    
                                         OK.
                                         
                                         Next time, we're briefly going to talk about prefetching and then finally start
                                         
                                         with vectorized execution with SIMD.
                                         
                                         Thank you very much.
                                         
