Hardware-Conscious Data Processing (ST 2023) - tele-TASK - CPU and Caching & Instruction Execution
Episode Date: May 3, 2023...
Transcript
So today, later today, we're going to talk about instruction execution.
Before that, we want to come back to the CPU and caches, and especially data layout and
alignment today.
But before we do so, well, I've already talked about the code examples. I've not updated
them yet, but there will be more.
But then there was a question in the forum.
And because we discussed it just right now,
I want to also answer that right here.
And that's regarding recording the tasks.
So next week, we're going to start the SIMD task.
And the setup part and everything, we will record,
meaning everything about how to set things up, how the tasks are run, et cetera,
all of that, and also the task details for the first task, we will record.
However, all of the solutions, that part we're not going to record.
So wherever we show you how the task would have been solved,
our code, et cetera, that part won't be recorded,
so that other people and future students can also enjoy the task
just like you did.
So that's the idea.
Then another thing, since we're on the topic of recordings:
I managed to leave the mic off
for the second half last time.
That means we missed half of the recording from yesterday. We're just going to
post the link to last year's recording in Moodle for those of you who enjoy this asynchronously.
So you can see whatever I told you yesterday from last year, basically.
Exactly. So, that's basically it. Just as a reminder, so we're still in the CPU realm.
And as I said, we already moved stuff a bit. So, next week, we're going to start with SIMD on Tuesday and then have the task on Wednesday.
But, as I said, we'll also record this.
And otherwise, well, everything is as is.
We're in the instructions right now.
But, as I said, we're going to move back to the CPU cache first.
And this is where we left off.
So we talked about the different kinds of caches
that we have, some cache performance numbers,
and also how to calculate cache performance,
and then the translation look-aside buffer,
because that's all stuff
that you need to be aware of if you're writing really high-performance code.
I mean, you don't really have to write your code for it, but you will see the effects depending
on how you write your code.
And well, here, kind of closing up this cache topic, not completely,
but as one kind of overview example,
here's an example of SPECint performance numbers.
I don't actually
remember who performed these experiments,
but you can see instruction cache misses and L2 cache
misses on some CPU, some server, probably an x86 server CPU,
measured in misses per 1,000 instructions.
And as we said last time, 10 misses per 1,000
instructions would correspond to a cache efficiency of about 99%,
just as we calculated, if you remember.
And we can see that for most programs,
the miss rate is actually much lower than that.
So we're in the less-than-1% realm,
more like 5 per mille or even less than that,
so even better cache efficiency
for many programs.
And SPECint is one of these benchmarks.
It's basically a suite of different small,
or not even really small, programs that will be run
and then profiled,
so some
performance measurement happens.
So stuff like GCC, the compiler, or gzip, so some standard programs, let's say, that
you would run.
And this is used to test the performance of CPUs.
And well, we see that most of these tasks,
except for this MCF one, which is typically bad,
and I don't actually remember exactly what it does,
but it's really bad in terms of caching,
so, all other than that, most programs
have quite good caching behavior.
And in comparison to that, if we now look at something like TPC-C,
and we run this on a standard database,
so TPC-C is one of these OLTP benchmarks, which I told you
about, which just does small transactions.
So you have warehouses, and you have things that are bought and sold
and stocks that need to be managed. So different kinds of transactions, which just touch small
parts of the data. And you can see that on the one hand the L2 cache misses are quite high in comparison to most of the other programs. But even worse, the instruction cache misses are really bad as well.
So here we have poor performance in terms of the instruction cache.
And the question is, of course, why?
So why does this happen?
And so why do database systems have such poor cache behavior
in general?
And I told you all the way in the beginning
that classical database systems, MySQL, Postgres, DB2, et
cetera, in their original form were all built
around this bottleneck of disk.
It's all about disk.
Because of that, because disk is so dominant, it doesn't really matter what happens inside
the caches.
So you can execute a lot of instructions and do a lot of work before you see a problem
in the CPU if you're waiting for disk all the time. And because of that, you basically
can use, well, let's say, very fancy ways of programming,
lots of libraries, et cetera, and make your code quite,
let's say, convoluted before you notice any problems.
However, if all of a sudden your data fits
into memory, you will see these effects: your code
behaves poorly in terms of cache performance. And this is partly because there's
poor code locality, and that's basically the problem with the
instruction cache, right? So, for example, you're resolving attribute types
for each tuple individually using
some polymorphic function, meaning basically
the code jumps around all the time
for each individual attribute.
You're going to completely different code paths.
And if your code base is large, and think about something
like Oracle, it's in the gigabytes. Whatever you install, it's not, I don't
know about HANA, you might actually know what SAP HANA is. Have you ever touched
that? You don't remember? Anyway, its code base is also not going to be in the
kilobytes. It's going to be larger.
And of course, this is optimized to some degree.
But say, classical SAP systems, also DB2, et cetera,
these are huge systems.
And the code base is not just, well, whatever.
But even the core code base for just operating
on the individual attributes will already be large.
And so basically the system has to go a lot of different code paths while executing.
And not just because of different attributes, but also because of the volcano iterator model.
So this is the traditional way of processing data in a database, tuple by tuple, in an iterator fashion.
So you all know the software engineering
pattern of an iterator, and where you basically say,
get next, get next.
And this is the way, and you will see this later,
also in a separate session,
how traditional database operators are executed.
Basically, you stack them on top of each other, each having an iterator interface,
and then each operator calling the next one below it.
And this means each individual tuple will basically pass through this query plan individually, going through a lot of code,
a large code base, which also means a lot of functions that need to be
called and a lot of code that needs to be loaded, and that won't be cached in the end
if it's too much, or it won't fit into the cache, at least not into the L1 instruction
cache.
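To make the iterator idea a bit more concrete, here is a minimal sketch of what such an operator interface could look like; the class and function names are made up for illustration and not taken from any particular system.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical tuple type: one row of a table.
using Tuple = std::vector<int64_t>;

// Minimal Volcano-style operator interface: every operator produces
// its output one tuple at a time via Next().
struct Operator {
    virtual ~Operator() = default;
    virtual std::optional<Tuple> Next() = 0;  // std::nullopt = end of stream
};

// A selection sits on top of its child operator and pulls tuples through it.
struct Selection : Operator {
    Operator* child;
    bool (*predicate)(const Tuple&);

    Selection(Operator* c, bool (*p)(const Tuple&)) : child(c), predicate(p) {}

    std::optional<Tuple> Next() override {
        // One (virtual) call per tuple into the child, for every operator in
        // the plan: lots of jumps through a large code base.
        while (auto t = child->Next()) {
            if (predicate(*t)) return t;
        }
        return std::nullopt;
    }
};
```

Every tuple is pulled through a chain of virtual Next() calls across the whole plan, which is exactly the instruction-cache-unfriendly behavior described above.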
So that's the one problem.
And then the other problem, and that's always going to be a problem to some degree,
is we have to work with large amounts of data.
If we're talking about an analytical database,
well, we have to go through the whole table.
If we're trying to find something in a large table,
it's not indexed, or we just want an average over the year
or something, we have to go through the whole table.
It's always going to be a lot of data.
And so because of that, we'll also always
see some poor caching behavior, or much poorer caching behavior
than something where we just have a small working set
that we're just going to iterate over and over again.
So I don't know.
Think about a machine learning model
that is constrained in size.
If you're working on that, that might actually
fit into your memory or even into your caches
if it's a small model.
You do a lot of operations on that,
so access is going to be very local if the data sets are not too large, et cetera. In a database,
we assume we're going to have a lot of data, and we're going to go through it
over and over again, mostly reading a lot, and that's going to be slow.
If we're using an index, we don't have to read as much data,
but it means we're going to randomly access data.
And again, this will have poor caching behavior
because we basically jump somewhere randomly
in the large amount of data.
And basically, the cache won't help us that much.
However, we can improve stuff, right? So it's not that there's nothing we can do just because it's
large amounts of data that we have to go through. We can still do something with proper
data layout and with proper code management. I mean, optimizing your code is always an option, basically.
And you can and you will learn to some degree how you can optimize
for cache performance in terms of data structures, for example.
So you can build your data structures in a way that they have
better caching behavior or worse caching behavior, and a lot of it is all around: okay, how much
can I fit into my cache lines? Will I read or use all of the data that's within
the cache line or not? How much of the data fits into my cache line anyway?
Or, if I have to use multiple cache lines, does it fit into the cache at all?
So navigating this makes a lot of sense.
Also, how to access the data in the end.
So do I just do random jumps,
which means I'm maybe just using small parts of each cache line?
Or can I use everything that I have in a cache line?
So that's something where we can work on.
And all systems, not only database systems, favor cache-friendly code. And one particular thing, and that's maybe something that we don't
teach that much in this course, because you're going to optimize a lot for certain types of
CPUs or for a specific type of CPU: this is highly platform-specific and highly CPU-specific.
So cache sizes, etc.
To get the best performance, you have to be really, really specific with your code.
In practice, you don't always want that.
In practice, you actually want to make sure that your code is reusable and can be used
on other systems as well.
You always need to do this kind of trade-off.
How much do I optimize for a certain platform or for a type of platform?
How much do I, by doing so, limit my potential for reusing the code on a different platform?
That's just something to keep in mind. If you're optimizing just for one cache size, just
for one block size, for one associativity, et cetera,
then your code is highly specific
and will perform for sure not as well on another platform,
maybe even really poorly on another platform.
And this is especially the case if you're setting hard thresholds.
There's hope.
I mean, there's some ways to mitigate these problems
and we'll also talk to some degree about this.
But there's something to keep in mind, right?
If you're building systems,
don't over-optimize for a single platform.
And ideally, you try to write generic code.
You try to make your code somehow adaptable.
But there are basically some rules of thumb
for how you can always get good performance.
One is to work on a small working set,
so that you have temporal locality.
Another is to use small strides,
so try to keep spatial locality,
to work on data that's spatially close.
That will always give you better performance
than if you have no temporal locality
and no spatial locality.
And then, if you have a loop, basically
try to focus on the inner loop. That will also give you the most performance. But in general,
and that's also something that can happen quite easily: don't optimize prematurely, right? So
if you have one loop where you think you can optimize the hell out of
it, make sure that this is actually a loop that counts, so one that actually has an impact on
your overall performance. So before doing micro-level optimizations, check if this is actually a
hotspot in your system. If everything is bound by something else, well, you might spend your time
better on another part that you have a problem with rather than that.
Or if your application is super slow, super inefficient,
often you can get a much better performance just
by changing the application code a little bit.
Or say, for example, you get really crappy queries
and you don't have a query optimizer,
then it doesn't really make sense to improve your execution engine if you're just working on crappy queries,
right? So again, you always need to make a trade-off
and make sure that you're focusing on the right bottleneck. And
that's something where you should always
do this kind of benchmarking or profiling.
We'll have a session on profiling
where we'll talk a bit more about this, how you can actually
do this yourself with the code.
And on the other hand, generally, it's also always
good to do some back of the envelope calculations
if this actually could be the case,
if this could be the performance bottleneck or not.
So with that, let's talk a bit about data layout, which
is also something where we can improve,
like I say, caching behavior quite easily
by doing some tweaking.
And there's some trade-offs.
And this is basically also how to directly bring some of the stuff
that we did so far to database systems, right?
So far, I mean, this was about how the CPU caches work, et cetera.
That's more generic and works the same for all kinds of systems.
But data layout, this is kind of very specific.
OK, so let's talk about caches for data processing
or for database systems.
And the question is, how can we improve data cache usage?
And for this, we basically have to look at different data
storage models, data layouts, and query execution models. And
the query execution models, as I said, we'll look into more deeply in a separate
session, but data layout is something that's directly related to what we've seen so far.
And say we have a simple query; this would be on the TPC-H dataset,
where lineitem is the big fact table, one of the two big fact tables in TPC-H, and the biggest one.
And here we're just going to count all of the line items, so individual rows,
basically, in the table, which were shipped on a certain date,
where the date would be the 26th of September in 2009.
And if we don't have any index, if we don't know anything else about the table, then this
means we have to do a full scan.
So I mean, potentially, we have this sorted by date or something, then we could reduce
the amount of data that we have to look through.
But in general, there's no guarantee about this.
If we don't know anything else, we have to do a full table scan.
And there's basically two approaches how to lay out this data,
in storage, in memory as well.
And then of course there's some hybrid models as well.
We'll also talk about this.
And this is, more or less,
the same as we had earlier with the array layout.
So we have a row-based layout and we have a column layout. This is the same thing that
we talked about earlier: I can store an array, or a table, either row by row or column
by column. Row by row is good basically if we just want to write the data, right? Or if we need
the whole row, or if, like in the array example, we needed to go through the complete rows one by one,
then a row-based layout makes total sense.
And this would mean, this is also
called an N-ary storage model, or NSM.
And this basically means we're storing
all of the attributes of a single tuple one by one
within our pages, and then the complete tuples
actually next to each other.
The other way of storing this, just basically column
by column or attribute by attribute,
also called decomposition storage model,
would mean we're actually decomposing the tuples,
and that's why it's called like that,
and just storing the individual attributes separately.
And well, this is basically good
if we're just looking at a single attribute, right?
Makes total sense because then we have
all of the information there,
but it's more costly if we have to reconstruct the tuple.
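As a small sketch (with made-up field names, not the actual TPC-H schema), the two layouts could look like this in memory, and the count query from before only needs to touch a single column in the decomposed layout:

```cpp
#include <cstdint>
#include <vector>

// Row-based layout (NSM): all attributes of a tuple are stored together.
struct LineitemRow {
    int64_t order_key;
    int64_t quantity;
    int32_t ship_date;  // e.g. encoded as days since some epoch
    double  price;
};
using RowStore = std::vector<LineitemRow>;  // one tuple after the other

// Column layout (DSM): one densely packed array per attribute.
struct ColumnStore {
    std::vector<int64_t> order_key;
    std::vector<int64_t> quantity;
    std::vector<int32_t> ship_date;
    std::vector<double>  price;
};

// Counting tuples with a given ship date only scans one column:
// every cache line that is read is full of relevant values.
int64_t count_shipped_on(const ColumnStore& t, int32_t date) {
    int64_t count = 0;
    for (int32_t d : t.ship_date) count += (d == date);
    return count;
}
```

In the row store, the same scan would stride over whole LineitemRow structs and pull all the other attributes through the cache just to look at ship_date.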
So say for example, looking at our example here,
just looking at an individual attribute,
the decomposition model would be faster.
If we need all of the attributes, say, for example,
rather than a select count star
we say select the whole tuple, right,
select star, then we would have to reproduce, we would have to rejoin these individual
subsets, and then it becomes this kind of trade-off.
And so they are also called row stores and
column stores, depending on how they store the data on disk or in memory.
If we do a full table scan in a row store, this means one tuple is next to the other.
Again, thinking about everything in memory, we would have our virtual memory.
Everything is nicely laid out as one long array, basically,
where we have one tuple after the other, neatly packed.
However, the tuples have different kinds of attributes,
with different lengths,
maybe of fixed size, maybe of variable size.
But then we have this ship date, which is actually
what we're looking for, somewhere in there.
If we're lucky, it's at a fixed position.
So it's somewhere in there in the middle of our tuple here.
Now, the problem is what we're actually going to read when we're doing our scan.
These are the cache block boundaries, as we know.
So if everything is in memory, we're going to read
individual cache lines into the caches, and we're going to work on these cache lines.
And here we see that we don't know exactly where things are.
Maybe we have something where we know where these tuples start, or not.
But in general, if we're just scanning through, we don't know exactly which part we are interested in.
So this means we would actually have to read a large amount of irrelevant data to find the individual ship date values
that we're looking for.
But everything else from the tuple is irrelevant,
but we still need to read it to find this ship date
and work with that.
Even if we're just looking at the individual cache lines
that we're interested in, there's still
a lot of information that's completely irrelevant to us.
So that means the CPU will wait a lot just reading all this
data just to find out, well, this is not relevant for me.
This is not relevant for me.
Maybe this piece of information is relevant for me,
but even the rest of the cache line is not relevant.
If we have a column store layout, so we're basically
decomposing the individual tuples into their attributes,
then we have pages in memory that look like this.
So we have one page that just is full with ship dates.
And this means if here then we have
to read the individual cache blocks or individual cache lines,
then all of the information that we're reading is actually relevant for us.
So this means in this case, we're going to be much faster in processing the data.
The cache will be much more efficient just because of the layout.
However, of course, if later we have to reconstruct the tuple,
this will again be more costly, because we'd have to read all of the individual parts and
join them back together. In the original row store layout,
we still have the full tuple. So if, say, for example, all of my tuples
match and I have to reproduce the full tuple,
then the row store will be cheaper again, because I
have all of the information there
and I can output it directly.
Well, in this setup, I first have to find the relevant tuples
quickly, but then I have to reproduce everything else.
I have to read everything, like all of the other attributes, and I have to
recombine this.
However, as I said, it's a trade-off, but typically,
I mean, we have some selectivity, and then this makes a lot of sense, right?
And if we're not looking at the complete tuple, but say, in our count star, we can do the count
star just on the ship date, right?
So all that we need for this query, for example, would be just looking at these pages, and we
can completely calculate the result.
And also think about the cost of reading this.
We're still looking at each individual tuple here,
each individual attribute, only once.
This means we don't get good caching behavior per tuple. However, we're reading complete cache
lines, so we're reading multiple values at the same time, and the same cache line will be used
multiple times in our processing. So we have much better caching behavior than we had earlier,
and not all of our accesses
will be cache misses.
Because in the first case,
each individual access, like every cache line,
contains at most a single relevant attribute value. This means I won't have it in the cache,
unless I'm using some kind of prefetching; but if I don't use prefetching, I won't have it in the cache.
That means I will have to read it when I require it, and that means every single access will be a cache miss there. In
this case, at least for the number of attribute values in a cache line, we will
have cache hits and better caching effects. And if
we're lucky, the full data might even fit into the caches, and then we actually get
much better,
very nice performance.
And for some smaller tables, that might actually work.
Okay, so the trade-off, and I already hinted at that, is that we have this recombination cost, right?
So we have to recombine the tuples in the end.
We need to perform all these joins
to basically get the final tuples again.
And the trade-off here is really dependent on the workload.
If we're just counting, for example,
there is no cost at all, because we can optimize this away; we can
just do this on the ship date. If we need the full tuple in the end, it's different, and a lot of this will again be
application-specific, right? So some of this you can decide in the application:
do I really need to count in the application, or do I send the count to the database? Then the database could be much faster, for example.
However, there's also ways to kind of get
this hybrid approach.
And one way is PAX, the Partition Attributes
Across layout.
So rather than having the individual attributes
in completely separate pages, we can have mini-pages inside a memory page, keep all of the
attributes of the tuples of a page within that page, and basically group the attributes into these mini-pages in a
way so that we still get the cache performance. But we also have locality within the memory pages, so that we don't
run into the problem that pages for some parts of the tuples might be swapped out somewhere,
which would give us additional problems.
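A very rough sketch of the PAX idea (capacity and attribute names are made up): one page holds the mini-pages for all attributes of its tuples.

```cpp
#include <cstdint>

// Hypothetical PAX page: the tuples of this page are decomposed into one
// mini-page per attribute, but all mini-pages live in the same memory page.
constexpr int kTuplesPerPage = 256;  // made-up capacity

struct PaxPage {
    // Scanning ship_date touches only this mini-page, so cache lines are
    // filled with relevant values, just like in a column store ...
    int32_t ship_date[kTuplesPerPage];
    int64_t quantity[kTuplesPerPage];
    double  price[kTuplesPerPage];
    int     num_tuples = 0;
    // ... while reconstructing tuple i only needs this one page:
    // {ship_date[i], quantity[i], price[i]}.
};
```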
Right, and then of course, rather than saying, okay, I want this hybrid model just within a single page, with column layout and whatnot,
I can also say, well, why not just decompose the data set into two parts?
I have a column store layout and I have a row store layout for different parts of the data. And typically, say, for example, new incoming data:
it's much faster
to write it in a row layout.
Or if I have to iterate in an OLTP fashion over this,
so I have to use the whole tuple,
I have to update stuff in there,
then the row store layout actually makes sense.
Or if I delete tuples, right, then I just have to touch this one page rather than
everything.
So then I can basically say I store the new data,
new incoming data in a row layout,
and all the data that's not going to change much anymore,
I'm storing in a column store layout.
And this is what typical column stores today actually do.
So SAP HANA, for example, or C-Store,
whose commercial version is Vertica,
I think.
So systems that use this column layout
basically
have this kind of setup, where they
have the new data in a row store and the older data
then in the column store.
So with that, there's more we can do.
And one thing that we also can, to some degree, at least
think about is alignment.
So basically, any kind of data access
can be aligned or unaligned.
And that basically means: do we read the data
in a way that is aligned for the CPU, so that it can easily be put into the registers,
or will the CPU have to shift it around first?
So in the past, basically in older architectures, everything needed to be aligned. So say I'm having an integer,
which is four bytes, right?
Then it needs to start at address zero
or at a multiple of four,
so anything where the address modulo four is zero.
It cannot be shifted somehow,
because otherwise the CPU
would have to somehow shift it back
to have it in the right position,
and old CPUs wouldn't be able to do this.
So if you think about memory in this way,
basically, if we're having values that are 4 bytes,
for example, then anything that starts at address 0
or at an address that is 0 modulo 4 would be aligned.
Everything in between would not be aligned.
And anything not aligned basically needs some extra work from the CPU to move it back into an aligned position
to then have it easily copied into the registers, et cetera.
This basically depends on the word size and the cache line sizes.
The CPU, and also the compiler, will do automatic padding or automatic
alignment to fix this.
However, it takes some time.
Modern CPUs put a lot of work
into mitigating these problems.
However, if you know that this happens,
you can actually get some additional performance.
So let's look at the typical structure representation.
So here we have basically a very simple translation
of something like a table.
Say, for example, we have four integers, a long,
and then a next pointer.
So this could be something like an index structure
that we're using.
And this is just then a block of memory.
And the compiler will just translate this directly
into this kind of structure.
And all of the fields, the compiler
must basically translate them in the order
that we've described them in our struct.
So it cannot reorder this.
So this is basically what the compiler just adheres to,
so we're not getting into weird effects somewhere.
And now the compiler might need to do some padding.
And so the compiler basically determines the overall size
and determines the positions of fields.
And the machine-level program then has no understanding
of the structures that you actually defined.
So it will just see this area of memory.
And the CPU basically expects this to somehow be aligned,
meaning it starts at an aligned address, for example at address zero.
In this case here, basically
everything would be nicely aligned, so everything is fine.
But if something is shifted, so say, for example, we have one byte first, and then our integer
starts, the compiler or the CPU will have to realign this.
And the compiler will try to optimize this as well.
So on x86, Intel, for example, recommends
the data to be aligned.
So basically everything should be aligned,
meaning that, as I said, within the word size,
so in four bytes for an integer, for example, we
want to make sure that every integer starts
at one of these addresses.
And even if that's not the case,
the CPU will still work, right?
But it needs to do some extra realignment work.
So let's look at another example, maybe.
So here, we start with a similar data structure as earlier.
So we start with a character.
Then we have an integer array with two elements, and then we have a double.
And you can see in this case, basically, we're
starting at, again, at 0, for example,
or wherever we're starting.
Doesn't really matter.
In this case, typically, we're not aligned.
So first, we have this character,
and then we have the integers.
So these will not be aligned because we're not at address 0
but at address 1, for example.
And that means, ideally, we want to get some alignment again.
And what the compiler typically will do
is it will put some alignment there.
So it will put some additional space here.
If it doesn't, then the CPU
will have to do some alignment work in memory, will have to move it.
So the CPU might need to read multiple words or even multiple cache lines to get the correct word,
and then shift it back to have the right alignment,
so that in the end it can actually
work with these individual integers
in the right alignment.
And doubles, for example, need
to be aligned even more strictly, typically to eight bytes.
And if the compiler does this padding,
this will actually give better performance
than the unaligned data, because there the CPU
will have to realign.
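You can make the compiler's padding visible with sizeof and alignof. The numbers in the comments assume a typical x86-64 ABI (4-byte int, 8-byte double); other platforms may differ.

```cpp
#include <cstdio>

struct Example {
    char   c;     // offset 0, then 3 bytes of padding so the ints start at offset 4
    int    i[2];  // offsets 4-11, then 4 bytes of padding before the double
    double v;     // needs 8-byte alignment, so it starts at offset 16
};                // total size 24, alignment 8

int main() {
    // Typically prints: sizeof=24 alignof=8
    std::printf("sizeof=%zu alignof=%zu\n", sizeof(Example), alignof(Example));
}
```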
OK, so typically, what the compiler does
is the compiler maintains the declared ordering.
So as we said, it won't do any reordering.
You can do reordering.
You can just change your struct if you want,
but the compiler cannot.
And so what the compiler does, it can insert some padding.
So it can insert this kind of empty space here
in order to make sure that all of the data or the elements
in the struct are properly aligned.
Then the overall struct should be aligned or must be aligned according to the largest
field and then the total struct size also must be a multiple of its alignment.
And again, the compiler will insert some padding here.
For strings and other variable-length data, the field will be split into a fixed-size length part and a variable-size tail. The header will contain a pointer to the tail,
and the variable data will be placed at the end of the struct.
So the fixed part basically will be positioned in the same position
as the order of the struct,
but then the variable part will be put towards the end.
So now, what we can do, if we want,
is basically make sure that we align our data,
or reorder our data, so that it's better aligned in memory
and the compiler doesn't have to do as much padding.
A simple example would be here:
instead of starting with the character,
we can start with the integer and then add the characters later on,
which then gives us a better alignment, a tighter representation in memory, and less padding.
So rather than using 12 bytes, because we want to have the integer aligned at a multiple of 4 again,
so at position 4 in this case,
we have the integer at the beginning,
and the characters, since they're single bytes,
can be tightly packed,
but the struct overall still needs to be padded to a multiple of its alignment.
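Here is a sketch of what such a reordering could look like in C++ (field names are made up; the sizes assume a typical x86-64 ABI):

```cpp
#include <cstdio>

// Declared order char, int, char: padding after each char.
struct Padded {
    char a;  // offset 0, then 3 bytes of padding
    int  i;  // offset 4
    char b;  // offset 8, then 3 bytes of tail padding
};           // sizeof == 12 on a typical x86-64 ABI

// Same fields, integer first: the single-byte chars pack tightly.
struct Reordered {
    int  i;  // offset 0
    char a;  // offset 4
    char b;  // offset 5, then 2 bytes of tail padding
};           // sizeof == 8

int main() {
    std::printf("%zu vs %zu\n", sizeof(Padded), sizeof(Reordered));  // e.g. 12 vs 8
}
```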
And, well, you can test this.
For this, I actually don't have any example here.
But say you can have a simple struct
with different kinds of alignments
with padding and reordering.
And the differences, accordingly, again,
this I didn't test, but it's from Andy Pavlo's course,
the difference can be significant in the throughput
that you get.
So basically, with no alignment, you're below half a megabyte of throughput.
With padding, you can get 11 megabytes of throughput in memory.
And with the reordering, even more than that. So making this efficient can have a significant impact here.
Again, this one I didn't test.
I just used it from the lecture, actually.
But something you can try.
So write a small program and test it.
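If you want to try this yourself, a minimal micro-benchmark could look like the sketch below. It uses the compiler-specific but widely supported #pragma pack(1) to force an unpadded, misaligned layout and simply sums one field over a large array; the struct and function names are made up, and the numbers you get will differ from the ones quoted above. Note that recent x86 CPUs handle unaligned loads quite well, so the difference may be small there and larger on simpler architectures.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

#pragma pack(push, 1)
struct Packed {         // no padding: the int64_t field ends up misaligned
    char    tag;
    int64_t value;
};
#pragma pack(pop)

struct Aligned {        // default layout: the compiler pads so value is 8-byte aligned
    char    tag;
    int64_t value;
};

template <typename T>
double scan_gb_per_s(const std::vector<T>& data) {
    auto start = std::chrono::steady_clock::now();
    int64_t sum = 0;
    for (const auto& e : data) sum += e.value;  // touches the possibly misaligned field
    auto end = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(end - start).count();
    std::printf("checksum %lld\n", static_cast<long long>(sum));  // keep the loop observable
    return data.size() * sizeof(T) / secs / 1e9;                  // GB/s of data scanned
}

int main() {
    const std::size_t n = 50'000'000;
    std::vector<Packed>  packed(n);   // value-initialized to zero
    std::vector<Aligned> aligned(n);
    std::printf("packed:  %.2f GB/s\n", scan_gb_per_s(packed));
    std::printf("aligned: %.2f GB/s\n", scan_gb_per_s(aligned));
}
```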
Yes?
If reordering has such a big impact,
why isn't the compiler able to do so?
Is it too hard for the compiler to reason about it?
Well, it's a good question.
I'm assuming this has something to do with compatibility.
So if you're basically not sure anymore how the data is laid out in memory,
then you cannot work with the data from different parts of your code,
or something like that.
So that would be my assumption.
Because that would basically mean you cannot reason at all
anymore about how the data is laid out in memory.
So you're writing some code, and whatever comes out,
how the data is in memory, you don't know anymore.
So that's my assumption.
That's why there's basically a clear way,
a clear translation from a struct to how it's laid out in memory,
so that you're sure that if you want to access this again,
and you compile something else, it will also
be able to access it. That's my assumption.
OK, other questions?
And a lot of this you can actually also test.
I'll show you in a bit how you can play around with this
as well.
Okay, so this basically sums up the computer architecture and gives us a nice point to do, let's say, a three-minute break.
Okay, I guess we can get started again.
So, instruction execution.
So, we're still in the CPU architecture. Now this is more about how does the CPU actually execute
instructions internally.
So I want to talk about micro operations and pipelining
and then pipelining hazards.
Basically, when does this not work well?
What works well with modern CPU architectures and what does not
work well, at least to some degree.
And some of the current, I'm going to show you some of the current pipeline architectures
if we get that far.
If not, we're going to see it next time.
So well, let me see. Yeah. So let's get started.
So basically, if you think about assembly code that you could write, so later on we'll
see, in the SIMD task, for example, there are these intrinsics, which appear to be direct instructions for the CPU.
But the CPU actually breaks them down into even smaller instructions that will be executed.
Any kind of instruction that you can think of will be broken down into micro-operations and executed in pipelines. So it's not just "I'm doing an addition now",
for example, but this will be broken down
into smaller operations.
Say even for an addition, for example,
you first need to prepare the CPU for this,
like the execution pipeline, saying this will be an addition now,
and then you need to load the registers.
You have to perform the
execution of the instruction, and then you have to write the result back. And these are basically
the phases. And this is simplified, so modern CPU architectures might have longer, like five steps,
six steps pipelines and some additional connections in there. But this is, let's say, the basic theoretical concept.
So we're starting with the instruction fetch cycle.
This is where we're getting the next instruction, basically.
So what will be executed next?
And we're updating the program counter.
So the program counter basically tells us
where are we in our program right now
and basically loads the correct instruction.
Then we have the instruction decode and register
fetch cycle, which could also be two individual steps.
But here, for simplicity, and some architectures do it like this,
we're basically saying, well, we're getting the instruction,
now we have to translate it to something that the CPU understands.
And we're reading the registers.
This is typically done in parallel.
So we're basically trying to find what we need,
basically the data that will be used.
And this is done from the registers.
And this also might evaluate the branch conditions.
So if we have something like an if statement in there or a for loop, then the instruction
decode and register fetch cycle will basically say, well, this branch, we're going here or we're
going there right now.
Then we have the execution and effective address cycle.
So here, say, for example, we have all of the data in our registers that we want to
work on right now.
Then we're telling the ALU, so the Arithmetic Logical Unit,
please add up these two values,
or whatever is in the registers,
and write it back to that register, for example.
Or, if we have a load or a store,
then we would do a read or write from memory in this phase.
So this would basically also mean, well, I
don't have the data in the registers yet,
I need to get it from memory.
If it's in the cache, in this case,
we'll just read it from the cache.
If it's not in the cache, then this means, well,
we're going to go to the memory unit and ask the memory unit, please give us this data.
And that's basically where we are right now.
So then, in the last step, we have the writeback cycle.
This basically means writing the result to the register file.
So we're writing, say, our addition result
into the correct register,
or we write the data that we read from memory
into the register that we want,
or we write the result that we got from the ALU
back there.
And these parts are also, whoops, this is too fast.
These four parts
are also split up into frontend and backend.
This is also what you see
if you do some profiling.
You might see frontend and backend stalls.
And this basically means,
well, the frontend is basically all about the code.
So if your code is not properly working,
or you have some self-modifying code or something like that,
then you might see problems in the frontend.
Or you have bad code locality;
that might be a problem.
In the backend, it's basically if nothing makes progress:
you have some contention there,
or you have bad cache locality,
everything needs to come from memory all the time,
you have stalls there; these would be backend problems.
And ideally, if you don't have any problems,
you see that all the time instructions are retired, meaning
you execute the instruction, the
instruction is done, it will be retired, and everything worked out nicely.
So now we have this pipeline, right? As I said, four, maybe more, steps. If we have non-pipelined execution, like in a very old kind of CPU, this would mean we execute one instruction after the other, and each of these steps would basically take one clock cycle. So this means we use
four clock cycles to execute one instruction, and then, when the next
instruction comes, we again have four clock cycles for the next instruction.
That means we're going to execute basically two instructions per eight clock cycles, or a quarter of an instruction per clock cycle.
And this is the instructions-per-cycle metric,
so 0.25 instructions per cycle here,
or four cycles per instruction.
So the CPI would be four, or the IPC would be 0.25.
Instructions per cycle you of course want to have as high as possible, and
this can go above 1 in typical modern CPUs, even single-threaded.
And you want as few cycles per instruction as possible,
and this can go below 1 in modern CPUs.
And we'll see why in a bit.
So if we do this in a non-pipeline fashion,
of course we're going to have a bad IPC or CPI,
because basically we're just doing one thing at a time.
But these individual steps in the pipeline, they are different parts of the CPU actually.
So loading and storing, writing to registers, etc.
This is something that we can do in parallel, also fetching instructions and decoding them.
We've seen all this additional stuff on the CPU
or the structures.
At least to some degree, we've seen this.
So this can be done separately, also working with the caches,
loading, et cetera.
So we can actually do multiple things in parallel.
And in this case, in these micro instructions,
we're using so-called pipelining or pipeline parallelism.
Meaning, we can maybe do only one instruction fetch at a time in one slot, but we can do one every cycle.
In every clock cycle, we can fetch another instruction.
And while the previous instruction is decoding, we can already fetch the next one.
And this basically goes on and on.
And that means we have a nice, fully loaded pipeline.
And we get an instructions-per-cycle count of one or, the same thing, a cycles-per-instruction count of one, right? This is
perfectly pipelined, and with a single pipeline, this is
as good as we can get. Of course, this won't always work, but ideally we get this as often as possible. And in an ideal world, basically, if we break down our pipeline into smaller stages,
say K stages, and this is for general pipelining, right?
Even if we do stream processing and put multiple operators on different nodes, or stuff like that, if we do pipeline parallelism,
ideally, the hope is that we get a throughput improvement by a factor of the number of pipeline stages, right?
However, of course, the slowest sub-instruction in this case determines the overall clock frequency.
So whatever is the slowest step here, that is the fastest that we can clock our CPU.
So we try to break down the instructions
into these equi-length parts, and then, of course,
try to reduce the number of cycles
it takes to execute an instruction.
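As a back-of-the-envelope calculation: if an unpipelined instruction takes time T and we split it into k equally long stages, then, ignoring stalls, n instructions take roughly the time below, so the speedup approaches k for long instruction streams.

```latex
T_{\text{pipelined}}(n) \approx (k + n - 1)\,\frac{T}{k}
\qquad\Rightarrow\qquad
\text{speedup} = \frac{n\,T}{(k + n - 1)\,\frac{T}{k}}
\;\xrightarrow{\,n \to \infty\,}\; k
```

This is exactly the factor-of-k throughput hope mentioned above, and it only holds if the stages really are of equal length and the pipeline never stalls.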
We always have to go through all the steps,
but the question is, do we have some stalls in there?
Does something need more time?
And say, for example, if we go to memory,
this will stall everything.
There we will need to wait.
But this, for example, this is actually truly parallel in here
already.
And this is also why we get this very high throughput already.
And we can execute multiple instructions at the same time.
At the same time, it also means that each instruction will not
just take a single clock cycle to execute.
This is also you have to think about.
It's not going to be one clock cycle and this is done.
It's going to be one clock cycle and one stage is done.
We'll need a couple of clock cycles
to actually in any way perform a single instruction.
OK.
Now this is nice, right?
We have this pipeline, but there are
some problems with it.
So there are different kinds of hazards for the pipeline,
different kinds of things that can happen
that break our pipeline.
And there's structural hazards.
So say, for example, we have different pipeline stages
that need the same functional unit on the CPU.
So say, for example, we would have multiple stages that
need the memory unit or something like that,
or need the decode stage or something like that.
Then this means, well, we have a contention there.
They have to wait.
This will break our pipeline.
And this can happen, right?
We will have multiple pipelines also on a single CPU.
And if all of them need the same functional units,
we'll have a problem.
They basically have to wait.
Even if, say, for example, we have multiple pipelines
but we don't have enough ALUs on the single core,
well, they have to wait.
It doesn't work any other way.
But then there's also data hazards,
meaning that we have two operations, basically,
or two instructions that work on the same data.
And one instruction is not ready with the result
when the other instruction needs it.
So we might be in this register fetch part,
where we actually
want the result of the previous instruction,
but the result is not there yet.
We're not in the writeback phase
of the first instruction yet.
So then we basically have a problem here as well.
We might have control hazards, meaning
we have different kinds of branches,
and all of a sudden the branch goes the wrong way, or we have
to wait for the branch condition to be evaluated.
That might be a complicated calculation as well.
Well, then we're losing our pipelining again.
And then, of course, data access.
If we have to go to memory, then, well, we cannot.
Memory will take a couple of cycles, 200 cycles, something
like that.
Within these cycles, if we really
have to wait for the data to become available,
the CPU won't be able to do anything.
So then all of these will basically
lead to pipeline stalls. And a pipeline stall
is basically, well, the CPU can basically not do anything while waiting for something.
So say, for example, we have one memory access unit. This is the example that I just gave you. So our instruction i
wants to do a memory fetch.
And then another instruction, i plus 2, for example,
is issued,
and they want to basically do the same thing.
So say, for example,
instruction fetch and memory access use the same unit, so whenever we're doing a
memory access, we cannot do an instruction fetch. Then, as soon as instruction i plus 2 wants to use the memory unit for its instruction fetch, we're in trouble and we have to stall.
Stalling means, well, the CPU in this stage doesn't do anything, because there
is no unit available to do it, so this pipeline will just be
stalled for a cycle, and we're basically decreasing our instructions-per-cycle counter.
So we're going to have fewer instructions per cycle, because in one cycle, we're not finishing
up one instruction.
So you can see here, this cycle, say cycle four or something, this instruction is ready.
Cycle five, this instruction is ready.
Cycle six, no instruction is ready.
So we have to go all the way to cycle 7
to get the next instruction.
And here, this is all about the CPU architecture.
And it depends on the workload, of course:
do you have a CPU that fits your workload?
I mean, you might even be able to code in a way
that provokes this, but you probably have to really try,
so that you only work with small parts of the CPU,
and then you get into these structural hazards.
So typically, the CPU designers will
try to make sure that the CPU is well balanced
and has proper hardware provisioning.
And then the compiler, and to some degree also the CPU, will try to do
some reordering of the instructions so that it doesn't come to these stalls.
But this might happen if you have a very specific
kind of workload.
Then we have the data hazards.
And this is basically whenever two instructions
or multiple instructions try to access the same data.
And one thing is we have the read after write. This is basically if a register
is read by an instruction before it's actually written by a previous instruction, right? So we
have our pipeline all the way at the end of the pipeline, we're writing the result. And now
another instruction will try to read this.
So say this next instruction will
want to work on this result. Then we might issue the read,
so the register fetch, before the register was actually
written.
And then we have a problem.
So then our results, like the result is not in the register.
We cannot read it in this cycle.
So then we need to put a stall in there.
Basically, the CPU somehow has to wait for the result in order
then to fetch the register and give it
to the next instruction.
Then there's also the possibility
of a write after read.
So basically, a later instruction writes a register
that an earlier instruction
still needs to read.
And these kind of problems can basically only happen
if the instructions are reordered by the CPU.
So the CPU, modern CPUs will reorder instructions
in order to get better performance,
in order to use up all the functional units properly.
So if some instructions take longer time,
need to read something from memory,
the CPU will try to do some other instructions
in the meantime.
But in this case, we might actually
write something after it's been reordering.
And an earlier instruction writes basically a value that should have been
read by another instruction.
If this happens, of course, then the read
must be stalled until the write has been done.
And write after write is the same thing.
So we're basically reordering two writes,
and then we might end up with the wrong result
unless we're basically stopping and changing the pipeline here.
So, let's look at an example.
So, here's some basic,
there could be some simple instructions, some assembly.
So, we have an addition, a subtraction, an and and an or
on the different kind of registers.
And real assembly looks something like this, right?
Usually, you have different registers.
The first one would be the result here.
And here, say, for example, in the at, we're writing to register 1.
And then in the subtraction, we're reading from register 1. And then in the substraction, we're reading from register 1.
And then now, if we're basically looking at this here,
then in the instruction decode and register fetch cycle here,
we're basically reading what's been written here.
So this would result in a problem.
Because this has not been written,
we don't have the result yet.
And in hardware, this is solved to some degree
with some bypass and forwarding hardware.
So, for example, in the instruction decode, you can basically say: OK,
the result will be in this register, and I'm directly using this result in my instruction here.
Then I don't have to stall the pipeline, but I need some extra hardware here, and in this case also, right?
So I want to access this value,
I know which register it will be in, so the CPU can pass the value through,
can say, OK, this will be ready in the cycle when you actually need it, we will have it there
for you. So then, in this execution, the data will already be there, and in this case as well.
However, it needs special hardware and special support. If not, we would have to stall the pipeline here.
So if not, we would basically do the instruction decode
at this stage when the writeback has already been performed.
Because then we know the data is in the register already.
Another example: here, we're writing back the register result of a load,
and then another instruction reads what the load instruction writes,
before it has actually been written.
And this one cannot really be mitigated, because we don't know how long the load will take.
So if we're reading with a load, then the hardware
basically needs to be stalled.
Until the load is performed,
the pipeline will be stalled,
and only then will it continue, in this case.
So here, with the load, once it has been executed,
within the writeback phase, we know where the value is going to be,
and we can actually continue here.
By scheduling the tasks in a different way, by reordering
the instructions, you can mitigate some of this, by basically not depending on
the loads right away all the time. Then we have the control hazards,
and these basically come from the branches.
As soon as we have branches, we have to know how these branches will be evaluated.
A simple way of dealing with this is: as soon as we're at the branch, we execute
the branch, flush the pipeline, and start from there once we know what the
branch result will be.
But of course this is costly because we basically have to completely flush the pipeline.
So here we have our branch instruction and we need to evaluate this instruction in order
to know where to continue.
And this basically means at this point, we're basically waiting and then continue with the target
instruction here.
If we don't know the result yet, then we
might even have to completely flush and continue from there.
But this means in this case, we're paying actually a lot.
So now the question is, how can we mitigate this in software?
So how can we deal with this?
A lot of this hardware will try to mitigate this.
But there's also some things that we
can do in order to mitigate this.
And some stuff, well, first let's
look at some of the things that are actually in hardware and that we can utilize in software.
And some of it is directly utilized by the CPU and the compiler.
So after the fetch, one thing is that there's superscalar hardware.
So there are multiple pipelines at the same time,
and this is even true for a single core today.
So we have the different instructions, and we're fetching them,
but then we might have separate pipeline stages or multiple units for
the same kind of instruction stage, but maybe with different functionality.
And multiple functional units would be something, okay, we have, say, for example,
the arithmetical logical unit or multiple of those.
We have loading and storing units, and we have floating-point units.
And this exists in every core today, right?
So you have this instruction fetch system,
but it might actually do multiple instructions at a time,
decode them, have kind of a buffer there.
You have the decode stages, and then you have multiple units,
or you even have multiple decode stages or units,
and then you have multiple units that execute different parts,
and some of them even replicated multiple times, even in a single core.
So this already will give you some mitigation of especially these structural hazards,
because basically we have multiple units at the same time.
We can use multiple of them at the same time. And also we cannot use all of them typically at the same time, we can use multiple of them at the same time.
And also, we cannot use all of them typically
at the same time.
So basically, there is some flexibility what
we're using at which time.
So the CPU will have a certain number of ports
where we can actually execute, or the core,
where we can execute multiple instructions in parallel or multiple pipelines in parallel.
And then, out of the functional units, we can choose which
ones are used or needed right now. And with this, we actually get instructions-per-cycle values
larger than one, because we have parallelism: not only pipeline
parallelism, but actual parallelism as well
with these different kinds of units.
Pipeline parallelism is also actual parallelism,
but here we additionally have superscalar parallelism.
One thing that the CPU does, because branches are so
bad, we're flushing the pipeline,
we have to wait for the result, is that the CPU
will try to guess which branch we will take.
It will predict and then just speculatively execute
that branch.
And this is basically branch prediction.
This is something that we saw in the very first sort example.
So this is good if it works;
it's bad if it doesn't work, because then we basically
have to go back, all of the work was done in vain,
and the CPU has to do the other path.
And the prediction needs to be done very early.
So usually there's some very simple buffer
in the CPU, branch target buffers or branch target
caches, that basically says: at this program counter,
so in this part of the program,
what was the last branch that we took?
Or what was the prediction that we made,
and was that prediction correct or incorrect?
And this is really done in parallel
to the instruction fetch.
So while the branching instruction is basically
fetched, the CPU will look up in the branch target buffer
or branch target cache and make its prediction.
And then basically based on that,
already schedule the next instructions,
meaning for the branch that it
predicts. And this basically, well if there's no entry for
a program counter then it will just use like a random prediction. If there's an
entry it will follow the prediction. And while the inner workings are actually branch secrets,
branch brand secrets, so the CPU manufacturers will do this as they think or will have their
own techniques. But we can see we can easily trick this right by using uniform distributions
and randomness. This won't work well. It will basically be wrong half of the time,
because it cannot do anything better there.
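The real predictors are proprietary, but a toy sketch of the basic idea, a table of 2-bit saturating counters indexed by (part of) the program counter, looks like this. This is purely illustrative, not how any specific CPU implements it.

```cpp
#include <array>
#include <cstdint>

// Toy branch predictor: one 2-bit saturating counter per table entry,
// indexed by the low bits of the branch's program counter.
struct ToyPredictor {
    std::array<uint8_t, 1024> counters{};  // 0,1 = predict not taken; 2,3 = predict taken

    bool predict(uint64_t pc) const {
        return counters[pc % counters.size()] >= 2;
    }

    // After the branch resolves, move the counter toward the actual outcome.
    void update(uint64_t pc, bool taken) {
        uint8_t& c = counters[pc % counters.size()];
        if (taken  && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};
```

A predictor like this learns stable patterns (always taken, always not taken, long runs) very quickly, but on uniformly random branches it is right only about half of the time, which is exactly the case mentioned above.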
But we can do something.
We can change something.
Also, the compiler can do something.
So rather than doing this branching,
we can actually say we're not going to branch.
We're turning this control flow, meaning "if the data is this, then do this;
if the data is that, then do that",
into a data flow, where we say: with this data, compute this;
with that type of data, compute that.
So we're changing this: we're always going to operate,
and then there's no branching.
The CPU doesn't have to select which piece of code to use.
The CPU always knows: okay, this is the next thing that I have to do.
And say, for example, here we have a selection query.
Again, similar thing.
On lineitem, we're doing a count star.
If we write a basic C++ or C program with a branch,
then we would say: if the quantity is smaller than n,
or equal to n, I mean, here it should be equal in this case,
then we're increasing our counter.
And this is a branching condition
if we directly translate it.
And this would look something like this.
So the CPU will speculatively always
try to find the right branch.
And if we have 0% selectivity,
the branching, the branch prediction will be good.
If we have 100% selectivity,
the branch prediction will be very good
because we're always taking the right branches.
If we have 50% selectivity
and we have a uniform distribution, of course,
then the branching prediction will be bad
because we're always running into the, like,
or in 50% of the time, we're taking the wrong branch
because we're just guessing.
So then the performance will be less.
And well, another way of doing this is in a data flow way,
rather than saying we're doing an if statement here,
we can just say we're always doing an addition,
but we're adding the result of this logical operation.
So this basically says, is this greater or smaller?
And this is not a branch or anything.
It's just a logic operation.
And the result will be true or false, or 0 or 1 in C++.
And we can just add this to our count.
Then we get the same result, basically.
But in this case, we'll always do the same.
Doesn't matter what the selectivity is,
because we'll always basically operate,
we'll always do this addition.
We'll also do the addition if it's not necessary,
because we're just going to add zero.
And this means the performance will look something like this.
In the cases where the branch prediction works well,
say for example with very low selectivity,
we never have to execute the increment and the prediction
will also almost always be correct,
so the branched code will be faster there.
In the cases where we get a lot of wrong branch predictions,
the predicated code will be faster, because it never incurs any branch mispredictions.
We're always just doing our simple addition,
which is cheaper than a wrong branch prediction.
A wrong branch prediction costs multiple cycles.
One addition is basically a single cycle extra.
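As a hedged sketch of the two variants being compared here, the following code shows a branched and a predicated count. The array setup, the name quantity, and the threshold n are made up for illustration and need not match the lecture's demo code exactly.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Branched version: the selection is translated directly into an if statement.
long count_branched(const std::vector<int>& quantity, int n) {
    long count = 0;
    for (int q : quantity)
        if (q < n)            // data-dependent branch; often mispredicted near 50% selectivity
            ++count;
    return count;
}

// Predicated ("no branch") version: the comparison result (0 or 1) is added
// unconditionally, so there is no data-dependent branch left to predict.
long count_predicated(const std::vector<int>& quantity, int n) {
    long count = 0;
    for (int q : quantity)
        count += (q < n);     // always executes; adds 0 when the predicate is false
    return count;
}

int main() {
    // Made-up setup: random quantities, threshold chosen for roughly 50% selectivity.
    std::mt19937 gen(7);
    std::uniform_int_distribution<int> dist(0, 99);
    std::vector<int> quantity(10'000'000);
    for (int& q : quantity) q = dist(gen);

    std::printf("branched:   %ld\n", count_branched(quantity, 50));
    std::printf("predicated: %ld\n", count_predicated(quantity, 50));
}
```

Compiled without optimizations, the branched version typically suffers around 50% selectivity; with higher optimization levels the compiler may turn the branch into predicated code itself, as discussed later.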
And well, this is of course also done in hardware;
the CPU will also try to figure this out
while it's running.
And I mean, we can also do this
with different kinds of conjunctive predicates.
Say we have multiple predicates in a conjunctive query
and we combine them with the logical AND (&&).
This will typically translate into many branches;
if we write this kind of code,
it will be translated into one branch per predicate.
Alternatively, we can use a bitwise AND (&),
which will be translated into fewer branches,
basically just one branching condition.
But we could also do this again in a predicated way,
producing the same kind of result
by just doing the operations on each individual tuple
rather than using an if statement
or a branching statement here.
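A hedged sketch of the three strategies mentioned here, using two hypothetical integer columns a and b and made-up constant thresholds:

```cpp
#include <cstddef>
#include <vector>

// 1) Logical AND (&&): short-circuits, typically one branch per predicate.
long count_logical_and(const std::vector<int>& a, const std::vector<int>& b) {
    long count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] < 50 && b[i] < 50)
            ++count;
    return count;
}

// 2) Bitwise AND (&): both predicates are always evaluated, one branch remains.
long count_bitwise_and(const std::vector<int>& a, const std::vector<int>& b) {
    long count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if ((a[i] < 50) & (b[i] < 50))
            ++count;
    return count;
}

// 3) Fully predicated: no data-dependent branch at all.
long count_predicated_and(const std::vector<int>& a, const std::vector<int>& b) {
    long count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        count += (a[i] < 50) & (b[i] < 50);
    return count;
}
```

Which of these is fastest depends on the selectivity of the predicates.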
And well, if we're using query compilation,
so if we're generating code
while we're executing the queries, or when we already have some statistics
over the data, then the query compiler can actually use a cost model to find out which variant makes sense.
So then we can basically see, and this is basically from a paper by Ross.
You can see that if we're just using the no-branch version,
as we would expect, we get this very clean flat line,
because we always do the same kind of work:
we're looking at each individual tuple
and performing something.
If we have a very low selectivity,
then this bitwise AND will basically be better.
And then the regular logical AND will basically
be lower, because then the branch prediction will be good
and we're doing less work;
we don't have to do the work on every tuple.
The compiler will order this so the CPU just does less work.
And the bitwise version might be a bit better in some cases,
especially in the cases where the regular logical AND would be slower.
But it will also have more overhead
than the predicated version at a higher selectivity. And let me see, I actually
have an example for this somewhere; I don't remember where I wanted to put it. I have it later, so let me show it later. Okay. Then the CPU can also do some
rescheduling and some out-of-order execution. It does so by having reservation
stations. So the instruction stream is basically read from the instruction cache.
So it will be loaded into the instruction cache.
And all of the instructions that are going to be executed one
by one later on will then go into reservation stations
for the proper units.
So all of the instructions that need to do some memory access
will go to the memory units, the ones that do some floating point operations
go to the floating point units, et cetera.
And they will then be executed when the hazards are cleared.
And as soon as we have data hazards,
as in this kind of count operation,
then we basically also have a limit on parallel execution.
So this is a simple example.
Here again, we're basically going through an array
and doing the predicated version of our count.
So we're going to execute this on every step,
but there's not that much leeway; we cannot execute this very much in parallel,
because each addition depends on the counter's previous value.
However, what we could do is add additional parallelism
by splitting the array up, or the cursor up into two cursors,
and then running through two different parts of the array in parallel.
And that gives us some additional parallelism here, because we directly say
that these are independent parts of the data; they won't be touched by anything else.
So we'll have basically half the number of loop steps
with twice the instructions per step.
And these instructions can then basically be reordered
and put into different kinds of functional units
as well.
So by giving the CPU this,
it might use a bit more parallelism.
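As a hedged sketch of this split-cursor idea (the names and the handling of an odd array length are illustrative, not the lecture's exact code):

```cpp
#include <cstddef>
#include <vector>

// Two independent counters break the single loop-carried dependency on one count,
// so the two additions per iteration can overlap in the CPU's functional units.
long count_two_cursors(const std::vector<int>& data, int n) {
    long count0 = 0;
    long count1 = 0;
    const std::size_t half = data.size() / 2;

    // Two cursors walk two independent halves of the array in the same loop.
    for (std::size_t i = 0; i < half; ++i) {
        count0 += (data[i] < n);         // depends only on count0
        count1 += (data[half + i] < n);  // depends only on count1
    }
    if (data.size() % 2 != 0)            // odd length: one leftover element
        count0 += (data.back() < n);

    return count0 + count1;              // combine the partial counts at the end
}
```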
In practice, and I might show you later, let's see, either this time or next time,
the CPU or the compiler does this kind of optimization automatically for you.
But if the compiler doesn't do it, this will actually give you some performance benefit.
And then the CPU will basically reorder the instructions in the reservation stations and
give you a bit better performance.
So say, for example, we had our earlier example where we have the initial branched version,
then we have the predicated version that's somewhere here,
and this might give us a bit better performance
because of this additional reordering.
And this is something we'll briefly try,
and then we're gonna stop for today.
So I'll show you, it's basically the same code here.
So I'm starting by filling an array with a certain number of values of random data
and counting basically the number of entries that are less than half of the max int value.
So a random generator gives us values from zero to max int, and we're counting
the ones that are less than half of that. Initially we have this simple branch here, and I'm doing an
initial count just to warm up the cache; that doesn't actually help that much, but theoretically it could. Then we're running the regular branched
version, then a predicated version, and finally a version
with the two individual counters, where we basically sum them up afterwards. And now, I
mean, you can already see the result here. We can see that the branched
version is actually much slower than the predicated version. And the two predicated versions
are almost the same, so there's not that much difference there.
But, I mean, you can see the effect. However, the compiler does this for you as
well if you're using optimization parameters. So here I basically have the same code;
I'm going to compile it again with optimization flags and run the same code here. And you can see,
well, the results are basically identical in this case. And this is because the compiler,
rather than leaving this part as a single branch,
will actually do some predication here.
And it will also unroll the loop.
So it will basically make multiple instances
of this operation.
And then, just as in the case
where we ourselves made multiple instances of this instruction
or this statement, the CPU can reorder these as it
pleases and can basically use this additional parallelism
here. And how you can see this, I will show you next time.
For this time, I would say we're done here.
Do we have any questions?
Yes.
So about the misprediction from the branch prediction: maybe the CPU then needs to have a copy of all the registers
or data changed at the same time?
It has to be reset after the wrong prediction?
So that it has a copy of all the registers?
So if it does the wrong prediction, yes, I think so.
So it will basically then restore whatever state
was there before and start with the other branch. And this is why this takes so long,
right? On the one hand, it has already done more execution until then, and that
takes time; but then it sees the branch condition is wrong.
So then it needs to go back and start basically from scratch
where it was.
It needs the same state again.
Yes.
OK.
Other questions?
No other questions?
OK.
Next time, we're briefly going to talk about prefetching and then finally start
with vectorized execution with SIMD.
Thank you very much.