Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Instruction Execution
Episode Date: April 24, 2024...
Transcript
Okay, so let's get started. We're going to start with the instruction execution today, but I'm just using this for two announcements.
One that's important for you is there's still this survey on who wants to participate in the programming exercises, and it closes today. Because we need your information in order to set up all of the environment.
So if you have not done so yet and you want to participate in the course,
please do this now. Maybe we should also do an announcement in Moodle,
so people who are not here today, and there are probably a couple,
will also see this online.
And then we have our joint seminar with TU Darmstadt this week again. So today at 4.30,
if you're curious, Julian Eisenschloss from Google DeepMind will talk about visual language,
how generating drives understanding.
So there's a Zoom link you can join.
This is online.
The Zoom link is also in the slides that you can find in Moodle.
With this, I'm going to switch to the other slides and go back to where we left off.
So we saw this as a teaser: we saw that example database workloads do much worse than average programs on typical CPUs in these benchmark comparisons.
Well, not worse in performance per se, but they have much worse cache behavior,
with more misses in both the instruction and the data caches.
So let's check or let's see why that is, right?
So how does this happen?
I mean, technically a database knows very well how its data is accessed.
So caching should actually work quite well because it's all about data movement, etc.
So if we know exactly how we're accessing this, we should also be able to optimize for this.
So let's see why that is actually a problem.
So why do database systems show such poor cache behavior?
And there's two things.
On the one hand, you have poor code locality.
So traditionally, and I've talked about this quite a bit
in the first sessions. Traditionally, database systems were
built for this bottleneck of disk.
Because this was the major bottleneck, caches, et cetera,
were not really a problem.
So there, you wouldn't have to be efficient there.
So it was somewhat over-engineered in the sense that you have polymorphic functions.
Say, for example, you're resolving the data types or attribute types for each processed tuple individually, every time.
And that means you have lots and lots of code flying around in your active working set.
And then you have this Volcano iterator model, which allows for pipelining.
That means you have the whole query plan, and each individual tuple will be passed through the whole query plan, essentially.
So for each individual tuple, you will go through a lot of code.
Queries can potentially be large, and each of these individual operators is a polymorphic function, so you will walk through quite a bit of code for each individual tuple.
That means the code will not fit into your instruction cache, and we will get cache misses for the code.
And the same is true for data, right? So database systems have to walk through a lot of or large amounts of data quickly,
which in general is not great for caches, right?
So if you're just scanning through large amounts of data, that's, well, it's not
cache friendly in the first place.
But on top of that, often we're not even using the data that we're loading efficiently.
So if we're just interested in small parts of the data, but we're loading all of the data into the cache, then we're polluting our cache with data that's not really relevant.
And we're going to be slow again, right?
We're going to get a lot of cache misses.
And at the same time, walking through something like a tree means randomly jumping around in memory.
I've already mentioned this, right?
So this is also how we can actually check cache performance
by just doing random jumps in memory.
Like we will always go out of the cache
with many of our accesses.
So we have to think about how maybe we can do this
more cache efficient if we know that most of our data
will be in memory somehow.
Okay, so how can you do cache-friendly code?
We'll talk more about the data in a bit,
but first let's think about the code.
And even for your code, you can optimize, right?
So you can make your code more cache-friendly.
I mean, you can optimize how data structures are organized.
We will see this in a couple of sessions
where we have a specific lecture on data structures
and not only how they're organized
in terms of cache lines, et cetera,
but also how they are accessed, right?
So are you going to access them in a way that gives you code locality and cache locality for the data?
And then in general, I mean, all systems will favor cache-friendly code.
Of course, if the cache is not your bottleneck, maybe you don't want to go all in and optimize for cache-friendliness, because it's not going to help you much if each individual interaction goes through the network or something like this anyway.
But as soon as the caches are your bottleneck, it actually makes sense to optimize there or at least check a bit more.
But the tricky part about this is that the caches are very different on different systems.
So these kind of optimizations are very platform dependent.
So we've already seen that this processor that I have in my machine
is very different in terms of caches compared to an x86 server.
And even within the different generations of servers, the caches will look different.
So if you want to write cache-friendly code, you also have to think about whether this will just run on one system, or whether you're only benchmarking it.
In a production environment you probably want to be able to port it to something else: to a newer version of your chip, maybe to a different architecture, right?
Right now we see a huge shift from x86 to ARM.
Then you will have to deal with this somehow.
And well, that's basically the third point here.
If you can make this generic, then you can still have performant code, but it's also portable.
So, generic code that can still be cache friendly means that you have a reasonably small working set.
You're not working with very large amounts of data in a small loop or in a small piece of code.
Use small strides; we already said that.
So if you have an array, you're not jumping around in the array everywhere, but you're staying local, or within your data structure you stay local, then you get cache friendliness.
And you focus on the inner loops, right? If you have nested loops, the inner loop is the one that will be executed most frequently, or is temporally local, and that needs to be cache friendly.
This is basically where the time will go.
If this inner loop is not cache friendly, then you will have essentially constant cache misses, as in the sketch below.
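To make the stride point concrete, here is a minimal sketch (not from the lecture slides) in C++: the same sum over an n-by-n row-major matrix, once with a stride-1 inner loop and once jumping a full row per access.

#include <cstddef>
#include <vector>

// Row-major loop: the inner loop walks memory with stride 1 (cache friendly).
long long sum_row_major(const std::vector<int>& m, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)        // outer loop: rows
        for (std::size_t j = 0; j < n; ++j)    // inner loop: consecutive addresses
            sum += m[i * n + j];
    return sum;
}

// Column-major loop: the inner loop jumps n elements per access (cache unfriendly).
long long sum_col_major(const std::vector<int>& m, std::size_t n) {
    long long sum = 0;
    for (std::size_t j = 0; j < n; ++j)        // outer loop: columns
        for (std::size_t i = 0; i < n; ++i)    // inner loop: stride of n ints
            sum += m[i * n + j];
    return sum;
}

Both functions compute the same result; for large n, the first one is typically much faster simply because its inner loop stays within the current cache lines.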
Of course, at the same time, I mean,
in our programming exercises, that's maybe not true,
but in general, you don't want to optimize too prematurely
because, I mean, you also have to check
what is actually the hotspot.
I mean, you can optimize your code a lot in terms of cache performance, but if you have another bottleneck, it's not going to help you much, right?
So you should always check what your bottleneck actually is.
Am I already utilizing my other system resources properly?
Then it makes sense to actually go there.
And we'll have a profiling session later also in this course, where you will get some insights
how to look for these hotspots and how then to optimize for these.
So, in some examples.
Okay, so with this, let's look a bit more
into this data layout idea, right?
This is not only true for cache friendliness,
this is also true in general.
If I'm accessing data across like large amounts
of disk space, etc.
Whenever I have some kind of blocked access, or my random access is more expensive than my sequential access somehow, so the different access patterns have different performance, then it makes sense to think about data layout.
How are we storing the data?
How are we accessing the data, et cetera?
And well, so the question is, how can we improve data cache usage?
But again, as I said, this is also true for how can we improve disk usage, essentially.
So you can just do the same thing.
It's just a larger scale.
It depends if your data will end up in disk or not. And so essentially we have to think about
different data storage models and query execution models, and we have to think about temporal and
spatial locality. So can we somehow get temporal and spatial locality
in our data accesses?
And then based on that, or using different kinds of execution models,
how we actually execute our queries,
we can also improve this to some extent.
So let's look at a simple example.
So we're counting the number of line items.
So this is line item.
If you ever see line item in an academic setting,
you know this is the TPC-H.
So this is basically one of the benchmarks.
You probably heard briefly about this.
So one of the standard benchmarks for OLAP processing or OLAP databases.
So we are counting the line items
that were shipped on a certain date.
And typically this requires a full table scan,
unless we have some kind of partitioning already
or we have some kind of index already,
but let's leave that aside.
So no helper structures.
Let's go through the full table.
And now the question is, how much data will we have to access?
So the line item table will be the largest.
So if we have scale factor one, this will be 700 megabytes.
And this is of course not the scale that we will typically be using; in practice it might be much larger.
And now just storing the data,
we have different options to do so, right?
So I mean, first thing that we think about
and that we naturally would always store the data in
is the row layout.
So basically we have these line item rows or tuples and we're just going to store them one
by one, one after the other. And that's how most databases did it for a long time. And that's
actually quite efficient if you want to write a lot of data because you get the tuples in a row layout and then you just write them.
An alternative is the column layout.
So rather than storing row by row,
so each individual tuple,
we can also store attribute by attribute.
So that's a different way of storing.
And that basically means that similar attributes
will be closely located or will have spatial
locality.
So, line item is an OLAP table, which means it has a lot of attributes.
I don't remember exactly, it's something like 20 or so attributes, so it's not super large, but it's already sizeable.
In practice you will see tables that have a hundred attributes, or a thousand attributes in larger-scale systems.
And then it makes a huge difference how the data is laid out: either I have all values of the same attribute next to each other in memory, or I have the individual rows next to each other.
So, if I need the full row for my data processing,
of course it makes sense to have this row layout.
If I only need a single attribute,
of course it makes sense to have this column layout.
Okay, so now if we have like a full table scan
and we have a row store
and we're doing what we saw just earlier, right?
So this means we have our tuples in memory laid out,
and the ship date is somewhere in the middle in this tuple.
So, it's a very tiny fraction of the tuple,
and this is all we're actually caring about, right?
So, this is like our cache block boundaries,
and we can see, well, we're not going to load all of the cache blocks, so we're not completely inefficient by loading the full rows, but we have to load a lot of data that we actually don't need.
With every access to the ship date field, we load a lot of data that's not relevant.
Let me see if I get this here.
So basically, if we load this ship date, for example,
we have to load the full cache line.
So everything else is irrelevant for us.
If we have a column store or a columnar layout,
that means basically all of the ship dates will be close to each other. I mean, here I'm putting them at the beginning of the table. Of course, this will not be the first attribute, because we already saw
there are other attributes before, but with a certain offset this will look something like this. So they're tightly packed, and then
if we're loading all these ship dates, so it's the same amount as on the previous table,
or in the previous slide,
so then all the data that we're loading into our cache
is actually relevant for the query, right?
So this means we're much more efficient in loading the data.
We're not gonna pollute our cache with additional data.
We have a lot less accesses to memory in there.
Right? So this means we have also less data, so this will probably be much more efficient.
And on top of that, we know the memory access is actually the slow part.
This is where our CPU will wait until something happens; the comparisons might cause mispredicted branches, et cetera, but still, this part, just going to memory and bringing the data back, is where most of the time is spent.
But here, we're loading a lot of relevant data
at the same time, and we can do a couple of loops
on these cache lines
because they are all relevant, right?
So this means we'll hopefully amortize the costs
for the fetch over more tuples, right?
And if we're very lucky, the whole attribute might even fit into the cache, and then everything will be very fast.
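As a rough sketch of what such a scan looks like in code (not from the slides; the struct is heavily simplified and the field names are made up), counting line items with a given ship date in a row layout versus a column layout might look like this:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified lineitem tuple (the real TPC-H table has 16 columns).
struct LineItem {
    int32_t orderkey;
    int32_t partkey;
    double  quantity;
    double  extendedprice;
    int32_t shipdate;    // encoded as days, just for illustration
    char    comment[32];
};

// Query: SELECT COUNT(*) FROM lineitem WHERE l_shipdate = :d  (full scan, no index)

// Row layout: every tuple access drags the whole ~64-byte tuple through the cache.
std::size_t count_rows(const std::vector<LineItem>& table, int32_t d) {
    std::size_t count = 0;
    for (const auto& t : table)
        if (t.shipdate == d) ++count;
    return count;
}

// Column layout: only the shipdate column is touched, every loaded cache line is useful.
std::size_t count_column(const std::vector<int32_t>& shipdate, int32_t d) {
    std::size_t count = 0;
    for (int32_t s : shipdate)
        if (s == d) ++count;
    return count;
}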
So of course, this is great.
So now you're saying,
oh, why don't we do column store all the time?
So why would anybody ever do row store anymore?
And well, there's a trade-off of course, right?
So it's not the end of the story
because many queries don't just use one single attribute.
But in the end, they might still use the whole tuple,
or at least parts of the whole tuple.
And that's additional cost.
It's not only just fetching this,
but we have to recombine this somehow later again.
That means we have to perform these kind of joins.
And this means it's a workload-dependent trade-off. Whenever I'm just writing the data, the column store will be much more expensive, because I have to decompose it.
And whenever I'm writing the data and then always going to retrieve and work with the full tuple, the column store will just kill me, because I basically only do overhead work: I'm decomposing the data first, re-layouting it in memory, and then I have to recombine everything later on again.
So it's a trade-off.
And because of that, there's also hybrid approaches.
I mean, well, who would have guessed, right?
And one idea is PAX, the Partition Attributes Across layout.
And that basically means we're doing this column store idea,
but we're just doing it on individual memory pages, right?
So we're not doing it for the whole table, but we're just basically splitting this up
into mini pages and doing this for partition of the data.
So that means we're again a bit more flexible in how we lay out the data.
We don't have to manage the complete column at once, all our changes will be more local, and the recombination cost will still go down a bit. A minimal sketch of the idea follows below.
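A minimal sketch of the PAX idea, with hypothetical attribute names and only fixed-width attributes (a real PAX page also keeps offsets, presence bits, and other metadata), could look roughly like this:

#include <cstddef>
#include <cstdint>

// Very simplified PAX-style page for two fixed-width attributes.
constexpr std::size_t kTuplesPerPage = 1024;

struct PaxPage {
    std::size_t tuple_count = 0;
    // Mini-page 1: all values of attribute A, packed together.
    int32_t a[kTuplesPerPage];
    // Mini-page 2: all values of attribute B, packed together.
    double  b[kTuplesPerPage];
};

// A scan over attribute A touches only the first mini-page of each page,
// while reconstructing a full tuple i stays within a single page.
inline int32_t get_a(const PaxPage& p, std::size_t i) { return p.a[i]; }
inline double  get_b(const PaxPage& p, std::size_t i) { return p.b[i]; }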
And of course, there's also hybrid storage models
in terms of temporal or in terms of how we insert data.
So rather than saying,
we're always gonna use the column store,
we could just say, well, the writing part,
if we get new data that comes in a row fashion typically.
So let's keep it at that, right?
So let's keep it in a row fashion.
Let's keep the working set there reasonably small.
So we don't get into this trouble of having large amounts of data in a row-oriented fashion, although we might want to access it in a column-oriented fashion.
And if we have some updates still on this data,
that's also faster than having to somehow realign
the columns individually.
And as soon as the data is old enough,
or our buffer for this row-oriented
storage is full, then we will move it to the column store.
So this is what basically many in-memory database systems these days do.
Systems like HANA, for example, have a delta store, which holds whatever comes in new, and then they have a main column store for the older data that's still accessed frequently, but mostly by these analytical workloads, where we can then do faster scans through the individual columns.
Okay, questions so far? Right, so makes sense. I mean the tricky part is basically
when to do what, right? So this is basically, it's really dependent on your workload.
So if you're just reading through the data, if you're just scanning parts of the tuples,
like individual attributes,
then for sure the column store makes a lot of sense.
If you're just updating, like you're getting new tuples,
you're updating these tuples,
then the row store makes more sense
because you don't have as much additional maintenance work
within these individual tuples, right?
In the hybrid layout, if your updates stay local,
basically you don't have the same amount of reorganization.
So this PAX approach, for example, makes sense in the cases where you have some updates, but the updates stay local and most of the accesses are still column oriented.
Is this a question?
Yes.
Maybe this is not a smart question, but why do we even bother putting the data in the cache in this case, if we're just doing the count operation? Because I don't expect to re-access the table values.
Right, so the question is why we put the data into the cache in the first place, and it's not a dumb question at all. The thing is, basically all data will always go through the cache.
So let's go back here. I cannot access just this individual value, let's say it's four bytes; these individual four bytes I cannot access on their own. I'm always going to read 64 bytes into my caches. This is my cache line, and this is how accessing memory always works. So I will basically always have this overhead.
This will always somehow pollute my cache.
And with this layout I don't have this problem anymore.
Just to reiterate, technically in this case we could just overwrite the same cache line?
Yes. So the question is, can we overwrite the same cache line? Yes.
In this case, if we're not going to reuse the data, there are actually some instructions or hints for the processor and the compiler where we say: we need this data, so maybe you want to prefetch it, but we're only going to use it once, so you don't need to keep it cached; it can be flushed out of the cache quickly.
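As an illustration of such a hint (a sketch, not from the slides): GCC and Clang expose __builtin_prefetch, whose last argument is a temporal-locality hint, where 0 roughly means "used once, no need to keep it in the cache".

#include <cstddef>

// Sketch (GCC/Clang builtin): prefetch data we will read soon,
// with locality hint 0 = "used once, no need to keep it cached".
long long sum_once(const int* data, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Hint: fetch ~16 elements ahead, for reading (0), low temporal locality (0).
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/0);
        sum += data[i];
    }
    return sum;
}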
In general, even if we're just doing scans, we still hope that eventually there is some data we will reuse.
In the instruction cache, of course, we're going to reuse something, and maybe some other values in the data cache, like some variables that we will use again later on. So that basically means we will always use the cache somehow.
For the scans themselves, the cache doesn't help us that much, but the access granularity makes a difference here.
Make sense?
OK.
If there's more questions, feel free to speak up.
Okay.
So, we saw this trade-off.
Now, as a last topic, sort of along the same lines, right?
So, basically, we can, with the data layout,
we can make stuff more cache efficient.
Now, there's an additional thing where we can make our layout more cache efficient.
But it's not so much data layout.
It's more data structure layout.
And that's alignment.
So alignment or memory alignment is basically the idea that your variables or your data starts at an address that is aligned,
which basically means that if you're starting at address 0, basically everything is always aligned.
And then you could say, I'm starting at address 1,
for example.
Then this is only aligned if your data has size 1.
So if your data is 1 byte, then starting at address 1
would basically make sense.
If we're starting at address two,
this would be aligned for values that have size two.
If we're starting at four, so from zero to four,
then this would basically be okay
for values that have size four.
So that's basically the definition of alignment. You can think of it like this: if I have a data structure, it's going to start somewhere within a cache line. If it starts at some odd offset in there, then my access will not be aligned, and the CPU basically has to remap this somehow so it fits into the registers.
And that's costly.
So you basically need word- and cache-aligned attributes for cache efficiency; otherwise the CPU needs to do extra work.
In general, today the CPU will cover this up for you.
The compiler and the CPU will basically
make sure that your code still runs.
In the past, your data, like your variables, et cetera, would have to be aligned, otherwise freaky things would happen, because the way the data ended up laid out in the registers and so on would go wild.
Nowadays this is of course handled, but if it's not aligned, there is extra work that needs to be done. The compiler will typically pad this, because then the alignment is better and it fits better with the way the CPU works.
So let's look at this, right?
So we have an integer array, we have a long, and we have a pointer to some structure or struct.
The pointer will be 8 bytes (not bits, bytes; sorry about this).
And the integer array will be 16 bytes, so 4 bytes per int, essentially.
And the struct in the end will end up in memory, right?
So this, and the compiler cannot change the order.
So the compiler basically needs to represent this in the same way,
and this is nicely aligned here, right? So we can see this starts at 0,
and then each individual item basically
ends at the nice boundaries such that the data or the individual items are aligned and that
the whole thing basically fits nicely into the memory.
The way the struct is laid out in memory will always follow the way you define it, right?
So if you define it this way, in this order, it will also be laid out in this order, even if another order would be better.
The compiler will not change this, just for safety reasons, et cetera.
The compiler will also determine the overall size
and the position of fields such that it is aligned.
And so that's basically what it does.
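To check this yourself, here is a small sketch of the struct from the slide (the field names are made up; offsets and sizes assume a typical 64-bit x86 target):

#include <cstddef>   // offsetof
#include <cstdio>

// The struct from the slide, with hypothetical field names.
struct Record {
    int     a[4];   // 16 bytes, aligned to 4
    long    b;      //  8 bytes, aligned to 8
    Record* next;   //  8 bytes, aligned to 8
};

int main() {
    // Everything already falls on its natural alignment, so no padding is needed.
    std::printf("offset a    = %zu\n", offsetof(Record, a));     // 0
    std::printf("offset b    = %zu\n", offsetof(Record, b));     // 16
    std::printf("offset next = %zu\n", offsetof(Record, next));  // 24
    std::printf("size = %zu, align = %zu\n", sizeof(Record), alignof(Record)); // 32, 8
}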
And for good system performance,
Intel and other processors say that the data needs to be aligned or should always be aligned.
The memory is accessed in these word-sized chunks.
So whenever your data overlaps word boundaries, or is weirdly offset within them, then your accesses will be slower; there is extra work for the CPU.
So essentially also around the cache line boundaries, if it's not well aligned, then we need to load
extra cache lines, we need to remap this somehow so that it fits in the registers, etc.
The hardware will work correctly, but it will be slower.
It will basically do some extra management.
And the alignment, I basically tried to somehow explain this already.
I always find it a bit hard to explain.
So essentially, if you have a single byte, there are no restrictions.
If you have two bytes, then the lowest bit of your address needs to be zero, so you cannot start at an odd address, you cannot be shifted by one.
If you have four bytes, you cannot be shifted by one, two, or three; it needs to start at zero or at four, and so on.
So the larger your data, the more space you potentially need in between.
Either, I mean, if you only have elements
of the same data type, then alignment is easy, right?
It's just gonna be aligned naturally.
But if you're mixing this in a structure or in a struct,
then you have these problems,
then you can think about this, right?
So, and here we have an example.
So we saw this struct earlier or something similar.
And here we're starting with the character.
Then we have an integer array and then a double.
And here you can see we're starting with the single byte.
Then we have these two integers.
And there, all of a sudden, they're not aligned anymore, because they're not starting at 0 or at 4, but at 1, which is not aligned.
So the compiler, in order to make this efficient, will actually shift this: it will insert some padding in order to make this faster.
You can still force it to be packed, but then, because it's not padded, the CPU has to do extra work to fix up the alignment later on in order to properly work with the data.
So essentially, this padded version here will have better performance than the packed one.
Even though we're using more space, so we have to load more data and our caches fill up more quickly, it will be faster, just because otherwise there is much more overhead that the system has to do.
And additionally, the other thing you can do is figure out what the proper alignment is and reorder the fields, change the struct in a way that it is naturally aligned, so that no padding is needed.
If that's possible, if your struct still makes sense that way, then you get the best performance, essentially, because you don't have the padding and the CPU doesn't have to do any realignment. Okay.
So let's look at what this means from the compiler's point of view. But first there's a question.
Yes?
Sorry, say again?
So that's basically the question: why doesn't the compiler reorder the fields in the struct automatically? And as a rule of thumb, does it work to always start with the biggest variables, so the doubles first, then the integers, and the characters last?
So the question is, does it always help to start with the largest variables?
I mean, most likely it will mostly work, but in the end it's something like a packing problem, maybe even a bin packing problem, if you have very large structs. So in general I would assume that this probably helps, but you can also know
how large your individual values will be, and then, or individual data items, and then
basically make sure that you have the alignment.
What makes sure that my struct itself is not already starting at the offset?
So the question is, how do I make sure that the struct is not starting at an offset?
I think this will always be starting.
The compiler will make sure that this will start at the address,
like a zero address or something proper.
Okay.
More questions?
Okay, so let's look at what basically the compiler does.
So the compiler always maintains the ordering that you defined; it cannot change it.
And each field should be aligned within the struct.
So, I mean, that's basically what the compiler also does.
The CPU will work also without that, but it will be much slower.
And you can use offsetof in order to get the actual field offset.
So, you can basically check what's my offset
of an individual item in the struct.
The overall struct will be aligned
according to the largest field.
So basically making sure that we have an alignment there.
And the total struct size must then be a multiple
of its alignment.
So there we also might see some additional padding in order to have an aligned struct.
So this is basically your question.
So the compiler will basically add some padding in order to align the structs within.
And then variable-length data will be split into a fixed-size header part, basically the pointer to the variable-length data, and a variable-length tail that the header points to.
Any kind of variable-length data will be placed at the end of the struct.
So there we have some flexibility.
And since, as we said, the compiler needs to respect the order of the elements as they are declared, we can save space ourselves by ordering the fields in a way that fits the alignment, or at least makes the alignment work better and more efficient.
So say we're starting with a character and then have an int: then the compiler will add some padding in between, and maybe some additional padding at the end.
If we reorganize this, in essence putting the largest element first, then we get better alignment and less padding.
So all in all, this will be more cache efficient.
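A small sketch of the effect (hypothetical fields; offsets and sizes are for a typical x86-64 target). Here the interleaved version needs padding around each small field, while putting the int first removes most of it:

#include <cstdio>

// Small and large fields interleaved: the compiler has to pad twice.
struct Padded {
    char a;   // offset 0, then 3 bytes padding so that i is 4-byte aligned
    int  i;   // offset 4
    char b;   // offset 8, then 3 bytes tail padding to keep the array stride 4-aligned
};            // sizeof == 12

// Largest field first: only 2 bytes of tail padding remain.
struct Reordered {
    int  i;   // offset 0
    char a;   // offset 4
    char b;   // offset 5
};            // sizeof == 8

int main() {
    std::printf("Padded: %zu bytes, Reordered: %zu bytes\n",
                sizeof(Padded), sizeof(Reordered));
}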
And this is an experiment that I didn't try.
So I'm somewhat surprised about the numbers
and eventually maybe we'll try it out, right?
So then I'll update you on this.
But apparently if you don't use alignment,
you get actually really bad throughput.
So this is, I took this out of a colleague's slide deck.
They made this experiment.
And so reordering and padding can actually make a significant impact
on how fast you can access the data.
Again, I'm somewhat surprised about these numbers.
We might want to check this eventually.
And then we'll give you some small code to try it out yourself.
Okay, good. Questions so far? No questions? Then let me switch gears into instruction execution.
Okay, so I'm going to start this quickly.
Let me see how long.
Start this quickly before we do a break.
And, yeah.
Okay, so in the rest of the lecture,
we'll see how far I get.
But we're going to talk about instruction execution
and different kinds of hazards for instruction execution,
for instruction pipelines.
And so in general, if you think about instructions,
so if you think about your assembly, et cetera,
this assembly is not what is actually executed.
So at least that's not the end of the story.
So the CPU does not just get an individual assembly instruction
and then execute this magically and continue with its life.
Right, so but actually this is further broken down. This is broken down, each operation
is broken down into micro operations and these are then executed in pipelines.
In this example it's four micro-operations; in practice, these pipelines might be six or seven or eight steps long.
So individual micro-operations contribute to a single higher-level, essentially assembly-level, instruction.
And from an abstract point of view, this is basically all you need to know.
Some of these might be further broken down.
You might have different steps depending
on the type of instruction you see.
But essentially, initially, we're
starting with fetching the instruction.
So whatever needs to be done next,
so we're getting the instruction,
and we're updating the program counter.
So the program counter basically tells us where we are in the program.
So we get the next instruction in the program, which will be in some buffer, we read it, and we increase the program counter.
Then we decode the instruction, so we check what this actually is and what we need to do, and we fetch the registers. So we read the registers in order to do the actual operation.
This will also evaluate any kind of branch conditions, and this is where branch prediction then comes in, right?
And then we have the actual execution cycle.
So this is basically where the calculation is done.
So if we have like a simple math operation, then the ALU, the arithmetic logical unit, will start doing its work.
So do some addition, do some subtraction, do some logical operation like AND and OR
and stuff like this.
Or if this is a load or store operation, then it will basically do the read or write from memory.
And this, of course, will take a bit longer again.
And then we have the write-back cycle.
So it's not the write-through, it's the write-back cycle.
So this is where then basically the data, the result will be put into the register. This could also be from the memory subsystem.
So we basically have the memory load
and here then in the memory load,
like in the write back phase,
we'll write whatever we loaded from memory
into the register.
Or it's the result that comes from the ALU,
which will then end up in the register as well.
And if you ever see front-end stalls, back-end stalls, et cetera, this is basically where that comes from. The instruction fetch and the instruction decode and register fetch cycle, that's the front end; the execution and write-back cycles, that's the back end.
So if we have a stall in the execution, then we don't have enough execution units; if we stall in the front end, we're not getting the instructions fast enough.
And now this is a small pipeline, right? These are multiple micro-instructions that we're doing one after the other, and this is clocked with the CPU cycles: basically one CPU cycle, one micro-instruction. So in the first cycle,
we're getting the instruction fetch,
then second cycle, we're decoding the instruction,
we're executing the instruction,
we're writing back the results.
So basically, I mean, in my example here,
four cycles to do an individual instruction.
Now, if we're just doing this non-pipelined
or one after the other, then this
means the next instruction will basically start at time four. So if we're starting at zero at time
four and we're actually going to be slow, right? So we're basically waiting and different kind of
units on the core won't have anything to do while we're waiting.
And we need multiple clock cycles for a single instruction.
We always need multiple clock cycles for a single instruction,
but we can parallelize here.
We can use pipeline parallelism.
But still, I mean, if we do it like this, then what we can say is we have basically 0.25 instructions per cycle, or IPC, or four
cycles per instruction, CPI.
So I mean, these are two measures of the same thing.
One is basically one divided by the other.
And of course, we want something that's higher than 0.25.
So how many instructions can we do per individual cycle, right?
So typically, you would see something that's one or even above one.
And this is done by actually basically doing these stages,
the individual pipeline stages in parallel, right?
So we have the instruction fetch,
and this is different pieces of hardware
that are running here or that are working here.
And this is also how the pipeline is set up.
So it's not set up that the same piece of hardware
has to do multiple things one after the other.
It's really different parts of the hardware,
and we're just clocking this
in order to break it down further. So now
what we can do is basically we can do pipeline parallelism here, right? So while we're decoding
the first instruction, we can actually fetch the next instruction. And while we're fetching or
decoding the second instruction and executing the first instruction, we can fetch the third
instruction. And this really depends on the length or the depth of the pipeline, how parallel we can do this. And ideally, we can basically just do,
like, continuously just fill up the pipeline, such that every step is always filled, like every
functional unit always has something to do. And with this, we get this instructions per cycle of one
or cycles per instruction of one.
So we basically fully utilize our instruction or our system all the time.
And ideally, I mean, this is basic pipeline parallelism
that you've hopefully already heard.
If we have a K stage pipeline then our throughput will be improved or the performance will be
improved by a factor of K. Of course the slowest part in there will
determine the clock frequency or will determine also the throughput of the
pipeline. So let's say either we're clocking the CPU in a way
that each of these things can actually be executed
on a single cycle,
then probably the execution is most expensive, right?
So then whatever the slowest execution is
will basically determine how fast our clock cycle can run.
Or if we're using multiple cycles in one of these stages,
then basically the other stages will have to wait.
So that's basically how we determine the throughput in here.
But the more stages we have, with a single cycle per individual stage, the more the latency of a single instruction goes up, right?
So, for each individual operation, we'll actually need multiple cycles to finish them,
but we can do more operations in parallel.
So, that basically means a single operation won't be done in a single clock cycle,
but we will need a couple of clock cycles to
actually run a single operation. But this part here will basically execute completely in parallel, if all goes well. Okay, questions so far? As I said, a typical CPU will have longer pipelines, and we'll get to this in a bit more detail.
And now we're going to do a quick 4-minute break.
And then I'm going to talk to you about the pipeline hazards.
So, when does the pipelining not work well?
Okay, so let's get started again.
And talk about pipelining hazards, right?
So, pipelines are these.
And now the question is, well, ideally, this always
works nicely.
But in practice, we might have some problems with this.
And this is what we call pipeline hazards.
So whenever the pipeline doesn't really fully work out,
so the pipeline parallelism is somehow broken,
this happens due to some kind of hazards. And there's structural
hazards, so meaning we have something or some kind of operations that need the same functional unit,
right? So we have some kind of resource conflict on the CPU. And this is, for example,
so the instruction fetch and the memory access.
So they both need to get something from the caches.
They both need to get data in form
of instructions or actual data.
So that's the same kind of unit.
If there are not enough of these units on the CPU, so we're basically constrained by this unit on the CPU, then all of a sudden we have a structural hazard and we will not get the perfect pipelining.
Then we have a data hazard.
This is basically whenever we have instructions that somehow
need the same kind of data or data
from previous
instructions, right?
So then all of a sudden we're in a conflict because of the pipelining, right?
Especially if the pipelines are very long.
If the next instruction works with the same data, it might already be in execution while the previous instruction hasn't written the data yet. So because of this overlapping, we might
get into a problem. Then control hazards, so whenever we're branching, I mean without any
branch prediction, we will have to wait for the branching to actually be,
like for the branch to be evaluated to continue in our instructions.
So that's why there is branch prediction, because otherwise we basically,
every branch, we first have to completely execute the branching condition
in order then to continue with whatever happens next. Data access is also a problem: whenever we have to go all the way to memory, the CPU will have to wait and the pipeline is broken to some degree; or if instructions are not loaded yet, et cetera, that might also stall things.
And these hazards, so these are all
hazards, these are things that can happen, and if they happen, they will stall the pipeline, or
that's what we call a pipeline stall. So all of a sudden, our pipeline is in trouble, right? So
we cannot continue, we cannot further fill our pipeline. So let's look at a structural hazard.
And this is basically whenever the CPU cannot support
all possible combinations of instructions.
So if the CPU only has a single memory access unit,
and the instruction fetch and memory access
are scheduled in the same cycle, then, well, we
have two memory accesses at the same time.
One has to wait.
So then basically, we see this here.
We have an instruction, like our memory access
here in the first instruction.
The second instruction is fine, because we
do this while we're doing the instruction decode.
But here, we want to fetch something while the memory unit is blocked by the execution of this other instruction, so we need some kind of stall, and we're basically losing one cycle before the next instruction's pipeline can get started.
And this can also happen because individual functional units
are not fully pipelined.
So this is like some functional units
are actually quite complex.
They need to do a lot of operation,
and they need multiple cycles to execute. So that basically means we basically have to wait
until we send something new into this functional unit.
So this is especially true for what we're going to see next time.
So this vector processing.
So these operations are quite complex.
They might use multiple cycles to execute.
Then a new instruction basically won't be able to use the same unit because we just have to wait until this unit is done.
Until the current operation is done.
And this is basically where the CPU manufacturers need to make trade-offs, right? So we basically, more functional units means more parallelism
or more complex pipelines breaking down things further
means longer pipelines, means better parallelism there maybe,
but also means more cost because it's more complex to maintain this,
more circuitry is needed to actually properly do this.
So, well, in the end, then often we need to, or they need to make a trade-off and say,
well, in this case, if you're doing this, it will take a couple of cycles, won't be
a single execution and your pipeline will suffer to some extent.
And for this, it's basically something that the hardware needs to handle, right?
We might come up with code that specifically targets these things, but typically, in a regular workload, if that happens, let's hope the CPU is properly designed such that it can mitigate these problems and the scheduling works well, so that some reordering et cetera happens and it still works well.
Yeah, so then we have these data hazards, and they can happen if the pipelining changes the relative timing of read and write accesses.
In general, from a single-threaded point of view (we're still talking about a single core here, no multi-threading), I'm expecting one operation to be executed after the other, logically.
Practically now all of a sudden there's some overlap in here.
We're doing some parts of the operation,
the previous operation still while we're doing the next operation already.
And this is where things can somehow get awkward.
Take, for example, a read after a write.
So in the next instruction I'm reading the register that is written by the previous instruction.
Since the write-back is done at the very end of the pipeline, my next instruction's pipeline might want to access this register before it's actually written back.
So then we have this data hazard.
Or we have a write after read.
So what could happen all of a sudden is that we're writing a register before it's actually read by a previous operation, right?
And this happens only if we're reordering instructions.
So typically this should not be happening
if we're still in the same order,
because the writing will happen all the way
at the end of the pipeline.
But since CPUs can reorder instructions to some degree,
this might all of a sudden be a problem for us,
that we have an instruction that writes something
that the previous instruction shouldn't have read yet.
So then we need to somehow stall this and wait until the read has occurred.
And write after write, of course, same problem, right? So we're writing something that a previous instruction
should actually write, so we're overwriting,
or the previous instruction will overwrite
what we're writing in the later instruction
if we have some reordering in here.
Again, only happens in reordering,
but the read after write actually can happen even if
we don't do any reordering of instructions. So let's look at an example. Here we have an addition, a subtraction, an AND, and an OR, and the way this works in the assembly here is that we have two arguments, R2 and R3, and we're writing the result into register R1.
So we're reading R2 and R3, doing the addition, and writing the result back into R1. In the second instruction, we're reading R1 and R5 and writing to R4, and so on and so forth.
And here you can see that the subtraction reads register R1 before it was written back by the addition. This read-after-write dependency can be mitigated.
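In C-like pseudocode (a sketch mirroring the slide's assembly example; the rN variables stand for the registers), the dependency looks like this:

// Sketch: r1 is produced by the addition and consumed immediately afterwards,
// a read-after-write dependency on r1 in every following instruction.
void example(int r2, int r3, int r5, int r7, int r9,
             int& r4, int& r6, int& r8) {
    int r1 = r2 + r3;   // ADD  R1, R2, R3
    r4 = r1 - r5;       // SUB  R4, R1, R5  -- reads R1 right after it is produced (RAW)
    r6 = r1 & r7;       // AND  R6, R1, R7  -- also depends on R1
    r8 = r1 | r9;       // OR   R8, R1, R9
}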
So there is an option that we actually have
some kind of forwarding or bypassing hardware.
So basically, in the hardware, we can say,
well, the next instruction will use the same register in here.
And the computation will actually happen later, right?
But the instruction fetch, or rather the register read, happens before the actual execution, so we somehow need to make sure this works one way or the other.
Here we can see that the instruction decode step of the subtraction will read the register which should have been written here.
So now there is a bit of hardware in the CPU
in order to make sure that this result here
will also be available here.
We only need it actually in here, right?
So the actual calculation will happen in here.
But if we're just reading whatever is in the register right now, this will be wrong, because at this stage the register still holds something that might be from somewhere else; it's not what we've written here.
The same is true here: we don't know exactly whether it might already be written or not, et cetera.
So there is some additional hardware that will save us from this, but technically this might be a problem.
Then basically here we have a similar setup
where we're doing a load, right?
So we're loading something from memory.
So we have the address R2 and we're loading it into R1.
And then we're reading or we're subtracting
whatever was in there and write the results somewhere else.
So here, technically, the instructions,
so this execution here is a load, so this will take time, right?
This is not a single cycle, essentially.
If we're doing the instruction decode here and we're reading from the register,
also forwarding hardware won't help us because this is not available in time.
So here we actually need to stall, right? So we basically need to wait until this is available
and nothing can happen until this is available. So the pipeline will basically just be stalled
until we have the data available. Then we can basically start our decode and only then the next
instructions can be fetched and so on and so forth.
So here, scheduling might help us.
And this is why the CPU all of a sudden then starts to reorder the instructions, because
basically getting the data from memory will take time, so maybe I can do something else
in between that's useful.
And all of a sudden we get into this trouble
that we do something like a write after write,
which also might be what the CPU needs to be aware of, right?
If we're storing or if we're reordering here.
Okay.
But anyway, at some point, if our execution constantly has to go to memory, this will basically empty or stall our pipeline and slow things down.
So there we have to think about whether we can do something, for example along the lines of the sketch below.
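One software-side trick, just as a sketch and not from the slides: breaking long dependency chains, for example by summing with several independent accumulators, so the pipeline always has independent work in flight.

#include <cstddef>

// Every addition depends on the previous one: the pipeline waits on the chain.
double sum_chained(const double* v, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Four independent chains: the CPU can keep several additions in flight.
// (Floating-point rounding can differ slightly from the chained version.)
double sum_unrolled(const double* v, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];      // the four accumulators do not depend on each other,
        s1 += v[i + 1];  // so their additions can overlap in the pipeline
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += v[i];  // leftover elements
    return s;
}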
And then finally, we have the control hazards and these come from branches and these are actually
more severe than data hazards because essentially we don't know what's going to happen next.
In a branch we can predict what will happen, but technically we have to wait and see what actually happens. And the very simple way of dealing with this kind of control hazard, if we made a wrong prediction, is that we flush the pipeline, meaning we just empty the pipeline, start from scratch, and redo the instruction fetch. And this costs in the order of the length of the pipeline.
So even while we're executing the branch, we don't know if the next instruction we fetched is the right one; we're just reading the next instruction at the program counter.
Only once we know what the correct target is can we continue to compute.
So we basically wait and then go with the correct target instruction, or rather, we can only start once we know what the target instruction is.
So this might really just completely empty our pipeline until we know what exactly is going on next.
So another question is, I mean, some of these
we know we cannot really do anything.
Some of these we might be able to do something about.
Another question is, how can we mitigate some of these pipeline hazards?
And some stuff is actually done in hardware, right?
So some things the hardware vendors try to improve in order to fix these problems.
And so one thing is that we actually have superscalar hardware.
Meaning it's not just a single pipeline that we're constantly filling.
I mean, even if I draw multiple pipelines here, what I drew so far is actually a single pipeline with stages into which we keep feeding new instructions.
We have the instruction fetch, instruction decode, execute, and write-back pipeline steps, and there are just these four steps; we're just doing different things in them.
So it's not four individual pipelines, right?
So, but what we can do,
we can actually do multiple pipelines, right?
So if we know we have different kind of like,
we wanna do multiple things in parallel
or things take longer time,
well, let's just replicate some of these things.
So we can have multiple pipelines
and these actually can have different functionality.
So we use them for different parts
or we can have multiple functional units that we're using
and we're using them separately
or we're using them in different pipelines.
So just in order to make sure that we do the right things. And this, for example,
you will always have, right? So we will always have like a arithmetical logical unit, a store
and a load and floating point operation units. And now if we have multiple pipelines, we can
also use them separately with different kinds of instructions in parallel. And with this, if we do this right, we actually can get
instructions per cycle larger than one.
So all of a sudden, we can do more parallelism even
in a single core, and we can execute more operations
or more instructions than we would usually do
within a single clock cycle.
I mean, one thing that we saw, these control hazards,
this comes from branching.
And here, the CPUs try to predict these branches.
And the CPUs, they have some hardware just to do this.
There's a question.
Sorry, just when we have the superscalar architecture
and we have four instructions at once,
wouldn't that affect the instruction decoding? Would that become more complex? Do we then use four different hardware units?
So the question is, wouldn't we need more or different instruction decoding if we have different hardware units or different functional units?
Technically, we just need to know on which unit the instruction needs to be executed. So in the instruction decode, the micro-operation looks the same; it just needs to be routed to the right functional unit.
And I mean, let me also rephrase.
So this is just a schematic.
So how this is actually done in hardware,
you might actually have separate pipelines.
So where these pipelines start, which parts are replicated,
this really depends on the core architecture.
So how an individual core then will be.
I have some schematics for some different cores later on.
So there you can basically at least guess to some degree
how this is done.
You're welcome.
And we'll probably not be able to get there today.
So then we'll do it next week.
OK.
OK, so branching, we already heard, is kind of tricky because it might flush our pipelines or at least stall everything.
So let's do something smarter.
And the CPU tries to speculate what
would be the right branch.
And guessing which one is the right branch,
it will just start execution.
And the prediction needs to be done early.
In the instruction decode this is already too late, because we need to know where to go; this needs to be done in the instruction fetch.
So this, again, needs to be done in hardware and fairly fast.
So there's a branch target buffer or branch target cache.
It's basically a simple lookup table that says for this program counter, so for this
wherever we are in the program, what's the predicted branch and has it been taken or not.
Right, so initially we say we're always taking the true branch, or we take a random branch, and then we update whether it was taken or not.
So it's really just a binary decision.
And if it's been taken, then yes, the prediction was good.
We're going to keep this prediction.
If it's not been taken, then the prediction was bad.
So we'll update the prediction and use the other.
And maybe there's a bit more magic to it,
but that's basically how it will work.
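Just as a toy illustration of that idea (real predictors are more sophisticated and their details are not public), a one-bit "remember the last outcome" table keyed by the program counter might look like this:

#include <array>
#include <cstddef>
#include <cstdint>

// Toy 1-bit branch predictor: remember the last outcome per (hashed) program counter.
// A real branch target buffer also stores the target address and uses smarter counters.
class OneBitPredictor {
    std::array<bool, 4096> taken_{};            // prediction table, indexed by a PC hash
    static std::size_t index(std::uint64_t pc) { return (pc >> 2) % 4096; }
public:
    bool predict(std::uint64_t pc) const { return taken_[index(pc)]; }
    void update(std::uint64_t pc, bool was_taken) { taken_[index(pc)] = was_taken; }
};

// Usage sketch: predict during instruction fetch, update once the branch resolves.
//   OneBitPredictor p;
//   bool guess = p.predict(pc);
//   ... later, when the branch condition has been evaluated ...
//   p.update(pc, actual_outcome);   // mispredicted if guess != actual_outcome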
And so the branch target buffer will be consulted
during the instruction fetch,
and then we'll see where do we need to go.
And if there's an entry, then we follow the prediction.
If not, then we do some kind of static prediction.
As I said, true or false or random.
And then check after the branching if that was correct or not.
And it's, of course, like many of these things, it's secret how it's actually done in the hardware.
But you can guess something along these lines.
And we can try it out and see: if our branch outcomes are random, then the prediction is bad.
I mean, there's no magic behind that, right?
So it cannot predict the future properly.
It can only see whatever it has learned already.
So if we've seen branches have been used many times in this direction,
or last branch was this, then it will just try this.
So, this is basically branching, right?
What we can also do from our point of view: the branch is basically the problem.
If we're using a branch, then we have these different code paths, and the CPU will do a prediction.
If it's right, it's good.
If it's wrong, it's bad, because then we'll have this control hazard and we'll flush our
pipeline. We go back, do a new instruction fetch, et cetera. We get a bad instruction
per cycle count. What we can do in our program still is we can mitigate or we can avoid branching.
And this is basically in data processing.
This is essentially fairly easy.
So we can control, turn the control flow into data flow.
So again, let's look at our lineitem query. It's basically the same query that we had before, just with a slightly different predicate. We're doing the count with a different predicate; it could be the same predicate, it doesn't really matter. And if we directly translate this into C or C++ code, then this is a branch: if our predicate matches, then we increase the count. But we don't have to do the branch. Anybody know how this works?
You could just add the comparison?
Exactly.
We can add the comparison.
So this is basically what we see if we execute this. Somebody did this nice experiment where the selectivity is about 50% and the data is random. If half of the time the branch prediction is wrong, then this takes much longer than if the branch prediction works well.
In the cases where the branch is always the same, the branch prediction is good; we never get into these control hazards and never have to flush the pipeline, so this actually works quite quickly. As soon as we have 50% selectivity, it's not working well anymore, because we get a lot of overhead from the branch misses. And what we can do, as you correctly said, is just add the comparison. Then there is no branch; we're just checking whether the predicate matches and adding the result of that check to our count. In the upper code, the addition is only executed if the predicate matches. In this version, the addition is always executed, but it adds one if the predicate matches and zero if it doesn't. So we do a bit more work in the cases where it doesn't match, because we also do the addition, which we don't do in the upper version, but we never have to branch. And that's what we see in performance: in the cases where the branch prediction works well, we do a bit more work, because we always have to calculate. But in the cases where the branch prediction doesn't work, because half of the branch predictions are wrong, we lose a lot less time, so there we're actually quite a bit faster.
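As a small illustration, a hedged sketch of the two variants in C++ might look like this; the column name, types, and threshold are assumptions, not the actual lecture code.

```cpp
#include <cstdint>
#include <vector>

// Branched version: the addition only happens if the predicate matches,
// so the CPU has to predict the if for every tuple.
int64_t count_branched(const std::vector<int32_t>& quantity, int32_t threshold) {
  int64_t count = 0;
  for (int32_t q : quantity) {
    if (q < threshold) {
      ++count;
    }
  }
  return count;
}

// Predicated version: the addition always happens, but it adds 1 when the
// predicate matches and 0 when it does not -- control flow becomes data flow.
int64_t count_predicated(const std::vector<int32_t>& quantity, int32_t threshold) {
  int64_t count = 0;
  for (int32_t q : quantity) {
    count += (q < threshold);  // the comparison evaluates to 0 or 1
  }
  return count;
}
```

The predicated loop does an addition for every element, but it contains no data-dependent branch that could be mispredicted.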
And then there is also hardware predication: execute both branches and discard one result. So the hardware can do this for us to some degree as well. To be honest, the compiler will also do this for you. In simple cases, if you compile with optimizations, the compiler already knows these kinds of tricks, so you don't have to be smart about it. If you write this code or this code, the assembly is the same; the compiler will already have seen this, and you get the same performance.
Yes?
Question about the predicated version: I do the comparison and then I get a Boolean, which in most cases is a 0 or a 1. But in some cases, I think, the C standard doesn't demand that a true comparison be one. So am I at risk that those comparisons could, for example, return 1,000?
So the question is whether, in the C standard, this might also evaluate to something else. There, to be honest, I would have to look it up. What could happen? I mean, this is basically one of the standard examples, so whatever the compiler does is probably correct. You could go check and see what the compiler makes out of this; it probably puts some safeguards around it to make sure that the result will always be correct.
Typically, you also have multiple predicates, and then you can do the same thing: predication with multiple predicates. If you use a classical logical and, that will be translated into many nested if statements. But you can also use a bitwise and, which will then be evaluated as a single if statement. Or, again, we can do this similarly with a predicated version without any branches.
If we do this, we'll see that there's again somewhat of a trade-off. The branched version is good if the branch prediction always works well: if the branch prediction is perfect and our predicates basically always evaluate to false, meaning we don't have to do additional work, then it's faster, because we don't have to do the addition. If the branching is mostly wrong, or wrong half of the time, then the branching will be bad, and the predicated version will be faster. The bitwise-and version is basically as fast as the predicated version in the cases where we're mostly evaluating to false. But if we're evaluating to true a lot, then it will be just as slow as the branched version again.
Right, okay. But the predicated version basically always needs to do this memory write, and that's actually where its overhead comes from. If we don't actually have to do the work, the predicated version still has this little bit of overhead.
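To illustrate the three options for multiple predicates, here is a small hedged sketch; the struct, column names, and constants are made up for illustration and are not the lecture's actual code.

```cpp
#include <cstdint>
#include <vector>

struct Row {
  int32_t quantity;
  int32_t discount;
};

// Logical AND: short-circuits, typically compiled into nested branches.
int64_t count_logical_and(const std::vector<Row>& rows) {
  int64_t count = 0;
  for (const Row& r : rows) {
    if (r.quantity < 24 && r.discount > 5) {
      ++count;
    }
  }
  return count;
}

// Bitwise AND on the comparison results: both predicates are always
// evaluated, but only a single branch remains.
int64_t count_bitwise_and(const std::vector<Row>& rows) {
  int64_t count = 0;
  for (const Row& r : rows) {
    if ((r.quantity < 24) & (r.discount > 5)) {
      ++count;
    }
  }
  return count;
}

// Fully predicated: no branch at all, the combined 0/1 result is added.
int64_t count_predicated_and(const std::vector<Row>& rows) {
  int64_t count = 0;
  for (const Row& r : rows) {
    count += (r.quantity < 24) & (r.discount > 5);
  }
  return count;
}
```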
OK, let me see.
Yeah, I'm going to get to two more things.
So the other thing the CPU can do is out-of-order processing. Instructions can be dynamically scheduled: there are reservation stations where the instructions go, and they will be executed whenever all hazards are cleared. Registers can be renamed in order to reduce hazards, so some things can run in parallel by not overwriting the same registers. And there are different reservation stations for the different kinds of units. Within these, to fill the pipelines and get some speedup, the CPU might reorder operations. And to some degree, we can actually make use of this ourselves.
Because here, in our count, we have a data hazard: we're always writing to the same count, which in the end is basically the same register. The new addition needs the result of the previous addition in order to continue. And we can change this ourselves by using two cursors. Having two cursors means we have two operations, we're using essentially two registers, two additions that can be reordered, because there's no dependence between the calculations. So we're basically going through the data from two different positions, doing two additions, and this can be done in parallel; whatever the CPU wants to do here, it can do. In the end, we just have to add up the two partial counts. That means there are some additional gains for the CPU here, and hypothetically we could even see a bit of additional performance improvement, because the CPU doesn't have the kind of data stalls that we have with the previous version.
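A minimal sketch of the two-cursor idea, reusing the predicated count from before and splitting it across two accumulators; names and types are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two cursors, two independent accumulators: count0 and count1 have no
// dependency on each other, so the two addition chains can be overlapped
// by the out-of-order engine.
int64_t count_two_cursors(const std::vector<int32_t>& data, int32_t threshold) {
  int64_t count0 = 0;
  int64_t count1 = 0;
  std::size_t i = 0;
  for (; i + 1 < data.size(); i += 2) {
    count0 += (data[i] < threshold);
    count1 += (data[i + 1] < threshold);
  }
  if (i < data.size()) {    // odd number of elements: handle the last one
    count0 += (data[i] < threshold);
  }
  return count0 + count1;   // combine the partial counts at the end
}
```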
But as a last example, I will actually show this to you. So let's look at some code here before we end today.
So I'm doing basically the very same thing. Let me see if I can find my cursor here. I'm just using a large data array and filling it with random data, as I usually do, and in the end I'm going to do the sum over the array. Here I'm basically checking against a certain threshold: the values are random numbers within integer max, and I'm using half of integer max as my branching condition. If a value is lower, I'm going to add it; if it's larger, I'm not going to add it. So 50% of the time, most likely, if everything is well randomized, this branch will not be hit, and 50% of the time it is true. That's basically what I'm doing here: if the condition holds, I add, and then I output the result.
I'm just going to go through the array and check; this is very similar to what we did earlier. Then a predicated version of this looks just like this: I'm doing the comparison and adding the comparison to my count, which should, of course, give us the same result. And finally, we have this out-of-order version, where I'm going through the array with two different cursors, stepping by two, so I remove this data hazard on the counter to some extent and there's some additional reordering that the compiler can do. And then I'm just outputting everything, including the timings; as always, this is just basically getting some timings here. I'm going to show you this on the command line, and I'm first compiling it without optimization flags, and that's important. So this is running it without optimization flags.
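The demo code itself isn't reproduced in the transcript, but a micro-benchmark along these lines, compiled without optimization flags, would match the description; the array size, the random seed, and whether the demo sums or counts the qualifying values are assumptions on my part.

```cpp
#include <chrono>
#include <climits>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
  // Large array of random non-negative ints below INT_MAX (size is an assumption).
  std::vector<int32_t> data(100'000'000);
  std::mt19937 gen(42);
  std::uniform_int_distribution<int32_t> dist(0, INT_MAX - 1);
  for (auto& v : data) v = dist(gen);

  const int32_t threshold = INT_MAX / 2;  // ~50% of the values qualify

  // Branched version, timed with std::chrono.
  auto start = std::chrono::steady_clock::now();
  int64_t sum = 0;
  for (int32_t v : data) {
    if (v < threshold) sum += v;          // mispredicted roughly half the time
  }
  auto end = std::chrono::steady_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "branched: " << sum << " in " << ms.count() << " ms\n";

  // The predicated and two-cursor variants would be timed the same way.
  return 0;
}
```

Compiling the same file with -O3 instead would, as shown next, let the compiler predicate and vectorize the loops itself.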
We see that the branched version takes 91 or 92 milliseconds.
The predicated version takes 15 milliseconds, and the out-of-order version, so basically where the reordering works, takes 13 milliseconds. So we can see some difference here.
What if you compile with O3?
I'm going to do this next; that's exactly what I wanted to show you. So let me do the same thing with O3. And all of the nice benefits are gone, because the compiler actually does a lot of the work for us here. We can see it's about 800 microseconds for each of these executions; they're all basically the same. I mean, there's a slight difference, but it's very slight. And the reason for this, which you can check yourself by analyzing the generated code, is that the compiler sees there's an option for predication here and vectorizes this code. The compiler will use additional functional units, because the loop is so simple; the individual calculations don't need to be done one by one, so it can get some parallelism in there, even within a single functional unit, using these vectorized operations that I'll show you next time.
Okay. Further questions regarding this? Okay. So this is also one of the examples where premature optimization doesn't really help you that much, because the compiler is already quite smart, so you can complicate your code a lot without achieving much benefit. Yes?
How many cursors do you need, or how many should you not use? Because in an extreme case, you could just access them all one by one, right? So would you still iterate?
You mean you don't iterate at all, and you just build so many cursors that you cover everything in the count and can access it all at once?
Yes.
It wouldn't help much, right?
Yes.
So the question is, how many cursors would you want? I mean, it's really just about whether you can mitigate some of these data hazards, so whether we get some additional benefit from reordering. We see slight benefits. I haven't tried it with four, whether that would give us an additional benefit here. As soon as you do this, you're sort of already vectorizing by hand: we're working in blocks, and the CPU or the compiler will see this and already optimize there. But then all of a sudden we need to deal with corner cases at the border of the array. So I would say we'll probably not see any further improvements after four or so, because, think about the pipeline: the pipeline is not endlessly long, so reordering stops helping pretty soon. With four, you probably don't have any data hazards anymore, because it's only a few micro-operations that would actually result in these data hazards. But to be honest, I haven't tried it. I don't think I have this code online yet; I can put it online, and you can change it to four, without optimizations, and see if you get some further improvement here.
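If you want to try that, a hedged sketch of a four-cursor variant could look like the following; this is just an illustration under the same assumptions as before (summing the qualifying values), not the lecture's actual code, and the remainder loop handles the array border mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Four independent accumulators; whether this actually beats two cursors
// without optimizations is exactly the experiment suggested above.
int64_t sum_four_cursors(const std::vector<int32_t>& data, int32_t threshold) {
  int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  std::size_t i = 0;
  for (; i + 3 < data.size(); i += 4) {
    s0 += (data[i]     < threshold) ? data[i]     : 0;
    s1 += (data[i + 1] < threshold) ? data[i + 1] : 0;
    s2 += (data[i + 2] < threshold) ? data[i + 2] : 0;
    s3 += (data[i + 3] < threshold) ? data[i + 3] : 0;
  }
  for (; i < data.size(); ++i) {  // corner case: the last 0-3 elements
    s0 += (data[i] < threshold) ? data[i] : 0;
  }
  return s0 + s1 + s2 + s3;
}
```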
OK.
No further questions.
We're already a bit over time.
Then thank you very much.
I'll see you next week.
We'll talk about prefetching and the core architecture,
and then fully dive into vectorization,
where we can get these kinds of performance improvements.
Thank you.