Algorithms + Data Structures = Programs - Episode 51: Efficiency vs Speed
Episode Date: November 12, 2021
Date Recorded: 2021-11-05
Date Released: 2021-11-12

In this episode, Bryce and Conor talk about the difference between efficiency and speed.

Links:
ADSP Episode 47: Combinatory Logic!
C++ std::minmax_element
C++ std::inclusive_scan
Loop fission and fusion
C++ std::memcpy
Cache prefetching
Register pressure

Intro Song Info:
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
So loop fusion is when you take multiple loops and you combine them into a single loop.
And loop fission is when you take a single loop and split it into multiple loops
that iterate over the same index range as the original loop. Welcome to ADSP: The Podcast, episode 51, recorded on November 5th, 2021. My name is Conor,
and today with my co-host Bryce, we talk about the difference between efficiency and speed.
Now, listener, strap in for Bryce's explanation of the difference between
efficiency and speed. And I'll give a little 30-second intro. A couple of episodes ago,
I think it was the combinatory logic episode, I was saying how, if you fuse
two algorithms into one using combinators, you're going to end up with faster code, like doing a minimum element
and a maximum element separately versus min-max element. Ben Dean very
astutely pointed out that you're never actually going to match that, because
min-max element does three over two n comparisons, whereas a fused min-plus-max is still going to
do two n.
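For reference, here is a minimal sketch of Ben's point (illustrative code, not from the episode; assumes a non-empty input): a naive fused min-and-max does two comparisons per element, roughly 2n in total, while the pairwise trick behind std::minmax_element gets by with about 3n/2.

```cpp
#include <utility>
#include <vector>

// Naive fused min+max: two comparisons per element, ~2n total.
// Assumes v is non-empty.
std::pair<int, int> fused_min_max(const std::vector<int>& v) {
    int lo = v[0], hi = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] < lo) lo = v[i];   // comparison 1
        if (v[i] > hi) hi = v[i];   // comparison 2
    }
    return {lo, hi};
}

// Pairwise min+max (the idea behind std::minmax_element): compare the two
// new elements to each other first, then only the smaller against lo and
// the larger against hi -- 3 comparisons per 2 elements, ~3n/2 total.
std::pair<int, int> pairwise_min_max(const std::vector<int>& v) {
    int lo = v[0], hi = v[0];
    std::size_t i = 1;
    for (; i + 1 < v.size(); i += 2) {
        int a = v[i], b = v[i + 1];
        if (b < a) std::swap(a, b);   // comparison 1
        if (a < lo) lo = a;           // comparison 2
        if (b > hi) hi = b;           // comparison 3
    }
    if (i < v.size()) {               // odd leftover element
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return {lo, hi};
}
```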
But anyways, that's beside the point. Thank you, Ben, for clarifying that. Bryce pointed out that,
well, it'll be twice as efficient, but not necessarily twice as fast. And listening back to that episode,
I don't think I fully grokked what
Bryce meant by that. So now, enter Bryce with his elaboration on efficiency versus speed.
Yeah.
So I like to think of this as, I'm not surprised that we are coming at this from different angles.
Because I think you are much more of a theorist, a pure computer scientist.
Whereas I am much more of a software engineer. And maybe that's even like
too haughty a title; more like a software plumber. You know, I sort of come at it from the
ground up, like how does this thing actually run and work in production? Whereas you're much
more the person to think about, you know, the whiteboard solution. And that's not meant as a dig. It's a little bit
of a dig. A little bit of a dig. But so let's start off with an argument that I know you will
find compelling because I've made it before,
which is the question of efficiency when it comes to parallelization.
So we've told this story a few times on the podcast, but once on a drive from the south San Francisco Bay up to San Francisco,
I explained to Conor how you could parallelize
an inclusive scan. And the reason that Conor thought that it could not be parallelized is because
Conor was thinking about parallelizing the most efficient version of a scan algorithm. And I explained to him how you could implement a
parallel scan that is work-inefficient. That is, it's less efficient; it does more operations than
a serial scan. But it doesn't matter that it does more operations, because they can be done in parallel.
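A minimal sketch of that idea (illustrative code, not from the episode): the parallel-friendly version here is the classic Hillis-Steele formulation, written serially so the rounds are visible. It does more additions overall than the serial scan, but every addition within a round is independent and could run in parallel, so only about log2(n) rounds are needed.

```cpp
#include <utility>
#include <vector>

// Serial inclusive scan: n - 1 additions, but each step depends on the last.
std::vector<int> scan_serial(std::vector<int> v) {
    for (std::size_t i = 1; i < v.size(); ++i) v[i] += v[i - 1];
    return v;
}

// Hillis-Steele style inclusive scan (shown serially here): roughly
// n * log2(n) additions in total -- less work-efficient -- but every
// addition inside a round is independent, so each round could be done
// in parallel, and only log2(n) rounds are needed.
std::vector<int> scan_work_inefficient(std::vector<int> v) {
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        std::vector<int> next = v;              // double-buffer the round
        for (std::size_t i = stride; i < v.size(); ++i)
            next[i] = v[i] + v[i - stride];     // independent within the round
        v = std::move(next);
    }
    return v;
}
```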
So if you do n work in serial for some algorithm, and parallelizing it means that you actually end up doing two times n work, that might be perfectly okay, because if your goal is to get an answer faster, it doesn't matter that you're doing more work if that work can be done
quicker by being spread across multiple threads. So you understand that argument, that in the case of thread parallelism, being inefficient might be faster?
Correct, yes. 100% on board.
Okay. Now, that same argument extends to almost every case that we encounter in the field.
Because even when you're programming for a single thread,
the hardware that you are running on
is still actually parallel under the hood.
There is a part of your processor that loads data. There's
a part of your processor that fetches instructions, that decodes instructions. There's a part of your
processor that does adds. There's a part that does multiplies, et cetera, et cetera. And those parts can run in parallel.
And so on every modern processor,
your single-threaded code is actually being parallelized under the hood
at a level that you don't really see,
in that the actual things that need to happen
to make that serial thread, that serial set of operations that you want to do,
actually execute don't strictly happen in order;
some of them are overlapped and asynchronous.
And so then again, we get to this idea that in some cases, things that are less efficient
might actually be faster because efficiency isn't really our goal.
Our goal is time to execution.
So let's move on to this particular case of loop fusion and loop fission.
Do you know what those terms mean?
Uh, we'll pretend, for the sake of the listener, that I don't know what loop fission is and that this is the first time we're recording this.
I do know what loop fusion is, though. But tell me, what are they both, and what is the difference?
So loop fusion is when you take multiple loops and you combine them into a single loop.
And loop fission is when you take a single loop and split it into multiple loops that iterate over the same index range as the original loop,
or roughly the same range as the original loop.
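A minimal sketch of the two transformations, with hypothetical arrays and element-wise functions f and g (not from the episode):

```cpp
#include <cstddef>

// Hypothetical element-wise operations, just for illustration.
int f(int x) { return x + 1; }
int g(int x) { return x * 2; }

// Two separate loops over the same index range.
void separate(const int* a, int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) b[i] = f(a[i]);
    for (std::size_t i = 0; i < n; ++i) c[i] = g(a[i]);
}

// Loop fusion: the same work combined into a single loop.
// Loop fission is the inverse: splitting the fused loop back apart.
void fused(const int* a, int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = f(a[i]);
        c[i] = g(a[i]);
    }
}
```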
And so you said you were familiar with loop fusion prior to this conversation.
And that's not surprising, because I think that loop fusion is perhaps the more intuitive of these optimizations. If I take two things that are iterations over n elements and I combine them together, that's probably a win,
especially if the n elements that they're iterating over are the same set.
If the iterations are always accessing the same data,
then combining those two loops together
seems like it'd just be a win, right? Instead of
iterating through two n times, you're just iterating through n times.
But in practice, that's not always the case. And sometimes it's better to have loops be separate or to even take existing
loops that you have and to split them. And the primary reason why you do this is to get better locality of reference.
I'll make the bold assertion that the vast majority of code in the wild is either memory latency or memory bandwidth bound.
Which essentially means that all of your software is pretty much a fancy version of memcpy. The reason for this is that
we've gotten very good at building processors
that have a lot of compute throughput
at a fairly low latency.
The problem is getting the data
to that execution hardware.
And if you have a ton of data reuse, like, for example, a matrix multiplication, which does O(n³) operations on O(n²) data and therefore has a lot of data reuse, then yeah, you might be able to saturate
your execution hardware,
but that's not what the majority of applications look like.
The majority of applications are,
I read some sequence of data
and I do something with it
and then I go on and do something else
or something like that.
And so for most code, the determiner of your performance is going to be your memory access
patterns. And in particular, how efficiently you make use of your cache and your processor's ability to prefetch, etc., is going to have a huge impact on application performance.
I'll give you a simple example.
Let's imagine that we had two loops.
And the first loop applies a transformation to one data array.
The second loop also iterates over that same array, but it iterates over two other arrays, let's say, as well.
So the first one accesses one array,
the second one accesses three arrays,
and one of those is the same as the first one.
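A sketch of that setup, with hypothetical array names and operations (the actual arithmetic doesn't matter for the argument):

```cpp
#include <cstddef>

// Loop 1: touches only array a -- simple, streaming access that the
// hardware prefetcher handles very well.
void loop1(float* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0f + 1.0f;
}

// Loop 2: touches a again, plus two other arrays, b and c.
void loop2(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// The fused version touches each element of a only once, but every
// iteration now streams three arrays through the cache at the same time.
void loops_fused(float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = a[i] * 2.0f + 1.0f;
        c[i] = a[i] * b[i];
    }
}
```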
So you might think, okay, well, we should fuse those because then we only have to access each element of that array,
the array that both of them access, a single time, right?
Yes.
Okay.
Now, here's the problem.
When you fuse those loops, you might end up disrupting your memory access patterns. So one thing that you might end up doing is you might cause memory accesses that would
previously hit in the cache to no longer hit in the cache.
So pretty much all processors today do a pretty good job of prefetching the data that you need into the cache.
And if you combine those two loops, now we have to prefetch even more data. Because of that, it's possible that we might get cache misses where previously we would have had hits.
So maybe that first loop, which just accesses the one array,
maybe that loop was actually like really optimal and that loop, the compiler and the processor were
going to always ensure that all of the data was prefetched before each iteration. And so
it would always run out of cache and it'd be super fast. But by combining it with the second loop, now you're going to have a cache miss every time. And that's going to make your code
run slower. And of course, this is even more likely to occur if you have loops that have
completely unrelated data that you try to fuse together. Because then there's really no data reuse.
You're just increasing the pressure on the cache of how many different things are going to fit in
the cache with no real reuse between the two. But there's another thing that can happen too, which is a larger loop body can have increased register
pressure. And you should really think about registers as being like an even faster type of
cache. Let's say you've got a processor that has six registers. Well, if each loop iteration only needs six temporary variables for its operations,
then that's great.
You never have to go to stack memory.
Stack memory is always going to be slower than registers.
Hopefully, your stack will be in cache.
It usually is. But on a processor like a GPU,
the difference in memory bandwidth and latency
between registers and even the first tier of cache
is pretty bad.
Like on a GPU, spilling to stack is like really not good.
You never want to do that.
And that's why there's like a ton of registers on
GPUs so that you can avoid doing that. So by combining those two loops,
you might have caused operations that previously did not spill to the stack to have to spill to
the stack. And that will mean that there is more main memory traffic. And also that
there are more things that need to be stored in the cache. And therefore, there's less room
in the cache for the data arrays that you're iterating through.
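A contrived sketch of the register-pressure point, assuming a machine with only a handful of registers (a real compiler would handle a body this small easily; the effect shows up with much larger fused loop bodies):

```cpp
#include <cstddef>

// Split: each loop keeps only a couple of temporaries live per iteration,
// so on a small register budget everything stays in registers.
void split(const float* x, float* y, float* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t0 = x[i] * 2.0f, t1 = t0 + 1.0f;
        y[i] = t1 * t1;
    }
    for (std::size_t i = 0; i < n; ++i) {
        float u0 = x[i] - 3.0f, u1 = u0 * u0;
        z[i] = u1 + x[i];
    }
}

// Fused: all of the temporaries from both bodies are live in the same
// iteration. If that exceeds the register budget, the compiler spills
// some of them to the stack, adding memory traffic and cache pressure.
void fused(const float* x, float* y, float* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t0 = x[i] * 2.0f, t1 = t0 + 1.0f;
        float u0 = x[i] - 3.0f, u1 = u0 * u0;
        y[i] = t1 * t1;
        z[i] = u1 + x[i];
    }
}
```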
And there's another thing that can happen. As I said, modern processors are pretty good
at prefetching all your data.
And one of the ways they do this
is they look at your access patterns
and they can figure out linear and strided access patterns.
They can be like, oh yeah, for the last 10 iterations
you accessed this chunk of memory sequentially,
or you accessed this chunk of memory with some fixed stride between each access.
And so I better just assume that you're going to continue accessing that stream of memory
and I'll just fetch
where I think the next element's going to be.
But if you're accessing
a whole bunch of different
streams of memory,
the processor has a limit
to how many different streams it can track.
And so it might reach that limit
and it might no longer be able to keep track
of all of your linear access patterns and prefetch accordingly.
And then again, you'll have more cache misses.
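A sketch of the stream-tracking point, assuming for illustration a prefetcher that can only follow a small number of streams at once (the arrays and the limit are made up):

```cpp
#include <cstddef>

// Fused: one loop walks six independent streams of memory at once
// (a, b, c, d, e, and out). If the hardware prefetcher can only track
// a handful of streams, some of these accesses stop being prefetched.
void six_streams_fused(const float* a, const float* b, const float* c,
                       const float* d, const float* e, float* out,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

// Fissioned: each loop walks only three or four streams, which is much
// easier for the prefetcher to track, at the cost of a second pass over out.
void fissioned(const float* a, const float* b, const float* c,
               const float* d, const float* e, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    for (std::size_t i = 0; i < n; ++i) out[i] += c[i] + d[i] + e[i];
}
```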
Yeah, and there are a few other ways and places in which these sorts of effects can come up.
If you've got multidimensional loops,
these caching effects
can be even more pronounced,
because in a multidimensional loop
you have a lot greater potential for reuse in the inner loop.
And also, if you have something like an adjacent difference
or any sort of stenciled operation
where each iteration, you don't just access one element, but you access, you know, three
or four or five different consecutive elements, like a window of elements, you know, if you
fuse a bunch of loops together, then again, that's increasing pressure on the cache.
And then it's possible that you're accessing these five consecutive elements,
and maybe because of this loop fusion,
the last of those consecutive elements got pushed out of cache
by something else that got prefetched into cache
because of all of the stuff that you're trying to do in this one loop.
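A sketch of the stencil case, assuming a hypothetical five-element smoothing window (boundary handling omitted for brevity):

```cpp
#include <cstddef>

// A stenciled loop: each iteration reads a 5-element window of `in`.
// The recently used neighbors need to stay resident in cache so the
// next few iterations can reuse them. (Edges are ignored for brevity.)
void stencil(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 2; i + 2 < n; ++i)
        out[i] = (in[i - 2] + in[i - 1] + in[i] + in[i + 1] + in[i + 2]) / 5.0f;
}

// Fusing it with unrelated work pulls extra data through the cache in
// every iteration, which can evict parts of the stencil's window that
// would otherwise have been reused on the next few iterations.
void stencil_fused(const float* in, float* out,
                   const float* other, float* out2, std::size_t n) {
    for (std::size_t i = 2; i + 2 < n; ++i) {
        out[i]  = (in[i - 2] + in[i - 1] + in[i] + in[i + 1] + in[i + 2]) / 5.0f;
        out2[i] = other[i] * other[i];
    }
}
```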
Now, this isn't to say that loop fusion isn't ever a good optimization. I mean, in many cases,
it is. But it's not always the right thing to do. And every compiler that does loop fusion
has heuristics for when it's profitable to do so. And in some cases, a compiler will split your loops because it determined, hey, you know,
if I take this complex loop that's doing like, you know, a ton of things, and if I just split
it into like three loops that are much simpler, those three loops are going to run faster than
the complex loop because those three loops are going to have really good, really simple and easy memory access patterns.
And the processor is going to do a great job of prefetching the data for those three loops.
And there's never going to be a cache miss.
Whereas this one giant loop that I have, it's a mess.
And it's going to end up with a whole bunch of cache misses.
So that makes sense for the most part.
Although I will say your example where you had three arrays,
one loop that was going over the first one
and then another loop that was going over all three,
including the first one.
I mean, I would guess fusion in that case, in most cases,
would end up being faster, right? Because if you had said, you know, a loop over array A
and then a second loop over arrays B and C, where the index is the same so you could still fuse it,
to me that's a lot more obvious
that you could end up with cache misses,
whereas here the first loop is just a subset
of the data of the second loop.
Isn't the second loop
going to end up fetching just as much as the fused loop,
because the first loop is a subset of the data that you're fetching?
Aren't you going to have the exact same cache miss ratio, or whatever, or am I missing something?
Fusing that first loop into the second loop might mean that the first loop, let's say
that that first loop as written has great memory access patterns and is going to always
prefetch great and run very fast. Fusing it into the second loop could mean that the operations
that constituted that first loop will now cache miss, because this second loop has more complex memory access patterns; it's trying to load more stuff into cache, and some of that stuff starts to fall out of cache and isn't there when we expect it to be.
So it might make the first loop slower.
It might also complicate the second loop. It will add additional register pressure to
the second loop because there's a new operation that you have to perform. And that additional
register pressure might mean that there's spills to stack, which might mean that now the stack has to be in cache
where it previously didn't. And that means that fewer of the data elements can fit in
cache. And the combination of those two loops could also throw off, not the compiler's, but the processor's ability to prefetch the right data,
because it's a more complex access pattern. That could especially be the case if that
first loop is something like an adjacent difference
where it's stenciled,
where it needs to access consecutive elements.
If you have a loop that is accessing consecutive elements,
you need to have not only the next element in cache
but also some of the previous elements:
maybe the next one or two elements
and the last two or three elements.
And so combining those loops can hurt,
when maybe the second loop
only accesses one element from that array per iteration.
You see what I'm saying?
Right. No, yeah.
So I guess that, yeah, the register pressure stuff makes total sense
because of the increased operations.
I guess you had not said this,
but in my head, when you were explaining that example, I was thinking that you had a
single index i, and that was the only index being used for each of the arrays A, B, and C.
But if, in the first loop, like you said, you're looking at index i plus, you know, two
before and two afterwards, and then in the second loop you're only using a single index, like you said, if you're
changing the data access pattern and it becomes more complex, then I can see that. Yeah,
so I think I added criteria to your example that you'd never stated, which is what
caused the confusion. You asked me a question the other day, which was, you know, don't you have to pay for the
cost of the memory loads and stores anyway, like in both cases?
Like, you know, why does it matter?
The way that I had phrased the question, because it was related to a slightly different example,
is that you've got one case where you have two loops, so you're iterating over an array, and however many times you need to load things
into cache, you're going to have to do that twice because you've got two different loops: n times
for the first loop, n times for the second loop. And in the fused case, that potentially results
in some cache misses, and any time there's a cache miss, that's going to result in another load
into cache. So, worst case or whatever, you're going to end up with 2n
loads into cache. So in both cases you have 2n loads
into the cache. Why is one more, like... and you're saying, oh, if we split it into two
loops, that's going to be potentially faster.
And my question was like, I don't understand.
Like if we're loading things into cache the same number of times, sure, one of them is due to some cache misses.
But like why does it end up being so much more performant when you split it into two loops?
Like that didn't make sense to me.
And that brings us back to the start of our discussion where we talked about efficiency in the context of parallelism.
And the intuition here is that you are assuming that I have to wait for those memory loads and stores,
that there's some fixed cost that I always have to pay for those memory loads
and stores. But the thing is, all processors, essentially, are parallel.
There's a unit that's going to do the load and the store, and there's a unit that's going to do my computation. And they can operate asynchronously. And most
modern processors, you know, a laptop or a phone CPU or server CPU, maybe not an embedded CPU,
but most modern processors, GPUs too, will prefetch data for you. And many of them do this thing called out-of-order execution.
And what that means is that in the best case,
yes, you have to pay the cost for that load and store,
but it's hidden from you
because the processor has started that load before the data was needed.
Like, let's say that that load takes a minute. Okay, let's say it takes a minute to load from
main memory every time you do it. I don't care about that minute if the processor initiated the
load a minute before I needed it.
And if the load is happening asynchronously in the background
and the data just happens to be there when I need it,
then we're good.
And that's the power of prefetching
is the data just gets there into the cache when I need it.
And you compare that to the case of the cache miss, where I have to actually go into main
memory, and I have to wait for that operation; I have to wait for its full latency. And this doesn't just come up with data loads and stores, either.
Take division, for example, which is notoriously an expensive arithmetic operation on most processors.
Your processor these days can usually do multiple adds and subtracts and multiplies per cycle. But something like a division can take,
you know, 10, 20, 30 cycles, and could even be a lot more than that.
But if you only need to do a division every 30 cycles, then that's okay. You won't actually have to wait for it
because assuming that you can have the data ready for that division 30 cycles ahead of time,
then you can start it and you can go do 30 cycles of other work and then your division will be done
and then you're good.
Whereas if you don't have that work to overlap,
then you do have to spend the time
waiting for that operation to come back.
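A sketch of that overlap written out in source form (in practice the out-of-order hardware and the compiler's instruction scheduler do this for you; the names here are illustrative):

```cpp
// A sketch of hiding a long-latency divide behind independent work.
// The out-of-order engine (or the compiler's scheduler) does this
// overlap automatically; writing it out just makes the idea visible.
float example(float a, float b, const float* xs, int n) {
    float q = a / b;              // the long-latency divide is issued here...

    float acc = 0.0f;             // ...and while it is in flight, there are
    for (int i = 0; i < n; ++i)   // plenty of independent adds and multiplies
        acc += xs[i] * xs[i];     // to keep the execution units busy.

    return acc + q;               // the quotient is only needed here, so we
}                                 // never sat idle waiting for it.
```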
And we call this latency hiding.
And the parallelism at this level
is not something that's visible
and exposed to programmers.
You don't have to think about how you write this,
because it's something that the processor and the compiler do for you.
But I do think it's important to have the mental model of how this works
and to understand that all processors are inherently parallel,
even if you're programming to a single thread.
And that can really have a big
impact on your performance. Yeah. So basically, yeah, cache misses can mess up prefetching.
And if prefetching is messed up, it's not going to be the same performance profile.
Yeah. And I mean, usually it's more the other way around.
There are two reasons that you'd have a cache miss.
Reason number one is that the data that you needed was not prefetched, either because the processor did not figure out that it needed to prefetch it, or because the processor figured
out it needed to prefetch it but didn't figure it out early enough.
And then reason number two is the data was in the cache, but it got evicted from the cache because there's a lot of pressure on the cache because you need a lot more data than you can fit into your cache.
And there's a number of strategies that you can use in that case. Like for example, one that comes up a lot in
multidimensional loops is something called tiling, which gives you better control over
the amount of data that you need to have in cache. By controlling the size of your tile, you can
control how many elements you need to have in cache.
And then therefore you can make sure that you can fit all the things that you need in cache
and not have to spill to the next level of cache or to main memory.
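A minimal sketch of tiling, using the matrix-multiplication example from earlier (the tile size is an illustrative guess, not a recommendation, and C is assumed to be zero-initialized):

```cpp
#include <cstddef>

// Cache blocking (tiling) for C = A * B on n x n row-major matrices.
// TILE is a tunable parameter: it is chosen so that a tile of A, B, and C
// fits in cache together, so the inner loops reuse data that is already
// resident instead of refetching it. Assumes C starts zero-initialized.
constexpr std::size_t TILE = 64;   // illustrative value, not a universal best

void matmul_tiled(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t kk = 0; kk < n; kk += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                // one tile's worth of the classic i-k-j multiply
                for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (std::size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float aik = A[i * n + k];
                        for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```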
Yep, makes sense.
And more importantly, I think you nailed it, Bryce.
10 minutes.
Yeah, although I got to go get dressed and then run to my thing.
Thanks for listening.
We hope you enjoyed and have a great day.