Algorithms + Data Structures = Programs - Episode 51: Efficiency vs Speed
Episode Date: November 12, 2021
Date Recorded: 2021-11-05
Date Released: 2021-11-12

In this episode, Bryce and Conor talk about the difference between efficiency and speed.

Links:
ADSP Episode 47: Combinatory Logic!
C++ std::minmax_element
C++ std::inclusive_scan
Loop fission and fusion
C++ std::memcpy
Cache prefetching
Register pressure

Intro Song Info:
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
So loop fusion is when you take multiple loops and you combine them into a single loop.
And loop fission is when you take a single loop and split it into multiple loops
that iterate over the same index range as the original loop. Welcome to ADSP: The Podcast, episode 51, recorded on November 5th, 2021. My name is Conor,
and today with my co-host Bryce, we talk about the difference between efficiency and speed.
Now, listener, strap in for Bryce's explanation of the difference between
efficiency and speed. And I'll give a little 30-second intro. A couple of episodes ago,
I think it was the combinatory logic episode, I was saying how, if you fuse
two algorithms into one using combinators, you're going to end up with faster code, like doing a minimum element
and a maximum element separately versus min-max element. Ben Dean very
astutely pointed out that you're never actually going to match that, because
min-max element does three over two n comparisons, whereas a fused min-plus-max is still going to
do two n.
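For reference, here is a minimal sketch of Ben's point (illustrative code, not from the episode; assumes a non-empty input): a naive fused min-and-max does two comparisons per element, roughly 2n in total, while the pairwise trick behind std::minmax_element gets by with about 3n/2.

```cpp
#include <utility>
#include <vector>

// Naive fused min+max: two comparisons per element, ~2n total.
// Assumes v is non-empty.
std::pair<int, int> fused_min_max(const std::vector<int>& v) {
    int lo = v[0], hi = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] < lo) lo = v[i];   // comparison 1
        if (v[i] > hi) hi = v[i];   // comparison 2
    }
    return {lo, hi};
}

// Pairwise min+max (the idea behind std::minmax_element): compare the two
// new elements to each other first, then only the smaller against lo and
// the larger against hi -- 3 comparisons per 2 elements, ~3n/2 total.
std::pair<int, int> pairwise_min_max(const std::vector<int>& v) {
    int lo = v[0], hi = v[0];
    std::size_t i = 1;
    for (; i + 1 < v.size(); i += 2) {
        int a = v[i], b = v[i + 1];
        if (b < a) std::swap(a, b);   // comparison 1
        if (a < lo) lo = a;           // comparison 2
        if (b > hi) hi = b;           // comparison 3
    }
    if (i < v.size()) {               // odd leftover element
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return {lo, hi};
}
```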
But anyways, that's beside the point. Thank you, Ben, for clarifying that. Bryce pointed out that,
well, it'll be twice as efficient, but not necessarily twice as fast. And listening back to that episode,
I don't think I fully grokked what
Bryce meant by that. So now, enter Bryce with his elaboration on efficiency versus speed.
Yeah.
So I like to think of this as, I'm not surprised that we are coming at this from different angles.
Because I think you are much more of a theorist, a pure computer scientist.
Whereas I am much more of a software engineer. And maybe that's even like
too haughty a title; more like a software plumber. You know, I sort of come at it from the
ground up, like how does this thing actually run and work in production? Whereas you're much
more the person to think about, you know, the whiteboard solution. And that's not meant as a dig. It's a little bit
of a dig. A little bit of a dig. But so let's start off with an argument that I know you will
find compelling because I've made it before,
which is the question of efficiency when it comes to parallelization.
So we've told this story a few times on the podcast, but once on a drive from the south San Francisco Bay up to San Francisco,
I explained to Conor how you could parallelize
an inclusive scan. And the reason that Conor thought that it could not be parallelized is because
Conor was thinking about parallelizing the most efficient version of a scan algorithm. And I explained to him how you could implement a
parallel scan that is work-inefficient. That is, it's less efficient; it does more operations than
a serial scan. But it doesn't matter that it does more operations, because they can be done in parallel.
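A minimal sketch of that idea (illustrative code, not from the episode): the parallel-friendly version here is the classic Hillis-Steele formulation, written serially so the rounds are visible. It does more additions overall than the serial scan, but every addition within a round is independent and could run in parallel, so only about log2(n) rounds are needed.

```cpp
#include <utility>
#include <vector>

// Serial inclusive scan: n - 1 additions, but each step depends on the last.
std::vector<int> scan_serial(std::vector<int> v) {
    for (std::size_t i = 1; i < v.size(); ++i) v[i] += v[i - 1];
    return v;
}

// Hillis-Steele style inclusive scan (shown serially here): roughly
// n * log2(n) additions in total -- less work-efficient -- but every
// addition inside a round is independent, so each round could be done
// in parallel, and only log2(n) rounds are needed.
std::vector<int> scan_work_inefficient(std::vector<int> v) {
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        std::vector<int> next = v;              // double-buffer the round
        for (std::size_t i = stride; i < v.size(); ++i)
            next[i] = v[i] + v[i - stride];     // independent within the round
        v = std::move(next);
    }
    return v;
}
```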
So if you do n work in serial for some algorithm, and parallelizing it means that you actually end up doing two times n work, that might be perfectly okay, because if your goal is to get an answer faster, it doesn't matter that you're doing more work if that work can be done
quicker by being spread across multiple threads. So you understand that argument, that in the case of thread parallelism, being inefficient might be faster?
Correct, yes. 100% on board.
Okay. Now, that same argument extends to almost every case that we encounter in the field.
Because even when you're programming for a single thread,
the hardware that you are running on
is still actually parallel under the hood.
There is a part of your processor that loads data. There's
a part of your processor that fetches instructions, that decodes instructions. There's a part of your
processor that does adds. There's a part that does multiplies, et cetera, et cetera. And those parts can run in parallel.
And so on every modern processor,
your single-threaded code is actually being parallelized under the hood
at a level that you don't really see,
in that the actual things that need to happen
to make that serial thread, that serial set of operations that you want to do,
actually execute don't strictly happen in order;
some of them are overlapped and asynchronous.
And so then again, we get to this idea that in some cases, things that are less efficient
might actually be faster because efficiency isn't really our goal.
Our goal is time to execution.
So let's move on to this particular case of loop fusion and loop fission.
Do you know what those terms mean?
Uh, we'll pretend, for the sake of the listener, that I don't know what loop fission is and that this is the first time we're recording this.
I do know what loop fusion is, though. But tell me, what are they both, and what is the difference?
So loop fusion is when you take multiple loops and you combine them into a single loop.
And loop fission is when you take a single loop and split it into multiple loops that iterate over the same index range as the original loop,
or roughly the same range as the original loop.
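A minimal sketch of the two transformations, with hypothetical arrays and element-wise functions f and g (not from the episode):

```cpp
#include <cstddef>

// Hypothetical element-wise operations, just for illustration.
int f(int x) { return x + 1; }
int g(int x) { return x * 2; }

// Two separate loops over the same index range.
void separate(const int* a, int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) b[i] = f(a[i]);
    for (std::size_t i = 0; i < n; ++i) c[i] = g(a[i]);
}

// Loop fusion: the same work combined into a single loop.
// Loop fission is the inverse: splitting the fused loop back apart.
void fused(const int* a, int* b, int* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = f(a[i]);
        c[i] = g(a[i]);
    }
}
```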
And so you said you were familiar with loop fusion prior to this conversation.
And that's not surprising, because I think that loop fusion is perhaps the more intuitive of these optimizations. If I take two things that are iterations over n elements and I combine them together, that's probably a win,
especially if the n elements that they're iterating over are the same set.
If the iterations are always accessing the same data,
then combining those two loops together
seems like it'd just be a win, right? Instead of
iterating through two n times, you're just iterating through n times.
But in practice, that's not always the case. And sometimes it's better to have loops be separate or to even take existing
loops that you have and to split them. And the primary reason why you do this is to get better locality of reference.
I'll make the bold assertion that the vast majority of code in the wild is either memory latency or memory bandwidth bound.
Which essentially means that all of your software is pretty much a fancy version of memcpy. The reason for this is that
we've gotten very good at building processors
that have a lot of compute throughput
at a fairly low latency.
The problem is getting the data
to that execution hardware.
And if you have a ton of data reuse, like, for example, a matrix multiplication, which does O(n³) operations on O(n²) data and therefore has a lot of data reuse, then yeah, you might be able to saturate
your execution hardware,
but that's not what the majority of applications look like.
The majority of applications are,
I read some sequence of data
and I do something with it
and then I go on and do something else
or something like that.
And so for most code, the determiner of your performance is going to be your memory access
patterns. And in particular, how efficiently you make use of your cache and your processor's ability to prefetch, etc., is going to have a huge impact on application performance.
I'll give you a simple example.
Let's imagine that we had two loops.
And the first loop applies a transformation to one data array.
The second loop also iterates over that same array, but it iterates over two other arrays, let's say, as well.
So the first one accesses one array,
the second one accesses three arrays,
and one of those is the same as the first one.
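A sketch of that setup, with hypothetical array names and operations (the actual arithmetic doesn't matter for the argument):

```cpp
#include <cstddef>

// Loop 1: touches only array a -- simple, streaming access that the
// hardware prefetcher handles very well.
void loop1(float* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0f + 1.0f;
}

// Loop 2: touches a again, plus two other arrays, b and c.
void loop2(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// The fused version touches each element of a only once, but every
// iteration now streams three arrays through the cache at the same time.
void loops_fused(float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = a[i] * 2.0f + 1.0f;
        c[i] = a[i] * b[i];
    }
}
```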
So you might think, okay, well, we should fuse those because then we only have to access each element of that array,
the array that both of them access, a single time, right?
Yes.
Okay.
Now, here's the problem.
When you fuse those loops, you might end up disrupting your memory access patterns. So one thing that you might end up doing is you might cause memory accesses that would
previously hit in the cache to no longer hit in the cache.
So pretty much all processors today do a pretty good job of prefetching the data that you need into the cache.
And if you combine those two loops, now we have to prefetch even more data. Because of that, it's possible that we might get cache misses where previously we would have had hits.
So maybe that first loop, which just accesses the one array,
maybe that loop was actually like really optimal and that loop, the compiler and the processor were
going to always ensure that all of the data was prefetched before each iteration. And so
it would always run out of cache and it'd be super fast. But by combining it with the second loop, now you're going to have a cache miss every time. And that's going to make your code
run slower. And of course, this is even more likely to occur if you have loops that have
completely unrelated data that you try to fuse together. Because then there's really no data reuse.
You're just increasing the pressure on the cache of how many different things are going to fit in
the cache with no real reuse between the two. But there's another thing that can happen too, which is a larger loop body can have increased register
pressure. And you should really think about registers as being like an even faster type of
cache. Let's say you've got a processor that has six registers. Well, if each loop iteration only needs six temporary variables for its operations,
then that's great.
You never have to go to stack memory.
Stack memory is always going to be slower than registers.
Hopefully, your stack will be in cache.
It usually is. But on a processor like a GPU,
the difference in memory bandwidth and latency
between registers and even the first tier of cache
is pretty bad.
Like on a GPU, spilling to stack is like really not good.
You never want to do that.
And that's why there's like a ton of registers on
GPUs so that you can avoid doing that. So by combining those two loops,
you might have caused operations that previously did not spill to the stack to have to spill to
the stack. And that will mean that there is more main memory traffic. And also that
there are more things that need to be stored in the cache. And therefore, there's less room
in the cache for the data arrays that you're iterating through.
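A contrived sketch of the register-pressure point, assuming a machine with only a handful of registers (a real compiler would handle a body this small easily; the effect shows up with much larger fused loop bodies):

```cpp
#include <cstddef>

// Split: each loop keeps only a couple of temporaries live per iteration,
// so on a small register budget everything stays in registers.
void split(const float* x, float* y, float* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t0 = x[i] * 2.0f, t1 = t0 + 1.0f;
        y[i] = t1 * t1;
    }
    for (std::size_t i = 0; i < n; ++i) {
        float u0 = x[i] - 3.0f, u1 = u0 * u0;
        z[i] = u1 + x[i];
    }
}

// Fused: all of the temporaries from both bodies are live in the same
// iteration. If that exceeds the register budget, the compiler spills
// some of them to the stack, adding memory traffic and cache pressure.
void fused(const float* x, float* y, float* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float t0 = x[i] * 2.0f, t1 = t0 + 1.0f;
        float u0 = x[i] - 3.0f, u1 = u0 * u0;
        y[i] = t1 * t1;
        z[i] = u1 + x[i];
    }
}
```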
And there's another thing that can happen. As I said, modern processors are pretty good
at prefetching all your data.
And one of the ways they do this
is they look at your access patterns
and they can figure out linear and strided access patterns.
They can be like, oh yeah, for the last 10 iterations
you accessed this chunk of memory sequentially,
or you accessed this chunk of memory with some fixed stride between each access.
And so I better just assume that you're going to continue accessing that stream of memory
and I'll just fetch
where I think the next element's going to be.
But if you're accessing
a whole bunch of different
streams of memory,
the processor has a limit
to how many different streams it can track.
And so it might reach that limit
and it might no longer be able to keep track
of all of your linear access patterns and prefetch accordingly.
And then again, you'll have more cache misses.
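A sketch of the stream-tracking point, assuming for illustration a prefetcher that can only follow a small number of streams at once (the arrays and the limit are made up):

```cpp
#include <cstddef>

// Fused: one loop walks six independent streams of memory at once
// (a, b, c, d, e, and out). If the hardware prefetcher can only track
// a handful of streams, some of these accesses stop being prefetched.
void six_streams_fused(const float* a, const float* b, const float* c,
                       const float* d, const float* e, float* out,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

// Fissioned: each loop walks only three or four streams, which is much
// easier for the prefetcher to track, at the cost of a second pass over out.
void fissioned(const float* a, const float* b, const float* c,
               const float* d, const float* e, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    for (std::size_t i = 0; i < n; ++i) out[i] += c[i] + d[i] + e[i];
}
```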
Yeah, and there are a few other ways and places in which these sorts of effects can come up.
If you've got multidimensional loops,
these caching effects
can be even more pronounced,
because in a multidimensional loop
you have a lot greater potential for reuse in the inner loop.
And also, if you have something like an adjacent difference
or any sort of stenciled operation
where each iteration, you don't just access one element, but you access, you know, three
or four or five different consecutive elements, like a window of elements, you know, if you
fuse a bunch of loops together, then again, that's increasing pressure on the cache.
And then it's possible that you're accessing these five consecutive elements,
and maybe because of this loop fusion,
the last of those consecutive elements got pushed out of cache
by something else that got prefetched into cache
because of all of the stuff that you're trying to do in this one loop.
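A sketch of the stencil case, assuming a hypothetical five-element smoothing window (boundary handling omitted for brevity):

```cpp
#include <cstddef>

// A stenciled loop: each iteration reads a 5-element window of `in`.
// The recently used neighbors need to stay resident in cache so the
// next few iterations can reuse them. (Edges are ignored for brevity.)
void stencil(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 2; i + 2 < n; ++i)
        out[i] = (in[i - 2] + in[i - 1] + in[i] + in[i + 1] + in[i + 2]) / 5.0f;
}

// Fusing it with unrelated work pulls extra data through the cache in
// every iteration, which can evict parts of the stencil's window that
// would otherwise have been reused on the next few iterations.
void stencil_fused(const float* in, float* out,
                   const float* other, float* out2, std::size_t n) {
    for (std::size_t i = 2; i + 2 < n; ++i) {
        out[i]  = (in[i - 2] + in[i - 1] + in[i] + in[i + 1] + in[i + 2]) / 5.0f;
        out2[i] = other[i] * other[i];
    }
}
```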
Now, this isn't to say that loop fusion isn't ever a good optimization. I mean, in many cases,
it is. But it's not always the right thing to do. And every compiler that does loop fusion
has heuristics for when it's profitable to do so. And in some cases, a compiler will split your loops because it determined, hey, you know,
if I take this complex loop that's doing like, you know, a ton of things, and if I just split
it into like three loops that are much simpler, those three loops are going to run faster than
the complex loop because those three loops are going to have really good, really simple and easy memory access patterns.
And the processor is going to do a great job of prefetching the data for those three loops.
And there's never going to be a cache miss.
Whereas this one giant loop that I have, it's a mess.
And it's going to end up with a whole bunch of cache misses.
So that makes sense for the most part.
Although I will say your example where you had three arrays,
one loop that was going over the first one
and then another loop that was going over all three,
including the first one.
I mean, I would guess fusion in that case, in most cases,
would end up being faster, right? Because if you had said, you know, a loop over array A
and then a second loop over arrays B and C, where the index is the same so you could still fuse it,
to me that's a lot more obvious
that you could end up with cache misses,
whereas here the first loop is just a subset
of the data of the second loop.
Isn't the second loop
going to end up fetching just as much as the fused loop,
because the first loop is a subset of the data that you're fetching?
Aren't you going to have the exact same cache miss ratio, or whatever, or am I missing something?
Fusing that first loop into the second loop might mean that the first loop, let's say
that that first loop as written has great memory access patterns and is going to always
prefetch great and run very fast. Fusing it into the second loop could mean that the operations
that constituted that first loop will now cache miss, because this second loop has more complex memory access patterns; it's trying to load more stuff into cache, and some of that stuff starts to fall out of cache and isn't there when we expect it to be.
So it might make the first loop slower.
It might also complicate the second loop. It will add additional register pressure to
the second loop because there's a new operation that you have to perform. And that additional
register pressure might mean that there's spills to stack, which might mean that now the stack has to be in cache
where it previously didn't. And that means that fewer of the data elements can fit in
cache. And the combination of those two loops could also throw off, not the compiler's, but the processor's ability to prefetch the right data,
because it's a more complex access pattern. That could especially be the case if that
first loop is something like an adjacent difference
where it's stenciled,
where it needs to access consecutive elements.
If you have a loop that is accessing consecutive elements,
you need to have not only the next element in cache
but also some of the previous elements:
maybe the next one or two elements
and the last two or three elements.
And so combining those loops can hurt,
when maybe the second loop
only accesses one element from that array per iteration.
You see what I'm saying?
Right. No, yeah.
So I guess that, yeah, the register pressure stuff makes total sense
because of the increased operations.
I guess you had not said this,
but in my head, when you were explaining that example, I was thinking that you had a
single index i, and that was the only index being used for each of the arrays A, B, and C.
But if, in the first loop, like you said, you're looking at index i plus, you know, two
before and two afterwards, and then in the second loop you're only using a single index, like you said, if you're
changing the data access pattern and it becomes more complex, then I can see that. Yeah,
so I think I added criteria to your example that you'd never stated, which is what
caused the confusion. You asked me a question the other day, which was, you know, don't you have to pay for the
cost of the memory loads and stores anyway, like in both cases?
Like, you know, why does it matter?
The way that I had phrased the question, because it was related to a slightly different example,
is that you've got one case where you have two loops, so you're iterating over an array, and however many times you need to load things
into cache, you're going to have to do that twice because you've got two different loops: n times
for the first loop, n times for the second loop. And in the fused case, that potentially results
in some cache misses, and any time there's a cache miss, that's going to result in another load
into cache. So, worst case or whatever, you're going to end up with 2n
loads into cache. So in both cases you have 2n loads
into the cache. Why is one more, like... and you're saying, oh, if we split it into two
loops, that's going to be potentially faster.
And my question was like, I don't understand.
Like if we're loading things into cache the same number of times, sure, one of them is due to some cache misses.
But like why does it end up being so much more performant when you split it into two loops?
Like that didn't make sense to me.
And that brings us back to the start of our discussion where we talked about efficiency in the context of parallelism.
And the intuition here is that you are assuming that I have to wait for those memory loads and stores,
that there's some fixed cost that I always have to pay for those memory loads
and stores. But the thing is, all processors, essentially, are parallel.
There's a unit that's going to do the load and the store, and there's a unit that's going to do my computation. And they can operate asynchronously. And most
modern processors, you know, a laptop or a phone CPU or server CPU, maybe not an embedded CPU,
but most modern processors, GPUs too, will prefetch data for you. And many of them do this thing called out-of-order execution.
And what that means is that in the best case,
yes, you have to pay the cost for that load and store,
but it's hidden from you
because the processor has started that load before the data was needed.
Like, let's say that that load takes a minute. Okay, let's say it takes a minute to load from
main memory every time you do it. I don't care about that minute if the processor initiated the
load a minute before I needed it.
And if the load is happening asynchronously in the background
and the data just happens to be there when I need it,
then we're good.
And that's the power of prefetching
is the data just gets there into the cache when I need it.
And you compare that to the case of the cache miss, where I have to actually go into main
memory, and I have to wait for that operation; I have to wait for its full latency. And this doesn't just come up with data loads and stores, either.
Take division, for example, which is notoriously an expensive arithmetic operation on most processors.
Your processor these days can usually do multiple adds and subtracts and multiplies per cycle. But something like a division can take,
you know, 10, 20, 30 cycles, and could even be a lot more than that.
But if you only need to do a division every 30 cycles, then that's okay. You won't actually have to wait for it
because assuming that you can have the data ready for that division 30 cycles ahead of time,
then you can start it and you can go do 30 cycles of other work and then your division will be done
and then you're good.
Whereas if you don't have that work to overlap,
then you do have to spend the time
waiting for that operation to come back.
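A sketch of that overlap written out in source form (in practice the out-of-order hardware and the compiler's instruction scheduler do this for you; the names here are illustrative):

```cpp
// A sketch of hiding a long-latency divide behind independent work.
// The out-of-order engine (or the compiler's scheduler) does this
// overlap automatically; writing it out just makes the idea visible.
float example(float a, float b, const float* xs, int n) {
    float q = a / b;              // the long-latency divide is issued here...

    float acc = 0.0f;             // ...and while it is in flight, there are
    for (int i = 0; i < n; ++i)   // plenty of independent adds and multiplies
        acc += xs[i] * xs[i];     // to keep the execution units busy.

    return acc + q;               // the quotient is only needed here, so we
}                                 // never sat idle waiting for it.
```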
And we call this latency hiding.
And the parallelism at this level
is not something that's visible
and exposed to programmers.
You don't have to think about how you write this,
because it's something that the processor and the compiler do for you.
But I do think it's important to have the mental model of how this works
and to understand that all processors are inherently parallel,
even if you're programming to a single thread.
And that can really have a big
impact on your performance. Yeah. So basically, yeah, cache misses can mess up prefetching.
And if prefetching is messed up, it's not going to be the same performance profile.
Yeah. And I mean, usually it's more the other way around.
There are two reasons that you'd have a cache miss.
Reason number one is that the data that you needed was not prefetched, either because the processor did not figure out that it needed to prefetch it, or because the processor figured
out it needed to prefetch it but didn't figure it out early enough.
And then reason number two is the data was in the cache, but it got evicted from the cache because there's a lot of pressure on the cache because you need a lot more data than you can fit into your cache.
And there's a number of strategies that you can use in that case. Like for example, one that comes up a lot in
multidimensional loops is something called tiling, which gives you better control over
the amount of data that you need to have in cache. By controlling the size of your tile, you can
control how many elements you need to have in cache.
And then therefore you can make sure that you can fit all the things that you need in cache
and not have to spill to the next level of cache or to main memory.
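A minimal sketch of tiling, using the matrix-multiplication example from earlier (the tile size is an illustrative guess, not a recommendation, and C is assumed to be zero-initialized):

```cpp
#include <cstddef>

// Cache blocking (tiling) for C = A * B on n x n row-major matrices.
// TILE is a tunable parameter: it is chosen so that a tile of A, B, and C
// fits in cache together, so the inner loops reuse data that is already
// resident instead of refetching it. Assumes C starts zero-initialized.
constexpr std::size_t TILE = 64;   // illustrative value, not a universal best

void matmul_tiled(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t kk = 0; kk < n; kk += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                // one tile's worth of the classic i-k-j multiply
                for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (std::size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float aik = A[i * n + k];
                        for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```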
Yep, makes sense.
And more importantly, I think you nailed it, Bryce.
10 minutes.
Yeah, although I got to go get dressed and then run to my thing.
Thanks for listening.
We hope you enjoyed and have a great day.