Algorithms + Data Structures = Programs - Episode 25: The Lost Reduction

Episode Date: May 14, 2021

In this episode, Conor and Bryce talk about maximum consecutive ones and a missing reduction algorithm.

Date Recorded: 2021-05-12
Date Released: 2021-05-14

Show Notes:
- Godbolt Solution
- Thrust and the C++ Standard Algorithms - Conor Hoekstra - GTC 2021
- ADSP Episode 20: GTC 2021 Preview & Thrust Algorithms
- Max Consecutive Ones Leetcode Problem
- thrust::reduce
- thrust::count_if
- thrust::transform_reduce
- std::reduce
- std::count_if
- std::transform_reduce
- C++Now 2019: Conor Hoekstra "Algorithm Intuition"
- Python NumPy
- ranges::fold ISO C++ Paper
- Haskell's foldl
- Haskell's foldl1

Intro Song Info:
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Transcript
Starting point is 00:00:00 Your apartment looks like the saddest place on earth. You're there with a hoodie. The hoodie is for echo cancellation. I figure if I put the hoodie up, it captures more of the sound. The hoodie is for echo cancellation. A line that will go down as one of the greats in the annals of podcasting history. Welcome to ADSP: The Podcast, episode 25, recorded on May 12, 2021. My name is Conor, and today with my co-host, Bryce, we revisit the problem, maximum consecutive ones, and discover a missing algorithm.
Starting point is 00:00:54 So here's the thing. Let's take a moment to explain my place to you. I have a bed. I have a desk. And I have an office chair. And so the echo is awful. I don't have any other chairs. So right now I have carted my office chair from my office, through the kitchen, through the living area, through my bedroom, into the closet. You're in a closet right now? Yeah, it's sort of an open closet. It's a closet without doors in the midst of a hallway. It's not fully closed off; there are drawers that are part of the closet, and I pull out the second drawer, which is my socks and boxers drawer, and that's where I put the mic. And it actually works perfectly. Does this apartment have windows? Because I have not seen evidence that this apartment... Have you not seen my view, bro? No, I have not seen your view. Yeah, okay. That's a pretty good view. It's a pretty good view. Oh, that's a pretty good view.
Starting point is 00:02:11 That's a pretty good view. So you're in downtown now. I'm going to have a better view when I'm in New York. You know why? Yeah, because you're Bryce? You know why I'm going to have a better view in New York? Because I'll be looking at New York City, not Toronto. I mean, take that, Canadians. Here's what I'll say about New York. I visited once, twice, I can't recall. It smells like garbage. You get used to that. You get used to that. It gives it character. The smell gives it character. Do you want to know why it smells like garbage, Bryce? Well, usually because there's a lot of garbage. The same way that in the wintertime there are big piles of snow on the corners

Starting point is 00:02:57 of streets in Toronto, there are big piles of garbage in New York. Bro, you don't even understand. The sanitation workers in New York have a very excellent union, and from, I think it's from Christmas to January, there's like no trash pickup. Like, none. And so during this period of time, some quite substantial piles accumulate. Yeah, but the smell gives the city character. It's a problem, because I like to run, specifically outside, and I can't be inhaling garbage.
Starting point is 00:03:39 It's not that bad. It's not that bad. You're making it sound like it smells everywhere. It doesn't smell everywhere. Central Park was quite nice. And most of the places along... in Vancouver it's called the Stanley Park seawall, that's what they call it. You can go for a walk on a pathway that surrounds the park, right next to the ocean. Something like that exists in New York. Anyways, enough about New York.
Starting point is 00:04:11 Anyways, I got this dope setup, and my apartment's big enough that I'd have to furnish it quite a bit. So I'm pretty sure I'm getting my sister's furniture. My older sister's moving to Calgary, and so I'm acquiring one couch, one shelf, and two patio chairs. Can you please just appoint your sisters to decorate your place? Just for my benefit. I've only met one-third of your sisters, but I'm confident that they're all classy people who can take care of this problem, and it'll make me sleep better at night. I'm definitely going to outsource it to individuals. It's not going to be my responsibility.
Starting point is 00:04:54 But this doesn't explain... You asked about the webcam. So I have an external webcam, which I do not bring. I take my chair. I take my laptop. But the webcam stays on the desk. I can bring the webcam next time, but it's a very... Right now I'm looking at Conor,
Starting point is 00:05:10 and Conor looks like the protagonist in a movie about a rogue hacker. He's super blurry, and he's got his hoodie on, and he's in this white void. The hoodie is for echo cancellation. I figure if I put the hoodie up, it captures more of the sound. The hoodie is for echo cancellation. A line that will go down as one of the greats in the annals of podcasting history. Oh man. Yeah, I wish we had been putting these on YouTube, because the delta from my last place, where I was right next to windows and had great lighting and
Starting point is 00:05:55 had a plant behind me, lots of color... and now I'm just a shadow in the dark. Yeah. All right. So I guess we should probably do our thing, right? Yeah. This is all going in the podcast. That was all gold. And now we will start talking about max consecutive ones.
Starting point is 00:06:18 Yeah, I think I solved it. This is our second time recording this, because at ADSP: The Podcast, we're all about transparency. We recorded this on Mother's Day, and it was a good recording, but I was a bit confused, and I was staring at the code that Bryce was sharing with me, and I figured our listeners were going to be super lost. So we said... No, this is the third time, Conor.
Starting point is 00:06:43 Oh, yeah, yeah, yeah. That's true. That's true. Like three weeks—no, maybe four weeks before that. time, Connor. Oh, yeah, yeah, yeah. That's true. That's true. Like three weeks. No, maybe four weeks before that. So, okay. All right. We should take a step back.
Starting point is 00:06:50 Nah, let's take two steps forward. First, Conor, why don't you explain max consecutive ones? Right, yeah. Let's do a mini recap. So this is probably the second or third time at least that this has come up on our podcast. It was a problem that I solved, or showed the solution to, using parallel Thrust algorithms in my GTC talk that happened in April. And very briefly, to recap: you're given a sequence of ones and zeros.
Starting point is 00:07:19 And the problem is to simply return the length of the longest contiguous sequence of ones. So for example, if your sequence is 1100111, you have basically two sequences of ones, which are of length two and three. So the answer to that problem would be three. And for the full solution, I guess we'll link to our previous episode where we first talked about this, and I'll also link to the GTC talk. It's basically using an algorithm called thrust::reduce_by_key, which first segments the sequence into contiguous elements of equal value.
Starting point is 00:07:58 So you're going to end up with three sequences: two ones, two zeros, and three ones. And then by default, it does a plus reduction, so you're going to end up with the values two, zero, and three. The zeros always sum to zero. And then once you have those three values, you just take a max element. And at the end of that episode, or sort of halfway through, we were both discussing... I said, I think this can be done in a single pass.
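(Show note: for readers following along, here's a minimal sketch of the reduce_by_key approach Conor describes, assuming a non-empty device vector of zeros and ones; the function name and setup are illustrative, not the actual code from the GTC talk.)

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/extrema.h>

// Sketch: longest run of consecutive 1s via reduce_by_key + max_element.
int max_consecutive_ones(thrust::device_vector<int> const& input) {
    thrust::device_vector<int> run_keys(input.size());
    thrust::device_vector<int> run_sums(input.size());
    // Segment runs of equal values and plus-reduce each run:
    // [1,1,0,0,1,1,1] -> keys [1,0,1], sums [2,0,3] (runs of 0s sum to 0).
    auto ends = thrust::reduce_by_key(input.begin(), input.end(),
                                      input.begin(),  // values are the keys
                                      run_keys.begin(), run_sums.begin());
    // The answer is the largest per-run sum.
    return *thrust::max_element(run_sums.begin(), ends.second);
}
```

Note that this is more than one pass over the data, which is exactly what motivates the single-pass hunt that follows.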
Starting point is 00:08:27 Correct. A more efficient implementation existed. Now, we should clarify that. First, Conor asked me, what algorithm would you use to implement this? And I gave the answer that was not what he had in mind, not what he showed in his GTC talk. And then I was like, ah, but I think it can be done in a single pass. And of course, not being able to live with the shame of having failed to accurately identify the algorithm in my own library that did this, an algorithm which two years prior I had taught Conor about,
Starting point is 00:09:05 I was like, I must go solve this. And in your defense, you did, at an earlier date and time, come to me without me even asking, and said, hey, you can solve this problem, or a similar problem, using reduce_by_key. And at the time I didn't know about it. Was that when we met? Yeah, that was probably when we met. That was literally January of 2019, when I think I first gave the Algorithm Intuition talk at the meetup, which we're going to be attending in like an hour. Okay, so after this embarrassing incident, where I failed to correctly identify the algorithm that I had taught Conor about when we first met, an algorithm in my own library, I was like, all right, forget the rest of my Saturday.
Starting point is 00:09:52 I'm going to go implement this. And I went and wrote a single-pass implementation of this algorithm in terms of a reduction. And then I got on the Twitter DMs and I told Conor, come on, man, we've got to go record another episode right now. I'm so excited. And we recorded, and that was some great podcast material that's never going to see the light of day. I should add, you said, let's record a 10-minute addendum. And I think I was in the middle of something, and I was like, I'm sort of busy. And you were like, it'll just be 10 minutes. 55 minutes of addendum later...
Starting point is 00:10:35 And the conclusion of that addendum, because I was still sort of writing the code as I was talking with Conor, the conclusion was, oh, wait, no, there was a bug in my thing. So then I was like, all right, well, cancel that whole 55 minutes. And then I think later that week, I was doing yoga with my mom. I do yoga with my mother twice a week. And inspiration hit me. I realized how I could solve this problem. And I went and coded it up a little bit, but something distracted me, so I never mentioned it to Conor again. But then last week, I was speaking with my friend, Tony Lewis. And we were talking about what belongs in the C++ standard library, my C++Now keynote. But then at the end of our conversation, he mentioned this podcast episode and this problem. And he said, oh, hey, I think I found a solution to it. And he showed me his solution. And his solution was in terms of a scan.
Starting point is 00:11:36 And we'll explain more about that in a little bit. But I looked at his solution, and I was like, aha, yes, not only do you have the right answer, but this reminds me that Conor and I have to do another attempt at solving this problem in a single pass. And then on Mother's Day, we recorded that episode. And afterwards, Conor was like, yeah, the way that we explained that was too confusing. So now attempt number three is today.
Starting point is 00:12:10 And also shout out to all the moms. Happy Mother's Day, belatedly. Hi, mom. Thank you for doing yoga with me. She's a yoga instructor slash lawyer. And she gives me free yoga classes on Zoom twice a week, which is absolutely wonderful. And it's a great way to connect with your mother, especially during a pandemic. And tell me, Conor, what did you send your mother for Mother's Day?
Starting point is 00:12:35 Why are you bringing this up? Because now I've got to cut this out. Our family is not a big gift family. I did a three-hour Zoom call with my three sisters. My dad popped in for a little bit. I was confused. My parents don't listen to this, but I was confused on why we didn't invite him. I felt so bad for my dad because it was a three-hour call,
Starting point is 00:12:53 and I think he showed up at like the hour-and-a-half mark, and I was like, poor dad. He's just off over in a different room, jealous. I sent my mother her favorite wine, two bottles of it, a cheese pairing, and I may have sent other stuff. I don't remember. And then I sent my grandmother chocolates, flowers, probably some other stuff, because I am a good son. Anyways, let's move on to the algorithm. Okay, so the problem we're trying to solve is we've got an input sequence,
Starting point is 00:13:27 and we want to count the longest run of consecutive ones. And we want a single-pass parallel algorithm. So whenever we've got a problem that amounts to counting, this usually boils down to being a reduction. Something like a count, or a std::count, or a max element, etc. You're looking through all of the data to find one answer. So the intuition is, okay, maybe we can write this as a single-pass parallel reduction. Now, almost any time that you're parallelizing a problem, the first thing that you do is you split up the input into sub-ranges. And I think that's where we should start in explaining this. So let's imagine we've got our input sequence and
Starting point is 00:14:31 we split it up into sections. And then we assign some thread to work with each section. Well, let's think about naively what we want each thread to do. We're looking for the longest consecutive run in the entire sequence. So the first thing that seems sensible is each thread should figure out, within its subsequence, what's the longest run in that region, right? Does that make sense? Okay. But that's not actually good enough, because what if the longest run happened to span across two of these regions? Say thread zero has 10 elements and thread one has 10 elements, and the longest run spans from the last five elements of thread zero to the first five elements of thread one. Well, now somehow we need to find that sequence,
Starting point is 00:15:41 and that means that we have to propagate information from left to right, from thread zero to thread one. So how do we do that? Well, the approach that I took and the approach that Tony Lewis took is to create a summary for each region. We'll talk about how that gets created in a minute, but for each sort of chunk of the input, we're going to create a summary. And that summary needs to have three key pieces of information. The first piece of information is what is the longest run in that section. We already said that. That's pretty straightforward. But it also needs to identify what is on the left-hand boundary
Starting point is 00:16:33 of this section. What is the character at the left-most boundary? And how many times is that repeated? So let's say, like we said, thread zero has 10 elements. Let's say that the first two of those elements were zero. Well, then on the left side of thread zero's chunk, there is a run of two zeros. And then let's say at the end of that 10-element section, the last three
Starting point is 00:17:06 elements are ones. So on the right of thread zero's section, there's a run of three ones. And we need that information about what is the first run in each section and the last run in each section. Because when the thread with the next section, the next consecutive section, is looking for its longest run, it needs to ask its neighbor, right? Because it needs to check, hey, I've got a run at the start of my sequence.
Starting point is 00:17:46 I have a run of, you know, four ones at the start of my sequence. And I need to know, is there a run of ones at the end of the elements that, you know, are right before my elements? And if there are, then I need to combine that run with this run because that might be the longest run. Does that still make sense? Yeah, it makes complete sense. So how do we get these summaries? Well, this summary is a data structure. The way that I structured this data structure is, you know, first I created a data structure called a run.
Starting point is 00:18:32 And a run is just a value and a size. The way that I chose to solve this problem, I solved the more general case: counting the max consecutive number of elements of the same value, and also recording what that value was. So a run is a value and a size. And then a summary consists of three runs. There's the first run in that section. There's the last run in that section. And then there's the longest run in that section. And then there's also a Boolean flag that I need to keep, because there's a special case: if the entire section is one element repeated the entire way through, I have to handle that a little bit specially. But that's not super important. Well, it's important to get it right, but it's not critical to the algorithm. So our input is some value type. Let's just call it T.
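(Show note: a sketch of those two data structures for the generalized longest-run-of-equal-values problem; member names here are illustrative, not necessarily the ones in Bryce's actual code.)

```cpp
#include <cstddef>

// A run: some value repeated `size` times.
template <typename T>
struct run {
    T           value;
    std::size_t size;
};

// A summary of one contiguous section of the input.
template <typename T>
struct summary {
    run<T> first;    // run touching the section's left boundary
    run<T> last;     // run touching the section's right boundary
    run<T> longest;  // longest run anywhere inside the section
    bool   uniform;  // special case: the whole section is one repeated value
};
```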
Starting point is 00:19:36 So when we take two Ts and we feed them into our reduction operator, that's how we generate the first set of summaries. When we reduce two elements, we can create a summary that describes those two elements. Now, likewise, if we have a reduction over one of these summaries and one element, we can also combine that into a summary. Because, if you think about it, if we have just one element in isolation, it's pretty easy to construct a summary for it. The first run in one element is the element itself, and its length is one. The last run for one element is the element itself, and its length is one. And the longest run is the element itself, and its length is one. And likewise, in the more general case, when we have a summary on the left side of the reduction
Starting point is 00:20:39 and a summary on the right side of the reduction, we can combine those two and produce a summary that describes the entire region. But for all of this to work, we need a guarantee. And the guarantee is that the left-hand side of the reduction has to contain information for a set of elements that are before, and consecutive with, the right-hand side. So let's go back to the case of just two elements. Our reduction operator needs the guarantee that if you call it on two elements of the input sequence, you're calling it on two elements that are right next to each other, that are consecutive, and that there are no elements in between them.
Starting point is 00:21:35 And then that produces a summary. And if you reduce two summaries, you need the guarantee that, hey, these are summaries for two regions of elements that are right next to each other. So what do we call this requirement? Well, specifically what we're asking for is a reduction that will not commute the input elements. Commutativity is the property of an operation that says A plus B is the same thing as B plus A. But that doesn't work for this algorithm. We need a reduction that promises that it won't reorder the inputs in that way.
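(Show note: to make that combining step concrete, here's a sketch of an operator over the structs above. It is not Bryce's actual operator, and it relies on the guarantee just described: `l` must summarize the elements immediately preceding `r`'s. A full solution also needs the element+element, summary+element, and element+summary overloads described above.)

```cpp
// Combine summaries of two adjacent regions (l immediately precedes r).
template <typename T>
summary<T> combine(summary<T> const& l, summary<T> const& r) {
    summary<T> out;
    // A run may bridge the boundary: join l's last run with r's first run.
    std::size_t bridge =
        l.last.value == r.first.value ? l.last.size + r.first.size : 0;
    // The combined first run extends into r only if l is uniform.
    out.first = l.first;
    if (l.uniform && l.first.value == r.first.value)
        out.first.size += r.first.size;
    // Symmetrically for the combined last run.
    out.last = r.last;
    if (r.uniform && r.last.value == l.last.value)
        out.last.size += l.last.size;
    // The longest run is the max of both sides and the bridged run.
    out.longest = l.longest.size >= r.longest.size ? l.longest : r.longest;
    if (bridge > out.longest.size)
        out.longest = run<T>{l.last.value, bridge};
    out.uniform = l.uniform && r.uniform && l.first.value == r.first.value;
    return out;
}
```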
Starting point is 00:22:26 So does C++ have that today? Well, there's std::accumulate. std::accumulate promises that it won't reorder the inputs. But std::accumulate actually makes even stronger guarantees than that. It promises that it will associate the operations in a very specific order. And the order that it prescribes completely prevents parallelism, which kind of defeats the whole point of what we're trying to do. We want a parallel max consecutive ones. So accumulate will work for non-commutative and non-associative operations. And what we have here is an operation that is non-commutative, but it is associative. Okay, so there is another option, which we added in C++17.
Starting point is 00:23:20 It's called std::reduce. std::reduce is like the opposite end of the spectrum from std::accumulate. std::reduce is allowed to arbitrarily reorder the inputs, and it's allowed to perform the operations in whatever order it wants. It's allowed to associate the operations in any way that it wants. So std::reduce only supports things that are both commutative and associative. But the benefit of it is that it can be parallelized. But this presents a problem for us, because, again, as I said, we have an operation here that is associative but not commutative. And we want to do a parallel reduction on it.
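(Show note: a tiny illustration of the contrast, using string concatenation, which is associative but not commutative. This is a sketch: the std::reduce result below is unspecified, not a compile error.)

```cpp
#include <numeric>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words{"a", "b", "c", "d"};
    // std::accumulate folds strictly left to right:
    // ((("" + "a") + "b") + "c") + "d" == "abcd", guaranteed.
    auto ordered = std::accumulate(words.begin(), words.end(), std::string{});
    // std::reduce may regroup AND reorder the operands; concatenation is
    // associative but not commutative, so an implementation could
    // legitimately produce "cdab" here.
    auto scrambled = std::reduce(words.begin(), words.end(), std::string{});
    (void)ordered; (void)scrambled;
}
```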
Starting point is 00:24:04 And we simply don't have that primitive in C++ today, or in Thrust today. In our previous recordings, I hypothesized that Thrust's reduction might actually make this guarantee. That Thrust's reduction might actually not commute the inputs, and so it might actually work for this, even though it's not documented that way. But one of my colleagues later pointed out, no, it actually does in some cases reorder the inputs. Because oftentimes when you're computing things on GPUs, you want to sort of swizzle or transpose the input elements, or instead of doing things exactly in order, you want to do things by slices so that you get memory coalescing.
Starting point is 00:24:58 Now, the interesting thing is we don't have any reduction operation that does this. But there's another class of algorithms called scans. A reduction gives you back a single result. A scan gives you back the partial sums; it gives you back the result of the accumulation incrementally. Yes, incrementally, it gives you every increment. And partial sum... yeah, partial sum is actually borrowing from the pre-C++11 algorithm name, but arguably that's a suboptimal name, because the 'sum' implies you're sort of hard-coding
Starting point is 00:25:45 the semantics of the default binary operation, plus, that comes with the algorithm into the algorithm name. Which is why scan, the name we use for the C++17 parallel versions, is better: even though it's maybe not evocative of what the algorithm is doing, you can use a partial sum to do a partial max or a partial min or a partial product. It's basically a reduction where at every step of the way you're outputting the incremental result at that point. So to be able to output that incremental result as you go, you can't commute the inputs, right?
Starting point is 00:26:24 Right, correct. Because if you do, then you're going to produce the wrong partial results. So these scan operations, the C++ parallel scan operations, like std::reduce, do not support non-associative operations, but they do support non-commutative operations. And that's exactly what we need. We have a reduction operator that is non-commutative, but it is associative. So one option would be: okay, well, I want to do a reduction that's associative but not commutative. I don't have a parallel way to do that with
Starting point is 00:27:15 std::reduce, so I could just use a scan, right? And in fact, I think we've learned, since we came across this problem, that that's what some of our users do. But this is, of course, inefficient. If you just need the one final result, it's quite inefficient to have to write out all of the incremental sums somewhere.
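(Show note: a sketch of that scan workaround, reusing the run/summary/combine sketches from earlier plus an assumed make_summary helper, and assuming a non-empty input. It finds the longest run of any value; for max consecutive ones specifically, you'd also check the run's value. C++17 scans may re-associate but may not commute, so the non-commutative operator is legal here; the cost is materializing every partial summary just to read the last one, which equals the full reduction.)

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Assumed helper: the summary of a single element, as described above.
template <typename T>
summary<T> make_summary(T x) {
    return summary<T>{{x, 1}, {x, 1}, {x, 1}, /*uniform=*/true};
}

std::size_t longest_run_via_scan(std::vector<int> const& input) {
    std::vector<summary<int>> partials(input.size());
    std::transform(input.begin(), input.end(), partials.begin(),
                   make_summary<int>);
    // Legal: inclusive_scan may re-associate but never commutes.
    std::inclusive_scan(partials.begin(), partials.end(), partials.begin(),
                        combine<int>);
    // The last partial result equals the full reduction.
    return partials.back().longest.size;
}
```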
Starting point is 00:28:01 So I think the conclusion of our discovery and our exploration, our journey, our journey to this algorithm... that's the word I was thinking of, our journey... is that we are missing an algorithm. We are missing a reduction that supports non-commutative operations but requires that operations be associative. And we can't just change std::reduce to support this, because it turns out that in our implementation, in Thrust and for GPUs, and also in some other implementations, we take advantage of the freedom that std::reduce gives implementations to reorder the inputs. That's actually a useful thing. So no, what we need is a new algorithm. I don't know what we call it. Do you have any ideas on what we call it? Well, didn't we bring this up... you know, we can't talk about all the content of the meeting, but we had a meeting earlier in the week at NVIDIA with some ISO folks, and didn't Michael Garland, who's the head of research,
Starting point is 00:28:55 say... We should clarify, for Michael Garland and Bill Dally, who's the actual head of research... Oh, am I fired? Michael Garland is the director of our parallel systems and algorithms research group. He is not the director of all of NVIDIA research. My bad. This is why Bryce is climbing the ladder. I just type this stuff into the keyboard. But yeah, Michael suggested associative reduce. And I think even before trying to name it...
Starting point is 00:29:32 It's a terrible name, by the way. ...The reduction tennis match that Bryce and I have been having live on this podcast has been monumental, because reductions are so important. They're so fundamental. I mean, Ben Deane has a whole talk called std::accumulate: Exploring an Algorithmic Empire, where he shows that like 77 of the 90 algorithms that exist in the standard library of algorithms in C++ are implementable in terms of a reduction. And one of the remarks that I made the very first time I gave my Algorithm Intuition talk is that std::accumulate is a bad name: introducing C++17's std::reduce, which is a much better sort of generic name.
Starting point is 00:30:32 know introducing c++ 17 stood reduce a better named version of stood accumulate because it's not fundamentally stood accumulate and stood reduce are two different algorithms and basically what what this episode is, and this is probably going to end up going closer to an hour because I think it's important, is that we're, you know, a stood accumulates on the left and it says, we don't require anything. Commutative, non-commutative, associative, non-associative, we don't care. We'll make it work. And stood reduce says, hold your horses. You need both associative and commutative. Otherwise, no bueno. We're not doing anything.
Starting point is 00:31:07 And basically what we've identified here is, oh, we have a problem. Technically, we could solve it using a scan, because scans don't require commutativity. And that's what we have here. But we don't want a scan. We want an equivalent reduction that does the exact same thing, but just doesn't require that commutativity. And enter what Bryce just said: we're missing this algorithm. And users actually are scanning, because one of the properties of a scan is
Starting point is 00:31:33 that the last element of a scan is always equal to the equivalent reduction. And if that's basically what people are doing, just scanning and then taking the last element, there's basically perf being left on the floor. And reductions, in my opinion, are so fundamental that I think we should definitely be adding this. And it's not the first time... well, it's the first time this has come up where I've been like, wow, this is important. But it's actually come up in meetings on RAPIDS, the team that I work on,
Starting point is 00:31:57 But it's actually come up in meetings on Rapids, the team that I work on, where people have pointed out, oh yeah, if we just sort of had like a parallel fold left, we could do this a little bit more efficiently. But that was sort of the, you know, that was the last thing that was said. And then we just moved on. Because, yeah, you know, it's so it's so interesting to, to see you to have seen you reach that conclusion, Connor. And I'll tell you why. Because when you and I first met,
Starting point is 00:32:29 you were already an algorithms wizard. And we've told the story before of how we first met, when you gave a talk at the Bay Area C++ user group on algorithm intuition. And afterwards, you and I chatted, and at least two of the things that we chatted about were: one, I told you about reduce_by_key,
Starting point is 00:32:54 but we also talked about... you had suggested that reduce was a better-named version of accumulate. That was the first place where you'd suggested it. And I pointed out to you that in some of your examples, it didn't necessarily work, because you were using some sort of heterogeneous reduction, similar to the one that I just described, where the reduction operator produces a different type. Like this max consecutive ones algorithm I described: in that algorithm,
Starting point is 00:33:27 the input types are Ts, but the reduction operator returns summaries. And then it also has overloads that take summaries and Ts, and overloads that take summaries and summaries. But correct me if I'm wrong: at that time, when you gave that talk, before you came to work at NVIDIA, you worked on algorithms, but you hadn't really worked on parallel algorithms. Is that right? Correct, yeah. And now that writing parallel algorithms is the default mode that you're in, you've come to appreciate the subtlety and the distinction between those two, and how important these semantics are. And it's just so fascinating to me to see you have reached that point. And I think it speaks to something that's really fundamental
Starting point is 00:34:27 about parallel programming, which is that you need this whole different mindset. When I think about a problem, about how to implement an algorithm... I'm not the best algorithms person. I will freely admit that. That is not my expertise or specialty. If you put Conor and me in a room and ask us to solve an algorithm, Conor's probably going to do a better job of it than me. But I have a bit more experience
Starting point is 00:34:57 with parallel programming than Conor. So if you put us in a room, my default reaction is going to be to try to solve the problem in parallel. That's just how I've wired myself to think. And one of the first tools that I reach for is: is there a way that I can express this in terms of a reduction? Or, what parallel primitive can I express this in? And actually, I think, Conor, that you and I have a joint talk in our future.
Starting point is 00:35:25 A joint talk? I'm sure there's at least one joint talk. And I think that talk is going to be, like, how to write reduction algorithms. Because you're right, they are so powerful. And one of my takeaways from this whole experience has been that they're even more powerful than I had thought, to the degree that a large chunk of my job seems to be identifying how people can rewrite something as a reduction. Right. And that's the thing: you're much more of a 10-year, 20-year thinker about what the future of programming looks like. But when I picture it in my head, there are sort of two routes, as it stands, you can take to writing some parallel program. You can write
Starting point is 00:36:25 CUDA kernels, or you can write your program in terms of algorithms. And I think, at least for me, it's a lot easier to express or solve a problem in terms of algorithms that other people have written. And that might just be my functional programming or APL bias. I'm used to reaching for maps and reductions and partitions and different algorithms and piping them together. That's basically the Thrust model. You have a collection of algorithms; you find a way to compose your problem and select them, or you can go and write a custom CUDA kernel. And I think Lesley Lai posted once on Twitter a Drake meme, the one from that song where he's got the no face and the yes face. And he says
Starting point is 00:37:12 custom CUDA kernels: no; Thrust algorithms: yes, please. Or whatever. And don't get me wrong, there's a time and a place in RAPIDS when we need a CUDA kernel, for optimal efficiency, or when we can't express something in terms of Thrust algorithms, which are our go-to tool. We do that. We're currently working on a PR right now that is doing that. But I think a large percentage of problems can be easily expressed in terms of a reduction
Starting point is 00:37:40 or an algorithm that's in Thrust or CUB. And I think that's going to be way easier. You know, you use the on-ramp analogy all the time: it's a way easier on-ramp to go from, oh, I know how to use the std:: algorithms, to the thrust:: algorithms. And seeing that has made me appreciate... you mentioned that in my talk I had some examples. I can pinpoint one of them, and it was implementing count_if in terms of a std::reduce. And that doesn't work, because basically your accumulator is going to be some size_t or some int32, which is keeping track of how many. And then the element that you're currently looking at could be anything.
Starting point is 00:38:24 It's just a value type. And so, what if you're counting the number of 42s that show up in a list of numbers? Well, with a std::reduce, or usually in accumulate, your accumulator is on the left. That's your running sum of how many times we've seen this. And then your value, whether that's a 42 or a non-42 number, is going to be the right element in your binary operation. But std::reduce could swap those around.
Starting point is 00:38:47 And then accidentally, instead of doing a plus one because you've encountered a 42, you actually add 42 to your running sum. So it completely breaks. And I think I had an idea at the time that I was writing this code and it didn't work. But I was just like, yeah, but I just like std::reduce. It's just a better name. And so I don't want to type std::accumulate on my slide where I want to type std::reduce. And that was, I think, an okay mistake to make at that time.
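(Show note: a minimal sketch of that mistake, with made-up values.)

```cpp
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 42, 4, 5, 6, 42};
    // Intended: acc is the running count (left), x is the element (right).
    auto count = std::reduce(v.begin(), v.end(), 0,
        [](int acc, int x) { return acc + (x == 42 ? 1 : 0); });
    // std::reduce is permitted to commute the arguments, so `acc` may
    // actually receive an element (say, 42) and `x` a partial count; the
    // lambda then folds element values into the sum, and the result is
    // garbage. std::accumulate with the same lambda is guaranteed correct.
    (void)count;
}
```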
Starting point is 00:39:14 But in hindsight, I'm definitely sad that I did this. I wonder whether you could do that count_if implementation with the missing reduce algorithm. You can do it with the missing reduce, aka associative reduce, or whatever the better name ends up being, but there's already an existing algorithm that you can solve that problem with. Oh, play along! We're playing the play-along game, where Bryce guesses the algorithm. See, I would have assumed that it was a reduce. So if you're an APL programmer listening to this problem,
Starting point is 00:39:51 here's the problem. We're implementing count_if, which is an algorithm that, given a list of values (but we'll simplify it and just assume we have a list of numbers), returns the number of times a number shows up. So assume you've got the list 1, 2, 3, 42, 4, 5, 6, 42, and you're searching for 42. count_if will return 2. And if you're an APL programmer or a J programmer or an array-oriented thinker,
Starting point is 00:40:17 you've already solved this problem in your head, guaranteed. And this is exciting. So in Thrust, we implement it as a transform_reduce. Woo, there we go. That's a ding, ding, ding. Which is awesome, because this is exactly how you solve it in APL. In APL, what you would do is you'd write 42 = your list,
Starting point is 00:40:39 and that's going to turn it into a Boolean array of ones and zeros. So it'll be 0 0 0 1 0 0 0 1, and then you just do a plus reduction. And in C++ or Thrust, you're going to do the exact same thing with transform_reduce. Note that this is the transform_reduce that takes just a single range, so it applies a unary operation to transform the integers into ones and zeros, or trues and falses, and then you do a plus reduction.
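(Show note: in Thrust, that looks roughly like the sketch below; the functor name is illustrative. Since the reduction operator here is plain plus, which is both associative and commutative, any flavor of reduce is fair game.)

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// The APL `42 = list` step: map each element to 1 or 0.
struct equals_42 {
    __host__ __device__ int operator()(int x) const { return x == 42; }
};

int count_42s(thrust::device_vector<int> const& v) {
    // Unary transform to 0/1, then a plus-reduction to sum the ones.
    return thrust::transform_reduce(v.begin(), v.end(),
                                    equals_42{}, 0, thrust::plus<int>{});
}
```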
Starting point is 00:41:11 reduce with a with a binary right right and the idea that we i'm sure we've talked about it before but the idea there is that as a transformation you you you take you do a binary transform where it takes two inputs. And what you do is you look at the left element, like two consecutive elements at a time, and you return whether that's the end of a word. And you do that by checking is the character on the left a non-space and the character on the right is a space? Okay, then it's an end of the word.
Starting point is 00:41:43 And so that produces an intermediate sequence of ones and zeros, where each one represents the end of a word. And then you just reduce that. And it's so elegant. Right. And that algorithm, the transform_reduce that takes two ranges, is just using the zip-tail trick. That could be an algorithm. And I actually presented it in the same Algorithm Intuition talk, not a parallel version, but it's easily parallelizable with the implementation you explained, where it's called adjacent reduce.
Starting point is 00:42:13 So it's basically exactly similar to adjacent difference, or what should have been named adjacent transform, or adjacent find, where you're looking at adjacent elements. And then at the end of the day, for this, we're just doing a reduction: basically a plus one whenever we see a character and then a space, because that denotes the end of a word.
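(Show note: a sketch of the word-count idea using the two-range C++17 std::transform_reduce, i.e. the zip-tail trick of zipping the sequence with itself shifted by one. The handling of the final character is an assumption about how one would finish it off.)

```cpp
#include <functional>
#include <numeric>
#include <string>

std::size_t word_count(std::string const& s) {
    if (s.empty()) return 0;
    // For each adjacent pair, emit 1 if a word ends there
    // (non-space on the left, space on the right); then plus-reduce.
    std::size_t ends = std::transform_reduce(
        s.begin(), s.end() - 1,   // left elements
        s.begin() + 1,            // right elements (the "zip tail")
        std::size_t{0}, std::plus<>{},
        [](char l, char r) -> std::size_t { return l != ' ' && r == ' '; });
    // The pairwise pass can't see a word ending at the final character.
    return ends + (s.back() != ' ');
}
```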
Starting point is 00:42:33 Yeah, that's a great algorithm that shows up all the time. The thing that makes reductions so powerful is that they're a way to pass global information efficiently. Like, I want to do something that requires me to know something about all the elements in this sequence, that requires all of them to in some way communicate. And a reduction is a super efficient way of doing that communication, because that communication builds up over time.
Starting point is 00:43:13 In my previous job, when I worked on HPX, one of our fundamental tenets was to always localize communication. And reduce really, really, really localizes communication, because it ends up localizing it down to two actors at once. And on one of the points you made earlier, about how much more powerful a language like APL can be for parallel programming than writing things by hand: I think you're really right about that. And I think that there's still a revolution to come there. There's a lot of interesting research these days, and a lot of interesting
Starting point is 00:43:58 things that have been put into production around these machine learning graph compilers. They're these compilers where the underlying IR, the underlying operations that are supported, are all fundamentally parallel operations. Now, they're parallel operations that are sort of specific to the domain of machine learning, but I think that sort of technology is applicable to just general sequence algorithms. So imagine an assembly language where every one of your operations does not operate on scalar data, but operates on sequences of elements, and where it's all parallel. One of the nice things about this model, where you express your code in terms of some
Starting point is 00:44:55 known corpus of algorithms, is that, in comparison to accelerating arbitrary code, accelerating a collection of calls to known algorithms is way easier. Because all you have to do is implement those known algorithms efficiently. And especially, it makes it easier to build specialized hardware, specialized accelerators, which is increasingly important in our post-Moore era. So I think that there is another revolution of functional programming languages that's yet to come. You know, the parallel APLs of the future. Who knows? Maybe you'll make one. Maybe it'll be you. Oh, Bryce, you've just done this so that you can say, I remember back on episode 25 when I planted that seed. And I've already been thinking about this since I fell in love with APL. When you started working at NVIDIA, do
Starting point is 00:45:58 you remember what I told you you were going to do? I told you you were going to build for me the next parallel programming language. Actually, if we're going to be technical about it, that was even before I started at NVIDIA. That was before you knew that I'd be starting at NVIDIA. Yeah, I think technically what happened is I had left my job at Amazon in like mid-September of 2019.
Starting point is 00:46:52 Or I don't even know. It was like it's the main area of that like lodge. And then we were sitting on the couch. And there was a couple people around. And you were like, all right, here's our codename project. You already had come up with a codename for a project that didn't exist yet. And yeah, it's an exciting space. Graph compilers right now, sort of they only exist in these, you know,
Starting point is 00:47:17 machine learning applications. And I don't know much about them, but it definitely seems like there's a use case for a more generalized graph compiler, where it's not just linear algebra that's sitting on top of it. It's a language. And yes, it's going to be a constrained language, but that's basically what the APL family of languages are. And honestly, I've probably mentioned this before, but if you look at the popularity of the most popular Python libraries: what's the number one Python library? It's NumPy. And NumPy is basically just a variant of APL.
Starting point is 00:47:58 Half of the functions in NumPy get their names directly from APL. Reshape, shape... I don't know all of them, but there are articles that have been written about how NumPy is a ghost of Iverson, the individual who created the language. Anyways, I want to circle back. We're going full circle to like 30 minutes ago, when you asked: what should the name of this missing algorithm
Starting point is 00:48:38 important part the important thing is that it does not commute yeah it doesn't require commutativity yeah now but non-commutative reduce be a bad name too because i think it's the more common case than reduce so and i guess maybe a question too is if if we could you know what should we name this missing algorithm and on top of that if we could name stood accumulate stood reduce and this lost algorithm you know carte blanche if that's if that's the correct way to say that. Oh, and also, I'm going to just interject here. I meant to say this earlier, and if I don't say it now, I'll forget. There were two people on Twitter that I actually, I believe, came to the solution that we explained.
Starting point is 00:49:18 So Jean-Marc, I'm going to mispronounce your last name, I apologize: Bourguet, B-O-U-R-G-U-E-T. And then also Toby Isaac, I believe. Both have a solution with this sort of summary object that's got four pieces of information: the runs at the left and the right, and the max. So we weren't alone. Well, and we have to thank Tony Lewis, who, if he had not mentioned this at the end of our Zoom call, I would have totally forgotten, and this episode never would have happened. So thank you. Yeah, thank you, Tony.
Starting point is 00:49:57 And I think it's really important. This is definitely going to be one of our longer episodes, folks, especially if I include the stuff about my sock drawer at the beginning. So what do we call these? What do you think you'd name it? Well, there's fold. But C++23 ranges are probably going to use fold for something. I just finished telling Barry Revzin, a couple hours ago, because he was asking me about the naming of the proposed C++23
Starting point is 00:50:31 reduction algorithm. And I said, I think I'm going to write a paper purely on the naming of fold, because we already have std::accumulate and std::reduce, and I don't think we're going to... How do you feel about fold? Like, Barry wants fold-with-initial-value to have a different name than just fold, and I want the Python thing, where they have the same name and they're just overloads. I honestly don't know how I feel, because I've looked at too many languages and have forgotten. You know, I know Haskell does foldl and foldr, and then foldl1 and foldr1. Barry just linked me... apparently Rust used to call it fold_first, and they just finished renaming fold_first,
Starting point is 00:51:20 which I believe is the one without the initial value, to reduce. And I think that's what... Wait, wait, whoa, whoa. How can fold_first be the one without the initial value? I completely misunderstood those names, because I assumed fold_first was the one with the initial value. Let's open Slack so I don't get this wrong. I told Barry to talk to you too, by the way. I don't know if I CC'd you on that email. So, key management and leadership tip for people: quickly learn how to hand off emails and tasks whenever possible to someone else, even if that person doesn't technically work for you.
Starting point is 00:52:03 Well, you've got to give me some name ideas for this new algorithm, because I am going to go make somebody implement this in Thrust. Actually, there's a chance that the person who's going to be implementing this, who is starting at NVIDIA soon, is listening to this right now and is wondering: when he says, I'm going to make somebody go do that, does he mean me? The answer is yes. If you're starting at NVIDIA on my team in like two weeks, this will probably be a task that you'll do. So I still don't know, because I need to read the details, and I will go and fix this in post. But one of the alternatives for fold_first
Starting point is 00:52:39 that was being suggested was try_fold_first, which almost definitely makes me think that it is the one that doesn't take an initial value, because the 'try' implies that you might fail. And the way you fail is if you have an empty list. So I'm almost positive that fold_first is the one that means: use the first element of the list as your initial value. And so, yeah, Haskell has foldl and foldr, and those take initial values, and then they have foldl1 and foldr1, which don't take initial values. A lot of languages just choose two different algorithm names, like fold and reduce, and you just need to know, oh, the reduce is the one that doesn't take an initial value. I think that's basically what
Starting point is 00:53:22 ended up happening in Rust. And what I'm going to do is go and look at every single language. And that's the thing: on top of initial values or no initial values, there are these associativity and commutativity requirements. Yeah, it would be really interesting to see a survey of what other languages call these, and whether the associativity and commutativity is respected in the various names of those things. So yeah, go make me one of your pretty tables. Yeah, it'll be in the form of a paper. Maybe I'll make a YouTube video as well. Honestly, I think that if not for the fact that we were going to use the name fold for the ranges version of accumulate,
Starting point is 00:54:11 I would probably use the name fold for this. If I were doing this from scratch... I don't know, there's a part of me that actually wants the requirements of associativity and commutativity to be represented by a word like par, a par_ prefix, or put in a namespace, so that the parallel algorithms could have been std::par::, and then we could have just used the same names. Arguably... it's actually interesting: if you look at libstdc++'s implementation of std::reduce, not the parallel version, but just the regular one, you'll see that even if you're not worried about parallelism, they have a more efficient implementation for reduce, because they don't have to avoid commuting, and they can just associate the
Starting point is 00:55:07 operations however they want. So it's not necessarily inherent to parallelism. So yeah, I feel like there's a space, though, for just calling the accumulate fold_left, then having a fold_right version just for symmetry, and then having parallel_fold_left, parallel_fold_right, and parallel_fold. You're not explicitly pointing out that, oh, the parallel fold left doesn't require that your operation be commutative, but that's implied. Because I think what's actually more important is the behavior. And honestly, at least for me, until I got hit over the head with commutativity and associativity, like, four
Starting point is 00:55:50 different times, I could never remember which one was which. And actually, one of the things that helped is that in APL, there's an operator called commute, which basically swaps the order of arguments to a two-argument function. So it's a very classical thing to do: if you have a minus operation, or you're trying to do a minus reduction on something, and you need to swap it, well, instead of writing some custom lambda, or what they call a dfn in APL, you just apply the commute operator to the minus operation, and with a single character, you swap the order. Subtraction is a great example of an operator that is not associative, that does not support, you know, std::reduce. Now, you could use it; it's just that,
Starting point is 00:56:35 you know, you might get unexpected results. Whereas addition... and when I say subtraction, I mean subtraction on the set of natural numbers. Likewise, addition on the set of natural numbers is a good example of something that does support commutativity. Sorry, I said associativity there; I meant commutativity for all of what I just said. Yeah. I know that in q, which is an array language, they have parallel versions of some of their functions, and they just prefix them with a p. So they have, like, each, which... I don't... I think that that's wrong, though.
Starting point is 00:57:14 I think that these semantics are not about parallelism. So you don't like the fold_left, parallel_fold_left, and parallel_fold? No, I don't. I don't think it's about parallelism. I think it's about commutativity and associativity. But I think those are just words that are too long to put into the name. So what is your take on 'parallel' plus 'left' implying that you don't need a commutative operation, and 'parallel' plus plain 'fold', or whatever, requiring both of them?
Starting point is 00:57:53 but is there is there no way to name the algorithms that like doesn't make that explicit and overwhelm the user or should we just expect that in 20 years parallelism is going to be everywhere and it's just going to be the same way that learning what's to accumulate does and how important that algorithm is we're going to expect people to know what whatever the different variants of these parallel versions are so as i often say one of my goals is to normalize parallelism um uh and i i just think that the way that the algorithm is executed should be completely divorced from the the the semantics that it expresses like you should like i stood reduced and like this is why there is both a, when we had to introduce a new, like a std reduce because we couldn't parallelize std accumulate, we didn't just add a parallel version of it. We didn't just call it parallel accumulate.
Starting point is 00:58:56 There is both a std::reduce that takes an execution policy and one that doesn't. Right, right. And I think it's absolutely essential that you separate parallelism from the semantics of the algorithm. Right, right. Yeah, I'll have to think. I'll have to write the paper. And my guess is that one naming scheme that would probably work is one that's very clear. Like, maybe non-commutative reduce isn't the worst name in the world. Actually, if you look in the wording for std::reduce and for the scan algorithms, we have these things called generalized sum and generalized non-commutative sum.
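(Show note: circling back to the execution-policy point a moment ago, a tiny illustration of separating semantics from execution; whether the par version actually wins depends entirely on the implementation and hardware.)

```cpp
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 1.0);
    // Same algorithm, same requirements on the operator; only the
    // execution policy differs.
    double a = std::reduce(v.begin(), v.end());
    double b = std::reduce(std::execution::par, v.begin(), v.end());
    (void)a; (void)b;
}
```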
Starting point is 00:59:57 So maybe non-commutative reduce isn't the worst name in the world. Again, I think that it's actually more useful than reduce. I'll boldly claim that. I mean, is that even a bold claim? It's a less restricted algorithm. Of course it's going to be. Though if you don't need the guarantee... right, if your operator is commutative, you may take a big performance hit from invoking the version of the algorithm that assumes that you can't commute. I'd actually be interested, maybe if I ever come out of retirement from speaking, in profiling this kind of stuff on a variety of problems. It would be interesting to see what the impact is.
Starting point is 01:00:47 Yeah, to explain why we'd have to do a whole separate episode talking about memory coalescing. And, you know, we've talked a few times about how to get Olivier on this podcast to explain to people what GPUs are. And maybe that's the way to do it is that we tell him, hey, because the last time we talked to Olivier, he was like, that's such a big topic. You have to give me some parameters. And so maybe the parameters that we give him is we tell him, yeah, come on and explain memory coalescing.
Starting point is 01:01:13 Explain how the world of memory traffic is completely different on CPUs and GPUs in terms of what's optimal. Boy, this is just going to be a really difficult episode for you to cut. There's no clear middle point. I don't think I'm going to split this. I think I'm just going to release it all as one, because... It deserves that. But the audience should understand that normally, when Conor and I record,
Starting point is 01:01:42 we record on Saturday mornings for me, Saturday afternoons for him. Normally we record for an hour and a half, and that becomes like three episodes. This is not that. Yeah, we're at, what, an hour and 15 minutes, and I really want to include the sock stuff. I've only got an hour and five minutes here on my recorder. Remember when we started this podcast, how long did we say episodes were going to be? My goal was 30 minutes, and then it became like an average of 35, and I was like, that's perfect, 35 minutes. I thought that we said 20. Oh yeah, actually, that's true, because Magic Read-Along is roughly 20; a long episode for them is like 30 minutes. Have we ever had an episode
Starting point is 01:02:30 under 20 minutes? No, I think our shortest ever was like 23, and that was our impromptu episode on inclusive language, which happened inception-like, because we ended up recording it in the midst of another episode. I'm sure if we had set out just to talk about that from the get-go, it would have ended up being 40 minutes easily. All right. Well, we should probably shut up before this becomes unbearably long. Anyways, we will put the Godbolt link with the implementation of this algorithm in the show notes.
Starting point is 01:03:05 Now, I have done some verification that it's correct, but I don't know that I currently have any parallel implementation of the reduction that I need to actually test this. But don't worry, I will soon have somebody put this into Thrust, and then we'll be able to play around with it. Yep. Thanks for staying tuned in this long. Or, you know, potentially we're talking to nobody, and everybody turned us off at this point. That's also possible. But this podcast is really more to make us feel good than for the audience anyways.
Starting point is 01:03:48 Right. Thanks for listening and have a great day.
