Algorithms + Data Structures = Programs - Episode 232: Algorithms! Live from New York!
Episode Date: May 2, 2025

In this episode, Conor and Bryce chat about algorithms, overload sets, libraries and more, live from New York!

Link to Episode 232 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach

Show Notes
Date Generated: 2025-04-14
Date Released: 2025-05-02
Thrust Library
CUB Library
thrust::reduce_by_key
thrust::permutation_iterator
Clojure partition
thrust::transform_reduce
Haskell divvy
"Algorithm Selection" Blog (std::mismatch)
thrust::discard_iterator
C++ std::partition_copy
thrust::unique_count
thrust::tabulate
Hoogle Translate mapAdjacent
Hoogle Translate iota

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
Live from New York, it's Monday night.
A pine cone?
Is that a special thing in New York?
Oh, I was going to say, we got lots of pine cones in Canada.
Why are we talking about pine cones?
You know, I know the comedian who bombed tonight
tried to make a callback that failed,
but I've got a callback in this episode
that I think is going to be very successful,
which is the callback to the reduce by key
example that we gave earlier,
where you want to reduce each row in a matrix. How'd you feel about the squid fries? I expected potato
fries with the squid-like essence. Instead they were pieces of squid that
were fried. So I was very shocked. If I could get it french fries at this hour I would.
Instead of the big algorithm library, which is what we have today, we should have gone with a small, composable algorithm library.
We're okay with an if-then situation. We're if dog, then cat.
There, I tried to loop it back into technical things, because you guys refuse to talk technical.
Whoa, whoa, whoa, whoa, whoa, listen, folks. Welcome to ADSP the podcast episode 232 recorded on April 14th, 2025.
My name is Connor and today with my co-host, Bryce, we record live from New York, we chat about algorithms, overload sets, libraries, and more.
All right, go.
Are we recording now?
Yeah, you're not exactly walking in a straight line.
What?
This line looks entirely straight.
Should we do a sobriety test right now?
Yes, we definitely should.
Boom, boom.
Actually, I think I would pass to be honest.
We're here, potentially episode 232 on the streets of New York.
Do you actually know what street we're on?
Here, you hold the mic.
You'll be doing most of the talking.
I think we're on like 23rd and something.
We just left the New York Comedy Club after seeing, I'm going to say, two out of the five
comedians were really, really good.
No, I'm sorry, Shima. I'm sorry.
The guy who bombed...
He was awful.
Yeah, but him bombing was in and of itself amusing.
Just stop talking about pedophilia, folks.
If the crowd's not into it, change topics.
Don't keep on coming back to pedophilia.
And like, I've seen people bomb at that club before, and when they get desperate, they always turn to lewdness.
It's like anyways, why are we in New York, Connor?
We're here for, actually, I don't actually know the official name.
I'm going to do my best and then Bryce will correct me.
It is for a modern CUDA C++ training, a recently developed state of the art training
for targeting Nvidia GPUs.
How close was the title of, quote, modern CUDA C++ training, end quote?
I think that is what I would call it
because the actual official name,
which is like fundamentals of accelerated computing
and modern CUDA C++ is way too long.
That's a mouthful. But anyways, Connor is here in New York because, well, you weren't planning on being here.
When did you even, when did this even materialize?
So yes, to our global listeners, I was asked, today is April 14th, I believe.
Apologies to my meetup attendees
that I had to move a meetup
that we were supposed to have tonight.
I was asked initially one week ago on Monday
if I was available to help with the training.
And I said, if you really need me,
the answer is yes.
What is this?
USPS? Is that what you pointed at?
A pine cone. Is that a special thing in New York?
I was gonna say, we got lots of pine cones in Canada. Why are we talking about pine cones?
Anyways, I was asked a week ago. I said, if you need me, yes. But, once again, I have to say, folks, I'm sorry to disappoint some of our listeners, but if you can tolerate an opinion you might potentially disagree
with, we are temporarily trying to boycott the states. It's not going very well because
I'm going to be having three different trips down here over the next two months. This is
the first of the three, but yeah, things are a bit crazy. Anyway, so a week ago I said yes if you need me, but preferably no. It turns out they did need me. You notified me on
Wednesday, I got approval on Thursday, and here we are on Monday. And yeah, we're here with my
beautiful fiancée. Do you want to say hi, sweetie? Hello. What a lovely voice. And so yeah, we
were here for the training, and the training I think went phenomenally.
Well, and I should explain the like, I got asked to do this training like 10 days ago,
which is a little bit short notice.
And it's not, this is not like a training I've done before.
This was like an eight hour training, but I, you know, like I give a lot of talks.
If someone asks me to give a talk at work, I'm like, sure. But I was initially told there
would be like 20 people, and I was like, all right, that's fine, 20 people. We like to have a 10 to 1
ratio of instructors to students, so I'm like, I'll find one person in the New York
area. And so then I ask, there's a lot of people on my team who live
in New York, there's maybe four or five people, and I ask all of them,
and every single person, like, one person's got to take their dog somewhere,
one person's getting work done on the house, renovations, everybody had something. I was
like, uh-oh. And then we have some planning meetings, and somebody's like, oh yeah, there'll be 80 to 100 people. And I'm like, wait, that is a lot more people than I expected.
And so now like my search expands beyond New York and I'm asking like everybody
who I even remotely work within Boston.
And then I'm like, oh, Toronto's nearby.
So maybe I'll ask Connor.
Long story short, we ended up flying two people from Toronto in and then two
people from Boston came and then one person came
all the way out from California for like less than 24 hours to do this training.
But it turned out great.
We had a great time.
We taught people lots of things about CUDA C++.
Great questions were asked.
A fun time was had by all, and I have not lost my voice yet.
Professor Bryce is, maybe that'll be a thing.
Maybe that'll be a thing.
We'll see.
And I will say, they were phenomenal students,
if you will, I mean, they're all professionals,
so maybe calling them students isn't the right word,
but yeah, great questions.
And there was one really interesting question.
So one of the examples in the talk was,
this is a problem that, in a recent talk I gave on CUDA C++, I started talking about. I gave
an example of using some of the reduce by key algorithms.
We've talked about them on the podcast before, but the canonical example of
reduce by key is you want to reduce the rows of a row-major matrix.
Like you want to do one reduction for each row. So reduce by key, what it does
is it takes a sequence of keys as an input and then a sequence of values as an input.
And for each run of consecutive keys, so for each sequence of consecutive keys that are
all the same, it'll reduce that set of values. And so if you've got a row-major matrix and you want to reduce each row,
each row is contiguous. So the key that you use
is just the row that that particular element is in.
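(For readers following along, here is a minimal sketch of what is being described. It is not code from the episode, and the `row_of` helper name is made up: one sum per row of a row-major matrix with thrust::reduce_by_key, using each element's row index as its key, so every row is one run of equal, consecutive keys.)

```cpp
// Minimal sketch, not the episode's code; `row_of` is a made-up helper.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

struct row_of {
  int cols;
  __host__ __device__ int operator()(int linear_index) const {
    return linear_index / cols;  // row-major: consecutive elements share a row
  }
};

void row_sums(const thrust::device_vector<float>& matrix, int rows, int cols,
              thrust::device_vector<int>& row_keys,      // pre-sized to rows
              thrust::device_vector<float>& row_totals)  // pre-sized to rows
{
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), row_of{cols});
  thrust::reduce_by_key(keys, keys + rows * cols,  // one key per element
                        matrix.begin(),            // values
                        row_keys.begin(),          // reduced keys out (one per row)
                        row_totals.begin());       // one sum per row out
}
```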
But what if you want to do a reduce by key
of the columns? Well,
the columns are not contiguous, so the key sequence for the
column of each cell would be like 0, 1, 2, 3, 4, 5, 6, 7, 8. And reduce by key
only combines consecutive keys that are the same.
So if you have a bunch of keys that have the same, you know, column ID but they're
not next to each other in the key sequence,
it doesn't matter, they're going to be separate groups. And so someone
was asking, well how would you do reduce by key on that? And just there's, you
know, you could have a new algorithm like an unordered reduce by key or one that
like does a sort or something. But we were sort of theory crafting there and
you know, Connor, I was saying I think you can do it with reduce by key, you just need the right type of iterator, and
Connor, you suggested how it can be done. I mean, it's quite obvious, and also
it's obvious for me because in a research project I implemented an
algorithm that was called chunk fold, where chunk is the equivalent of the C++23
chunk that takes an integer, which is basically this, right?
You're doing a chunking by an integer size, and then folding afterwards is the equivalent of doing a
row-wise summation on a 2D matrix. And the way that I do that is exactly what you did. So the column-wise
one is pretty intuitive, and we have a fancy iterator in thrust called the permutation iterator, and so you could just build up that index permutation sequence
that gives you the columns.
Would it be efficient?
Definitely not.
But it would get the job done.
And so the idea here, I think, is that you would have the key sequence where you'd have all
the column IDs contiguous, and then the value sequence would be the thing that would be
the permutation iterator.
So in effect you're doing like a reduce that also does a transpose of the data.
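(A rough sketch of that idea, again not the episode's code and with hypothetical functor names: gather the elements in column-major order with a thrust::permutation_iterator so that equal column keys become consecutive, then apply the same reduce_by_key as in the row-wise case.)

```cpp
// Rough sketch of the column-wise variant; functor names are hypothetical.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

struct transpose_index {
  int rows, cols;
  __host__ __device__ int operator()(int i) const {
    // i walks the matrix column by column; map it back to row-major storage
    int col = i / rows, row = i % rows;
    return row * cols + col;
  }
};

struct col_of {
  int rows;
  __host__ __device__ int operator()(int i) const { return i / rows; }
};

void col_sums(const thrust::device_vector<float>& matrix, int rows, int cols,
              thrust::device_vector<int>& col_keys,      // pre-sized to cols
              thrust::device_vector<float>& col_totals)  // pre-sized to cols
{
  auto gather_map = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), transpose_index{rows, cols});
  auto values = thrust::make_permutation_iterator(matrix.begin(), gather_map);
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), col_of{rows});
  thrust::reduce_by_key(keys, keys + rows * cols, values,
                        col_keys.begin(), col_totals.begin());
}
```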
Like a reduce by key with a transpose built in there. And
it's actually interesting because there are more efficient ways to do a
transpose than with the permutation
iterator and so like maybe we need to have some sort of transpose iterator
that under the hood the library could recognize that it's a transpose and use
some efficient transpose mechanism, because the problem with doing the reduce, like the column summation, is that you're doing a gather.
So you're striding through memory with these big strides and that can be inefficient.
And so there may be a slightly more efficient way to do this.
I don't know whether there's like some way to write some transpose iterator that's going to be better or that would be recognized by the
underlying algorithms.
But it was a really good question that someone asked and now I want to go and I want to add it as another example to one of my talks.
Did you have any interesting encounters with people today? Did you learn anything new?
Did I personally learn anything new? I don't think so.
But I think my overall takeaway is that the folks
at this corporate entity that-
I don't believe you.
There's no way you didn't learn one new thing.
At least in the training.
I didn't learn one.
Oh, yes.
I mean, actually I did learn,
I thought you meant from the questions
and the interaction with the students.
But from the trainings, yes.
I mean, similar to one of the other trainee individuals, I did not know what the third
integer parameter of the triple chevron kernel launch is, and that was for dynamic shared
memory.
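(For reference, a generic illustration of that launch parameter, not one of the training's exercises: the third value inside the triple chevrons is the number of bytes of dynamic shared memory per block, which is what sizes the `extern __shared__` array inside the kernel. The kernel and names here are my own example.)

```cpp
// Generic example: the third chevron argument is dynamic shared memory bytes.
#include <cstddef>

__global__ void block_sum(const float* in, float* out) {
  extern __shared__ float tile[];  // sized by the launch, not here
  tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
  __syncthreads();
  // simple tree reduction within the block (assumes blockDim.x is a power of two)
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

void launch_block_sum(const float* in, float* out, int blocks, int threads) {
  std::size_t shmem_bytes = threads * sizeof(float);   // <-- the third parameter
  block_sum<<<blocks, threads, shmem_bytes>>>(in, out);
}
```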
Admittedly, I spend very little of my time writing raw kernels most of the time.
Well, you should not.
That's a bad...
You should not do that.
You should not do that because it's very hard to do that correctly, and there's libraries,
and you should just use libraries. Which is 67% of the point of the training: target Thrust, target CUB.
We have many libraries, very optimized by CUDA ninjas, that you can use so that you do not have to write your own kernels.
And so I don't want to throw our other TA under the bus, but someone asked, what is that third
triple chevron parameter?
And in my head I was like, yeah, I didn't get asked that question, but I would have
also had to ask and it was for dynamic shared memory.
So yeah, I definitely did learn.
That TA knew the answer.
He just thought that everybody in the room should know the answer.
And so definitely I learned a few things.
There were a few of the times I was actually
working through the training, because I had not
seen the training before.
I always forget that with NVCC, generic lambdas
are not supported.
So I naturally always reach for auto type
in the parameter list of a host and device annotated lambda
with NVCC, and that'll fail to compile.
It's a simple fix, but almost like every single time
for the first couple of examples,
I would write that and then be corrected.
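(A small illustration of that pitfall. This is my own example, and nvcc's exact restrictions vary by version and flags; the point is the shape of the fix, spelling the parameter type instead of `auto`.)

```cpp
// My own example (compiled with nvcc --extended-lambda): an `auto` parameter
// in a __host__ __device__ lambda may be rejected; a concrete type compiles.
#include <thrust/device_vector.h>
#include <thrust/transform.h>

void scale(thrust::device_vector<float>& v, float k) {
  // thrust::transform(v.begin(), v.end(), v.begin(),
  //                   [=] __host__ __device__ (auto x) { return k * x; });  // may not compile
  thrust::transform(v.begin(), v.end(), v.begin(),
                    [=] __host__ __device__ (float x) { return k * x; });    // the simple fix
}
```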
There was a few other things too.
I can't remember them all off the top of my head,
but yes, from the training itself,
like not in the first half,
definitely all of that stuff is my expertise.
But in the second half of the training, definitely there was some stuff I picked up.
I feel like there were other things we were supposed to talk about.
God, I have a list somewhere.
But after eight hours of talking and then a comedy show,
my brain is not sufficiently there to remember things.
But what else were we supposed to talk about?
I mean, we went to dinner. I mean, Shima was there.
We can get Shima's review on the comedy club.
Uh, you said what?
Two of the?
I think two of the five were good.
The there was, uh, the first and the last, the first, the first and the second.
The first guy who, who he had, he had a joke.
His last joke was about that or his last sequence of jokes.
There was this reference to these
tortoises being intimate, and the guy was wearing two or three jackets, and he kind
of looked like a tortoise, and I swear it was intentional.
I swear that man intended to look like a tortoise.
He was funny.
And then there was this tall Jewish fellow who had a very, very long face who was talking
about his crazy neighbor, and I just thought I thought he was the best I thought it was
hilarious. What say you, Shima? We're giving the mic over to my fiancée here. I don't know, not the first time, but you've been on the podcast a few times before. Here we go, we're sitting down, we'll try not to upset the other patrons.
Sharing that comedy club is crazy.
Nobody cares about what I think of a comedy.
There are many people that care.
Okay, let me see. So my review of the comedy show is that admittedly it's better than Toronto comedy.
We're very familiar with the Toronto comedy scene.
And I have to admit that the New York comedy is better.
So I'll give you that admission.
Other than that, there were five comedians.
Two women, three men.
Four out of five, I thought, were very funny.
And I think objectively, people would think it was funny.
There's just one guy that bombed.
Which is okay, that happens.
Yeah, nobody really related with his jokes.
And that's okay. But the other four were very funny.
So it was a good night.
Was it the other four? I mean, I know the last one, she was pretty good too.
Was the other one pretty good?
That's actually, the other one was very funny, yeah.
She was the one that was like,
yeah, like, people say like, don't you pray?
She's like, I meditate.
Oh yeah, yeah, yeah.
Just in the same direction five times.
They were all pretty good.
That woman's one good joke.
One good joke.
Wait, before I move over, because we're separated by this big white table here.
And we've got to record for at least like 50 more minutes to get a full episode out of this.
What else would you like to say?
We went to a Korean restaurant tonight with some other Nvidia folks.
A lovely dinner.
We talked shop.
A lot of talk about the future,
the plans for the training, the company.
The singularity.
The singularity, yeah.
What was your review of the dinner?
It was a great dinner.
What else is there to say?
Lovely food, lovely company.
How'd you feel about the squid fries?
I expected potato fries with a squid-like essence.
Instead, they were pieces of squid that were fried.
So I was very shocked.
And if I could get french fries at this hour, I would.
And this is totally off-topic for this podcast,
but we had two different individuals
talk to us about our potential future acquisitions
of pets, specifically dogs, and a little bit cats.
Have we, has our, as a, you know, partnership, has our view evolved of what we will be acquiring?
We're okay with an if-then situation. We're if dog, then cat.
There, I tried to loop it back into technical things because you guys refused
to talk technical.
Whoa, whoa, whoa, whoa, whoa. Listen, listen folks. That's a very technical podcast.
I did my part. I did my part.
Yeah, also, you know, Bryce is known for, you know, we are Mr. and Mr. Tangent, you know. How many episodes have we had, Bryce? All right, say goodbye, sweetie, I'm gonna go over there
and we're gonna continue recording.
Goodbye.
That's such a lovely voice.
We're meanderers, folks, we're meanderers.
How many times, Bryce, have we started recording an episode
and like two minutes in,
we went in a completely different direction
and the first two minutes was,
this is what we're gonna talk about,
and then the remaining 28 minutes is we talked about absolutely nothing to do
with what we talked about in the first two minutes. How many times? I think it's been a lot of times. I think that's my fault, I think that's on me for the most part. Oh gosh, I'm trying to think about what else we're supposed to talk about. Oh yeah, we can spin this into the topics that we had before, that we never got to on the previous recording. So anyways, this is going to be one episode, folks.
We're going to, we're going to call this, I don't know, New York, New York plus something
else.
The training was great.
Hope to do this again in the future.
And I think the, yeah, the folks at the training, they, they loved it.
I think they got a lot out of it, and they were great. Spinning back to a couple episodes
ago, the last time we recorded, we ended up doing the Tesla,
or not the Tesla, and talking about AI a bunch,
and we didn't get to overload sets.
Do you wanna talk about overload sets right now?
Or do you wanna talk about the stuff that we were
supposed to bring up from earlier today?
Because we can take this in two different directions.
I mean, that was one of the topics that you want.
What?
I don't know, I think there was a few other things
that you wanted to talk about from earlier today but I'm not sure which one...
Let's talk about the overload sets. I feel like I'm slowly... oh god, I'm turning into such a C++
personality. I feel like I'm slowly, over the course of many years, assembling
content for a book, which I know, like every person once they reach a certain age in the C++
community starts writing a book. But I've been thinking more about, like, design.
Oh, excuse me, excuse me, did you guys forget a phone? Oh, Connor just noticed
that somebody left a phone. Connor the hero, the hero. So, alright, I will... I hate to do this to my co-workers.
I hate to do this to my co-workers. And I'll preface this by saying I don't mean to shame
anyone, okay? I don't mean to shame anyone. I'm not, This is not a tactic to push forward any changes. However, I'm
going to explain a situation. So we have an interface. There's this method on a...
Should we move to these couches over here? We can move to those couches. We'll tell the
story about the cookies. All right, I'll tell the story about the cookies now. So we were
at this training all day. I was doing the lecture part, and then there
were these instructors who were going out
for the interactive part.
So I was talking a lot.
And I was told at lunch, well, first of all, at lunch,
I picked a box lunch that did not have a cookie in it.
And I was sad, so then I picked a different box lunch
that had a cookie in it, but that cookie was not
particularly good.
And then I was promised later in the day
that there would be a cookie bar.
And you know what happened?
Connor, was there a cookie bar?
There was not.
And the patrons that were attending the training did get live updates
when we came back from break and there was no cookie bar.
But then one of my co-workers who had come in from Toronto, Ashwin,
he asked where I would get cookies from.
And I mentioned, well, there's, you know from and I mentioned well there's you
know Levine, Levine's, there's Mamon and a couple other places and then he offered
to order cookies for the whole team and I was a little hesitant but I was like you know what
I could use a cookie so we got these Levine cookies and these Levine
cookies they are they're not really cookies, they're like cakes. They're like big, thick things.
It's like a thousand calories.
But I ate one of them nice and warm, and it was just perfect.
Anyways, back to Overload Sense.
So yeah, so this, again, apologies to my coworkers.
I don't mean to make anyone feel bad.
But there's this method on
a class and it's called wait.
And there's two different overloads of wait.
One overload takes no argument and the other overload takes an object.
And the first overload of wait, when you call it, it blocks the caller.
The second overload of wait does not block the caller.
It tells the object, you object should wait upon this thing that I'm passing to you.
So like the object that you're calling it on is this thing that we call a stream.
And so the first overload of wait that takes no argument, it says,
hey, I want to block until this stream of work is finished.
And then the second overload of wait says, tell this stream of work
to wait on another stream of work. But it does not block the caller.
And I think that this is not the correct design,
because I could understand if you just had the second semantics.
If you had the function, you called it wait,
but it didn't block, it just told the thing to wait upon some other thing.
But I cannot abide having two overloads of the same thing,
where one of them will block the caller,
but the other one will not block the caller.
I think that this is confusing, and I
think that this is not good interface design.
And I feel that a good overload resolution set
should have the same semantic behavior,
and for anything that has a subtlety, like for example, whether
a parameter is modified, whether or not the function blocks, things like that, anything
that's got caveats that requires nuance to understand should be the same across all of
the members of that overload set. And if two different overloads in this overload set
have some subtle nuanced differences,
like one of them synchronizes and one of them does not,
then those should be two separate functions.
Or if one of them modifies a parameter
and one of them does not modify that parameter,
those should be two separate functions. And I feel like there is some fundamental design
principle here that belongs in the Book of Bryce API design.
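(The library in question is deliberately left unnamed above, so here is a purely hypothetical sketch of the shape being criticized and of the split being suggested; none of these names are a real API.)

```cpp
// Purely hypothetical sketch; not the real interface.
struct event;

struct stream {
  // Problematic: same name, different semantics.
  void wait();                // blocks the *caller* until this stream drains
  void wait(const event& e);  // does NOT block; enqueues "this stream waits on e"

  // Suggested instead: different semantics get different names.
  void sync();                        // blocks the caller
  void enqueue_wait(const event& e);  // non-blocking dependency
};
```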
The Book of Bryce. I look forward to reading that.
Honestly, that is what I would call it, I'm sure, because, you know, my ego is very large.
Yeah, I mean, I like it. I like it.
I mean, I don't think anyone would expect anything less.
Honestly, I have nothing to add.
I completely agree.
And when I was trying to brainstorm in my head,
what are the good overloaded functions or APIs
that I can think of?
Like, the one that came to mind is from Clojure.
It's a partition function that is kind of a sliding window or chunking function where it takes a sequence and chunks it into
different sequences. I don't remember what the default behavior is. I think
it's a chunking behavior, but it has overloads of it which lets you set the
stepping and like the windowing size, you know, and you know, it's basically that
the API that has the fewest parameters just has defaults basically for the others. So
the behavior is achievable by the one that has an overload that has more parameters.
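(A hypothetical C++ rendering of that shape, not an existing library: the shorter overload is literally the general one with a default filled in, so its behavior is always reachable from the general spelling. Assumes size and step are positive.)

```cpp
// Hypothetical names; a sketch of "shorter overloads are just defaults".
#include <cstddef>
#include <vector>

// General form: windows of `size`, advancing by `step`.
std::vector<std::vector<int>> partition(const std::vector<int>& xs, int size, int step) {
  std::vector<std::vector<int>> out;
  for (std::size_t i = 0; i + static_cast<std::size_t>(size) <= xs.size();
       i += static_cast<std::size_t>(step))
    out.emplace_back(xs.begin() + i, xs.begin() + i + size);
  return out;
}

// Chunking is just step == size; a sliding window would be step == 1.
std::vector<std::vector<int>> partition(const std::vector<int>& xs, int size) {
  return partition(xs, size, size);
}
```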
And so definitely, what you're talking about here, it's like you can't get to the
first wait function from the second one, because it's just completely different semantics. So yeah, I mean, you're
telling me this and it just sounds like... so I like that rule,
that if the overload has additional parameters, and some of them are
defaulted, you should be able to get to
the most defaulted version from the more explicit versions. I think that that is another good rule
to go into the book of design. Yeah, I like that rule a lot. Well, there's another rule that I
remember when we were designing my first committee paper, the binary-binary transform reduce. Or maybe it was one of the scan algorithms,
but it was one of these algorithms that takes a bunch of
different parameters, like it takes two operators and then an initial value. And there was some question about the ordering,
whether it should be like operator one, init, operator two, or like init, operator, operator, and
I thought that init, operator, operator made more sense. But STL, not the library, but the person who maintains the Microsoft standard library.
He made this argument that when you have these additional
parameters for new overloads, that they should be additive,
that the order should be the same.
If you have a three argument version and a four argument
version, that the position of the arguments should not move.
Like if the third argument should not become the fourth argument in the next overload.
That if it was the third argument in the version that had just three arguments, that it should
remain the third argument and that the fourth argument should be added.
And that was why you should have this somewhat unintuitive ordering where it's like operator, init,
operator, instead of having the init come before both of
the operators.
And I thought that that made sense, too.
And I guess it helps with overload resolution to avoid
ambiguities in some cases, too.
But his argument was not about the disambiguating.
His argument was just that this is the more natural way
to have it be. And that was compelling to me.
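(A small illustration of the rule using C++17 std::transform_reduce, whose signatures do follow it: the first arguments keep their positions, and the operators are appended after the init rather than the init moving around.)

```cpp
// Illustration with the C++17 std::transform_reduce overloads.
#include <functional>
#include <numeric>
#include <vector>

int demo(const std::vector<int>& a, const std::vector<int>& b) {
  // 4-argument form: defaults to plus/multiplies (an inner product).
  int dot = std::transform_reduce(a.begin(), a.end(), b.begin(), 0);
  // 6-argument form: the first four arguments are unchanged, operators appended.
  int dot2 = std::transform_reduce(a.begin(), a.end(), b.begin(), 0,
                                   std::plus<>{}, std::multiplies<>{});
  return dot + dot2;
}
```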
What's interesting about that is, I agree with what you're saying, but I
know the exact thing you're talking about, and that is that the thrust
transform reduce and the C++17 std transform reduce have a different
parameter order.
And I'm almost positive that thrust is the one that does binary operator, init, binary
operator, but the std transform reduce from C++17 does init, operator, operator.
Which, it is possible that younger Bryce was stubborn and at the time fought for and successfully advocated
for the order that made more sense to him, and only later, retroactively in my mind,
I came around to agree with STL.
And it's also possible that we only realized the mistake too late and that we weren't able
to fix it.
Well, because, I mean, an init's always necessary.
So I thought the one that was...
I think this may have been in the order of maybe the parameters for the scan because
inclusive scan and exclusive scan, because in exclusive scan, you always have to have
the initial value, whereas in an inclusive scan, it's optional. So in an exclusive scan, the order is inputs, outputs,
initial value, operator.
And then for inclusive scan, it's operator, initial value.
Because you can have the operator without an initial value with inclusive scan,
but for exclusive scan you always have to have the initial value.
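(Checking that recollection against the C++17 <numeric> signatures: exclusive_scan's mandatory init comes before the operator, and inclusive_scan's optional init comes after it. The output vector in this generic example is assumed to be at least as large as the input.)

```cpp
// The C++17 <numeric> signatures reflect exactly that ordering.
#include <functional>
#include <numeric>
#include <vector>

void scans(const std::vector<int>& in, std::vector<int>& out) {
  std::exclusive_scan(in.begin(), in.end(), out.begin(), 0, std::plus<>{});  // init, op
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{});     // op only
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, 0);  // op, init
}
```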
So it might have been something around one of the scans or maybe around the transform scan algorithms,
the compound transform scan algorithms, that may have been it,
that the transform inclusive scan and transform exclusive scan.
I remember it was something with two operators, and the transform inclusive scan and transform
exclusive scan have two operators, because they have the scan operator and they have
the transform operator.
Okay, yeah, that makes sense.
If you're talking about the scans, I was thinking about the transform reduces.
Somebody could go to like the meeting minutes from like 2017 C++ committee meetings and find when STL made these points.
It would have been in like the library minutes from very early committee days.
But I think that the key point is that this argument is not the specific history here,
but this argument that STL made, that if you have an overload that takes two
arguments, and an overload that takes three arguments,
and an overload that takes four arguments,
that it's additive,
that the first two arguments are the same
in all of these overloads.
That if it's the same argument,
you should not move around its position
across different overloads.
And I think that was compelling to me.
Yeah, I mean, that completely syncs with my sense
of, like, you should be able to replicate... for instance, the same kind of partition function
in Clojure that leads to a couple of these sliding chunking functions. Haskell only provides you the
most generic version, where you provide a step value and a window value,
and you can spell the stride, chunk, and slide versions by specifying one, or a value of n, for either of the parameters.
But the point being is, at the generic level, you can spell everything.
You know, it's like when I posted that blog post once that talks about how std mismatch is
the most generic version of a bunch of, you know, is sorted and whatnot.
What?
Like sometimes you don't need two sequences.
You just need one sequence.
And so you can ignore the second iterator. You know, you're kind of bending, what would we call it, you're bending the algorithms in a perverse way.
That's not how they're intended to be used,
but you can spell a single iterator reduction
with a two iterator algorithm
by just ignoring one of them or discarding one of them.
And it's this idea that at the top level,
you wanna be able to spell everything.
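(A sketch of that idea, reconstructed here rather than quoted from the blog post: a single-range question like is_sorted_until spelled with the two-range std::mismatch by comparing the range against itself shifted by one.)

```cpp
// My reconstruction, not the blog post's code: is_sorted_until via std::mismatch.
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

std::vector<int>::const_iterator sorted_until(const std::vector<int>& v) {
  if (v.size() < 2) return v.end();
  auto mm = std::mismatch(v.begin(), std::prev(v.end()),  // a[i]
                          std::next(v.begin()),           // a[i+1]
                          std::less_equal<>{});           // stop where a[i] > a[i+1]
  return mm.first == std::prev(v.end()) ? v.end() : std::next(mm.first);
}
```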
And so anyways, getting back to the wait API,
if you've got different semantics
and it's doing different things, yeah,
I have nothing to add to that other than that.
I agree.
If it's got different semantics, that
sounds like it should be two different APIs, right?
And what you were talking about about discarding there
made me think of, we have this special iterator in Thrust called a discard
iterator.
And it's something that I don't think we ever really see.
We don't have in the standard library.
And it's because there's not really algorithms that need it.
And the reason is, I don't think there's
any algorithms really in the standard library that
take two outputs, are there?
There is only one, I think.
Actually, let's play a game.
It came up in a past episode, probably the listeners might know.
There's only one algorithm that outputs to two different output iterators.
Can you name it?
Probably not.
Is it one of those set ones or something?
Or like the partition or something?
It is partition copy. So partition takes a unary predicate, and everything that passes it writes to the beginning, and then
everything that fails writes to the end, so that you end up with a partitioned, kind of predicate-sorted
sequence. Partition copy does the same thing, but it's just a linear pass. Anything that passes writes to the first output.
Anything that fails writes to the second output.
Other than that, though, so yeah, potentially,
if you only care about the first predicate-passing
elements, that's just a copy_if.
You don't need to use a partition copy
with a discard iterator on the second output iterator.
We have an algorithm for that, which is why it probably doesn't exist.
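(For completeness, a generic example of that one two-output standard algorithm, std::partition_copy; this is my own example, not one from the episode.)

```cpp
// Generic example: elements that pass the predicate go to the first output,
// the rest to the second.
#include <algorithm>
#include <iterator>
#include <vector>

void split_even_odd(const std::vector<int>& in,
                    std::vector<int>& evens, std::vector<int>& odds) {
  std::partition_copy(in.begin(), in.end(),
                      std::back_inserter(evens),  // passes the predicate
                      std::back_inserter(odds),   // fails the predicate
                      [](int x) { return x % 2 == 0; });
}
```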
And, you know, I know the comedian who bombed tonight tried to make a callback that failed,
but I've got a callback in this episode that I think is going to be very successful,
which is the callback to the reduce by key example that we gave earlier, where you want to
reduce each row in a matrix.
And these by-key algorithms in Thrust that take a keys
input and a values input, they take two outputs, a keys
output and a values output.
And this is useful because in some cases, you want to keep the association of this
was the key for this group, this group of values
that I reduced, this value that's the aggregate of this group.
I want to know what the key is.
In the case of the row-wise reduction,
if our key is what row am I on, then in this output vector,
the index of each element in the output vector
corresponds to what the row was.
And so you don't need the key's output here.
And you don't want to write the key output to somewhere in memory if
you don't need it, because that's gonna waste memory bandwidth and it's
gonna waste storage. And so thrust has this discard iterator, and what a
discard iterator does is, when it's dereferenced, it returns this proxy
object that, when you assign to it, just does nothing.
So it just ignores the write.
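(The callback, completed: the earlier row-wise reduce_by_key sketch, now with thrust::discard_iterator so the keys output is never written to memory. As before, this is my own sketch and the `row_of` helper is made up.)

```cpp
// Same row-wise sketch as earlier, with the keys output discarded.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

struct row_of {
  int cols;
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

void row_sums_no_keys(const thrust::device_vector<float>& matrix,
                      int rows, int cols,
                      thrust::device_vector<float>& row_totals)  // pre-sized to rows
{
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), row_of{cols});
  thrust::reduce_by_key(keys, keys + rows * cols,
                        matrix.begin(),
                        thrust::make_discard_iterator(),  // keys output: dropped
                        row_totals.begin());
}
```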
And because we have these by-key algorithms in Thrust,
and we also have a couple other types of algorithms
that take two outputs, and because of these,
we have this discard iterator in Thrust.
And I think it's most commonly used with the by-key algorithm.
So there's a few other places where it's used.
But until we introduce by-key algorithms
to the standard library, it's not something
that we would need.
But this is one of the fancy iterators
that we have in thrust that we showed in the course today.
But that's just not something that's
in the standard library yet.
Because there's a bunch of algorithms in thrust
that aren't in the standard library that really ought to be like these by-key algorithms
that let you do these batched operations
and are really quite powerful.
Well, I mean, we're gonna have to revisit this topic
because, I mean, it's been too long
that I've been working on the stuff
that I've been working on that hasn't been published,
but it is.
Imminently.
Imminently this year, especially after the most recent stuff
that I've been working on.
It's gonna be out in the wild.
People are gonna be able to use it,
but two of the motivational examples that I talk about
are Unique Count, which we spent much time in Venice
many years ago.
Was it many years ago?
We don't know, but it feels like a long time ago.
But Unique Count, which is really just a zip iterator with a not equal to passed to a
reduce, which ultimately I'm not sure if we actually ever revisited that topic.
But anyways, unique count is a hand-rolled specialization of fancy
iterator plus algorithm. Transform reduce, a hand-rolled, similarly spelled version of a fancy iterator,
transform iterator, plus an algorithm reduce.
And I realized there's a third one today in Thrust
and it is tabulate, which is just a counting iterator,
AKA IOTA for my array language fans out there,
plus a transform.
And it's one of my big motivating points
is that we have these hand-rolled bespoke algorithms
that are a combination of a fancy iterator and algorithm.
But the algorithms that we provide you that do these
are limited.
It's tabulate.
It's unique count.
It's transform reduce.
And then outside of that, you have to go and compose them
yourself.
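(A paraphrase of the three equivalences just listed, not Thrust's actual implementations: each named algorithm is an ordinary algorithm fed a fancy iterator. The functor names are hypothetical, and the unique-count line assumes a non-empty input.)

```cpp
// My paraphrase of the three equivalences; not Thrust's implementation.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

struct neighbors_differ {
  template <class Tuple>
  __host__ __device__ int operator()(const Tuple& t) const {
    return thrust::get<0>(t) != thrust::get<1>(t) ? 1 : 0;
  }
};
struct square { __host__ __device__ float operator()(float x) const { return x * x; } };
struct twice  { __host__ __device__ int   operator()(int i)   const { return 2 * i; } };

void compositions(const thrust::device_vector<int>& keys,
                  const thrust::device_vector<float>& vals,
                  thrust::device_vector<int>& evens) {
  // unique_count ~ reduce over a zip of (element, next element) compared for inequality
  auto pairs = thrust::make_zip_iterator(
      thrust::make_tuple(keys.begin(), keys.begin() + 1));
  auto differs = thrust::make_transform_iterator(pairs, neighbors_differ{});
  int unique = 1 + thrust::reduce(differs, differs + (keys.size() - 1), 0);

  // transform_reduce ~ reduce over a transform_iterator
  float sum_sq = thrust::reduce(
      thrust::make_transform_iterator(vals.begin(), square{}),
      thrust::make_transform_iterator(vals.end(), square{}), 0.0f);

  // tabulate ~ transform over a counting_iterator
  thrust::transform(thrust::make_counting_iterator(0),
                    thrust::make_counting_iterator(static_cast<int>(evens.size())),
                    evens.begin(), twice{});
  (void)unique; (void)sum_sq;
}
```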
Whereas we could be providing you with an API that does this all behind the scenes and
makes it a lot easier.
Like it's still a fantastic resource.
Don't go write your kernels by yourself.
Reach for this stuff.
But I think in the future we could be providing you a more ergonomic way to spell this stuff
that does it automatically behind the scenes.
Stay tuned 2025.
Hopefully going to get published.
So one of the reasons why we do the composed operations, one of the arguments
for things like transform reduce and transform inclusive
scan and transform exclusive scan, is that it's more efficient on some
platforms, some backends, for the parallel version to have these fused algorithms
instead of being able to... well, okay, the argument at the time, this is before we
had ranges. So the reason that we had like transform reduce was because it's more efficient
than doing a transform followed by a reduce and we didn't have a transform iterator at
the time. I think that was the argument. And there's also this claim that maybe thrust
should have a tabulate because maybe there's
some faster way under the hood that we could do transform
plus counting iterator.
I think in practice, it's not.
I think in practice, it just maps down to that.
But certainly, there were some cases
where there's something that could be expressed
as composition of other algorithms
where there's a more efficient way to do it.
However, and there's an argument that we
don't know what all of those cases are today,
and that for common use cases, it
makes sense to have a named algorithm so that maybe we come
up with some faster way of implementing it.
And there's a bunch of cases in thrust
where we know that there's a faster way of implementing
the algorithm, but we implement it through composition today
just because it's not been an important enough use case
to implement the faster path.
Even despite all of that, I think
that there's an argument to be made that instead of the big
algorithm library, which is what we have today,
we should have gone with a small, composable algorithm library.
And one of the reasons I feel this way
is that if you think about the surface
area of the algorithm library as being just
like a couple of key entry points that are in composition
on top of them, one, I think it makes it easier
to port to interesting new targets and back-ends,
especially for parallelism. But also, I think it would give people a stronger intuition about how
to compose. Like, if instead of having count, you just had... and instead of having max element,
you just had reduce with max.
I think that if people had to write reduce with max,
instead of writing max element,
people might think more about composition
in customizing algorithms
with composable iterators and ranges.
And that might mean that they would be more likely
to use reduce with some compositional thing
instead of writing their own function.
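(A tiny generic example of that suggestion: max_element spelled as a reduce with a max operator. Note it yields the value rather than an iterator to the position.)

```cpp
// Tiny generic example of "reduce with max" instead of max_element.
#include <limits>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

float max_value(const thrust::device_vector<float>& v) {
  return thrust::reduce(v.begin(), v.end(),
                        std::numeric_limits<float>::lowest(),
                        thrust::maximum<float>{});
}
```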
And so maybe we made a mistake in going with
the big algorithm library and that a smaller set
of composable primitives would have taught people
a different programming model that would have made them
think more about how to compose things and that people would spend less time writing raw for loops and reinventing
the world.
I 1000% agree with this, and I have thought a lot about, like, introducing this kind
of library, the pedagogy around it. And you know, there are these high-level generic algorithms, scan, reduce, map.
But there are other algorithms in that category.
And literally in the docs, I'm not
sure if you noticed on the hyperlinks
on the side of that screenshot I sent you,
there's a star next to, like in each category, one index maps.
There's a star at the top with map.
And then everything else is just a specialization of that.
And there's certain algorithms that like fall by the wayside
that, you know, are on my Hoogle Translate website:
map adjacent, which is what it's called in Haskell.
It was initially called adjacent difference in C++11.
Then it's called adjacent transform in C++23.
It's called prior in q.
It's called zipWith in Haskell.
It's this algorithm that falls by the wayside, and so people don't even recognize it by name.
But the point being is, teach people the generic version, give them specializations,
but let them know that if we're not providing you the specialization, it's just a custom
binary operation that you can pass to one of these things.
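(A rough C++23 illustration of that family; this is my own example and it needs a C++23 compiler: the named specialization, views::adjacent_transform, is exactly "pair each element with its neighbour and apply a custom binary operation".)

```cpp
// My own example; requires C++23 (std::views::adjacent_transform).
#include <ranges>
#include <vector>

std::vector<int> deltas(const std::vector<int>& v) {
  auto diffs = v | std::views::adjacent_transform<2>(
                       [](int a, int b) { return b - a; });
  std::vector<int> out;
  for (int d : diffs) out.push_back(d);  // adjacent differences of v
  return out;
}
```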
I see, she must turn and leave.
Do not leave.
I'm gonna put this in the episode.
Do not leave me.
Do not leave me.
I thought she was happy, folks, with her head on my chest
and getting head pets, but she could not suffer through one more word uttered about binary operations and algorithms.
So we're gonna wind this down in the next...
We said two, but I'm gonna extend it to three.
I mean, I think there's maybe a middle ground
where we could have just explicitly defined the algorithms that
are compositions as saying, these are compositions. And in fact, in thrust,
that's often how we define things. When we're teaching
thrust, or in the documentation, we explicitly say,
in fact, this is equivalent to that, and in fact it is implemented that way.
Like I commonly, when I'm talking about transform reduce in thrust, I often say,
yeah, in thrust, the speed-of-light implementation of transform reduce is
just reduce called with a transform iterator.
We do not customize the transform reduce, you know, algorithm for the CUDA backend.
We just rely on the generic one that just calls reduce with a transform iterator.
And I think telling people that, they can still
call the easy transform reduce version,
but being able to tell them that, I
think it makes people really get the compositional aspect.
It's the most important algorithm,
and we rely on the composition for that speed of light code
path.
Something that's crazy, though, and hopefully this
doesn't go to four minutes when I, we initially said two,
but then I extended it to three, but it's that,
I see, the love of my life is nicking my leg,
telling me that she wants to go brush her teeth.
We're gonna leave it all on the pod, folks.
is that tabulate is the name for counting iterator,
which is a poorly named iota, so like, the name in C++.
You think, wait, you think the counting iterator is a poorly named iota?
I think iota is a poorly named counting iterator.
The point is there's disagreement.
So here's what's crazy, folks.
In C++11, it's called iota.
In thrust, it's called sequence.
As an iterator in thrust, it's called counting iterator.
So we've got sequence, iota, counting iterator.
And then if you add the counting iterator to the transform algorithm you get tabulate. So
now we've got iota, we've got sequence, we've got counting, plus transform, equals
tabulate, and on top of that in rapids the counting plus transform iterator
was such a common pattern we created a new iterator called
the counting transform iterator.
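(A quick demonstration of the naming zoo just listed, as my own sketch: four spellings of "fill with 0, 1, 2, ..." that all do the same thing.)

```cpp
// My own sketch of the naming zoo: sequence, tabulate, counting_iterator, iota.
#include <numeric>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/tabulate.h>
#include <thrust/iterator/counting_iterator.h>

struct identity_int { __host__ __device__ int operator()(int i) const { return i; } };

void four_iotas(thrust::device_vector<int>& a, thrust::device_vector<int>& b,
                thrust::device_vector<int>& c, std::vector<int>& d) {
  thrust::sequence(a.begin(), a.end());                  // thrust's name for it
  thrust::tabulate(b.begin(), b.end(), identity_int{});  // counting + transform
  thrust::copy(thrust::make_counting_iterator(0),        // the fancy-iterator spelling
               thrust::make_counting_iterator(static_cast<int>(c.size())),
               c.begin());
  std::iota(d.begin(), d.end(), 0);                      // the standard's name
}
```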
And all of this is to say is that like,
yes, I agree that hard coding some of this,
if you got a happy path, whatever,
but the point being is you wanna do
all the different variations.
So just build the composable version.
Chef's kiss.
Final words, we're signing off.
We're in New York, we're at the Evelyn Hotel.
I'm not sure how Bryce is getting home. What are your final thoughts Bryce? Tell the people what they need to hear.
Well, I was just wondering about like, you know, okay,
what's the archaeology of why do we have sequence and tabulate and counting iterator?
And I realized, and this is a great cliffhanger, I realized we could get answers to all these questions,
because our co-worker Jared created
the library that we talk about all the time.
We could talk to Jared.
We need to talk to Jared, and we're also going to bring on Georgie, who gave, in my opinion,
what might have been the best talk (it's up there with Bryce's) from the GTC 2025 conference. We've got to get Georgie on, because of this new
hands-on: start with Thrust, go to CUB, write your own kernel, and how do you write your
own kernel that matches the speed of light that's offered by Thrust and CUB. We're going
to have them on. We're going to talk about it. Links in the show notes. Final thoughts
once again, signing off.
I was able to hold my sneeze in until you finished. All right, Bryce needs to sneeze. We're going to see you. Live from New York, it's Monday night. And she's asleep now. She has been a super trooper. Bye.
Be sure to check these show notes either in your podcast app or at adspthepodcast.com for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.