Algorithms + Data Structures = Programs - Episode 256: 🇩🇰 Algorithms: Replicate, Scatter, Gather & RLD (Part 2)
Episode Date: October 17, 2025
In this episode, Conor and Bryce record live from Denmark! They talk about the replicate, scatter, gather and run length decode algorithms!
Link to Episode 256 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter
Date Recorded: 2025-09-20
Date Released: 2025-10-17
thrust::gather
thrust::scatter
thrust::permutation_iterator
thrust::counting_iterator
thrust::sequence
thrust::transform_iterator
thrust::copy_if (stencil overload)
parrot::replicate Implementation
JAX
thrust::reduce_by_key
cub::RunLengthDecode
APL Wiki Replicate
ArrayCast Episode 110: Implementing Replicate
Row-wise Softmax in Triton
Row-wise Softmax in Parrot
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
For anybody who's familiar with numpy style broadcasting,
where when you're working with two arrays of different arity...
It's arity, not arty?
Oh, screw.
Arty sounds like Ariana Grande's cousin.
All right, we're cooking here.
We might have explained the scatter completely wrong.
We're just two men sitting here talking into ChatGPT.
What I should have been saying is gather,
but I'm not even confident about that
because my confidence is shot because I totally messed up the description of an algorithm.
Get ahold of yourself, man.
We're cooking here. And so once again, I shall tilt at windmills on my quest to figure out
whether you can do this with a single-pass algorithm. Is it poorly named? Maybe it's actually,
it's actually well-named, and scatter is poorly named. The point being is I don't like the asymmetry
between scatter and permutation iterator. When you launch a CUDA kernel, you have to specify how many
blocks you're running with. This, this microphone's a little bit waterproof, right? I don't know.
Okay.
It's starting to rain now, folks. We might need to kill it.
Welcome to ADSP, the podcast, Episode 256 recorded on September 20th, 2025.
My name is Conor, and today with my co-host, Bryce, we continue to record live from the streets of Copenhagen.
We continue our conversation from last week about the replicate algorithm and two other algorithms, scatter
and gather.
All right, folks, so right off the bat, I do need to apologize.
In this video, for the majority of it, I refer to the scatter algorithm, when really what I meant
was gather, and I got incredibly confused.
We rectify this at some point in the conversation.
Scatter is an algorithm that exists in thrust, doesn't exist in the C++ standard library,
but I'm sure it exists in other libraries and languages.
You're given a sequence and, like, your data sequence,
which could be, you know, numbers, whatever, any type.
And then you're given, I believe it's indices.
And I'm going to feel bad if this is incorrect.
But I'm almost positive.
It's a sequence of integers that correspond to indices that are between zero
and the length of your data sequence minus one.
And, or that's what the values of the indices have to be.
Obviously, it has to be a valid index.
and the length of your
I believe it's what they call
an index map
the length of that
all right perfect
let us sit
is this going to be wet though
although this is a
it's a theater
which is not open
it's clearly a restaurant
but hopefully no one will come in to serve us
although this is wet as well
these ones are dry
well these can't possibly be wet
but these tables are not clean
so I think this is totally fine
no one's going to come and serve us here
a little dark
but that's okay
we got a nice view
so the length of your
so you've got your data
and you've got an index map
that can be any length you want
so if the length of our data is three
ABC
we can have
an index map that consists of just a single
one and what it does
is it scatters
the element at index one
to your output sequence
so if you scatter
if you
so scatter
you've got your data
and you've got an index map, which is a sequence of indices.
And scatter basically goes, it iterates through the indices and then copies them to...
So it's like a copy_if, except that a copy_if just takes a predicate,
and then it's a stable copy where the relative orders are the same, obviously,
because you're just getting rid of some of the values, and then anything that corresponds...
This is not how permutation iterator works.
I'm getting to permutation iterator.
I'm trying to explain... for those that aren't familiar with thrust::sequence:
thrust::sequence is the equivalent of std::iota, the algorithm in C++.
It generates incrementing integers starting from a value, typically zero.
And in C++20, we added views::iota, which does the exact same thing, just lazily.
In Thrust, the two names for these things are thrust::sequence and thrust::counting_iterator.
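As a quick illustration of that relationship, here is a minimal sketch, assuming Thrust is available; the function and variable names are just for this example:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>

void iota_two_ways()
{
  // Eager, like std::iota: fills v with {0, 1, 2, 3, 4}.
  thrust::device_vector<int> v(5);
  thrust::sequence(v.begin(), v.end());

  // Lazy, like std::views::iota: *(c + i) == i, no storage touched.
  thrust::counting_iterator<int> c(0);
  (void)c;  // silence unused-variable warnings in this sketch
}
```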
The same way that there's a relationship between sequence and counting iterator,
there's a relationship between scatter and the permutation iterator.
So scatter takes your data and an index map and it copies from the indices and the values
that correspond to those to an output sequence.
So if you've got ABC and 1, you just get the character B.
If you've got ABC and 2-1-0, you end up reversing your sequence.
If you have ABC and 0-1-2-2-1-0, your output sequence is ABCCBA.
You're basically, you know, it's pretty straightforward.
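A toy sketch of those examples, assuming Thrust, and, per the correction in the intro, spelled with thrust::gather, since the operation described reads data through the index map (the function name is illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>

thrust::device_vector<char> pick(thrust::device_vector<char> const& data,
                                 thrust::device_vector<int>  const& map)
{
  thrust::device_vector<char> out(map.size());
  // out[i] = data[map[i]]
  thrust::gather(map.begin(), map.end(), data.begin(), out.begin());
  return out;
}
// With data = {A,B,C} (0-based indices): map {1} -> {B};
// map {2,1,0} -> {C,B,A}; map {0,1,2,2,1,0} -> {A,B,C,C,B,A}.
```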
A permutation iterator, potentially better named as the scatter iterator, is the same thing, but you're just bundling an iterator that points to your data and an iterator that points to this index map. That's it. And so if you're trying to do any kind of operation lazily, that is a specialization or some form of permutation, you can do it with a permutation iterator. You need to reverse a sequence.
you can do that with a permutation iterator.
You need to cycle a sequence.
You can do that with a permutation iterator.
You need to replicate a sequence.
You can do that with a permutation iterator.
A lot of the times, it's not the most performant way to do this stuff
because of the fact that a permutation iterator is so generic.
There's some overhead there.
But the point being is that a permutation iterator, is it poorly named?
Maybe it's actually well-named and Scatter is poorly named.
The point being is I don't like the asymmetry between scatter and permutation iterator.
And anyways, this is all to say that the very first implementation of the replicate integer overload
just made a call to the permutation iterator where the index map was just cycling indices,
which raises the question: how do you do cycling indices in Thrust?
You can just do a counting iterator composed with a transform iterator that does modulus
over the length of your sequence.
So you generate the numbers.
You know, if you want to replicate it, well, I guess, well, actually, that's a good point.
Oh, no, right, because N is the same for everything, right?
So if you're trying to replicate everything three times, you do a counting iterator, zero to nine, or zero to eight inclusive, and then you do modulus three, and then you're going to end up with zero one two, zero one two, zero one two as your index map, and that is the equivalent of a replicate when you apply, you know, a copy_if using that map.
Actually, that's a great way of framing it.
Scatter can be implemented and probably is implemented.
Actually, I don't know.
I don't know that.
As thrust::copy_if with a permutation iterator.
That's just what scatter is.
Anyways, I'm going to stop talking.
You're doing a good job of explaining it.
I would, yeah, for the integer version, I mean, I guess you can just think of it as just being that you're copying from the original
array. But, like, you've... I have an array of length, you know, n, and then I want to replicate
each element m times. Then I've got an array of length n times m. And when I index into that
array of length n times m, I'm indexing into the original array of length n, and I'm just moduloing
so that I wrap around, and that's, that's, that's it. And that's probably the fastest way to do it,
except I totally messed up, as some of you are screaming in your heads right now: that doesn't make
any sense, because what I just described is cycling. If you're replicating,
you want to be doing the opposite of modulo, which is integer division, right? So you want to go
from zero to eight inclusive, not to zero one two, zero one two, zero one two, because that
results in ABC ABC ABC, which is cycling. You want to have zero zero zero, one one one, two two two, which is
integer division, and then you end up with
AAA, BBB, CCC.
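A minimal sketch of that integer-division replicate, assuming Thrust; the function and functor names here are illustrative, not the parrot API:

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Index map functor: 0,0,0,1,1,1,2,2,2,... for n = 3.
struct div_by {
  int n;
  __host__ __device__ int operator()(int i) const { return i / n; }
};

// Replicate each element n times: {A,B,C}, n = 3 -> {A,A,A,B,B,B,C,C,C}.
// Swapping i / n for i % data.size() would give cycle ({A,B,C,A,B,C,...}) instead.
thrust::device_vector<char> replicate(thrust::device_vector<char> const& data, int n)
{
  int const len = static_cast<int>(data.size()) * n;
  auto map  = thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                              div_by{n});
  auto perm = thrust::make_permutation_iterator(data.begin(), map);
  thrust::device_vector<char> out(len);
  thrust::copy(perm, perm + len, out.begin());
  return out;
}
```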
Yeah, yeah, yeah, sorry, I should have also caught that.
You only messed up because I've been.
Yeah, but, but, um, but, and permutation iterator
should boil down to just being the integer division.
Although you were saying earlier that you, wait, hit some perf issues, right?
Permutation iterator doesn't do any of that work for you.
You have to give it the index map.
So it can be, it can be completely a random sequence, right?
If you want to do a shuffle, uh, so, so for the index map,
you're passing in some sort of transform iterator that
does the division.
Exactly.
So the transform, it's a counting iterator with a transform.
So it's counting iterator, piped or whatever, to a transform iterator that does
the integer division.
And that iterator, that is the composition of those two iterators, you pass to a
permutation iterator.
And so if you want, you could just do like a random shuffle by generating a random sequence
of indices and then that, you know, call a copy if with that permutation iterator,
and that would be a shuffle, right?
You can do anything almost with a permutation iterator.
You can do a copy with a permutation iterator where your index map is just a counting iterator.
You wouldn't want to do it that way because that's ridiculous.
But the point being is that the permutation iterator is like the most powerful of the iterators.
Like if you think we've got constant iterator, we've got counting iterator, we've got transform iterator.
An important clarification I just realized: we mentioned a replicate iterator before, and we mentioned two overloads of replicate.
The replicate iterator only exists for the integer version, right?
Wait, wait, wait.
There's no replicate iterator.
There's two overloads of the replicate function.
One, they both take the data sequence.
One takes an integer, and the other one takes an array.
But, sorry, to clarify further, one could imagine writing a replicate iterator for the integer version, because you can do that lazily.
I don't think you can do it for the array version.
Yes, if I understood what you just said, the integer version is lazy.
The array version is not.
And the reason for that is because for the array version,
it's trivial a priori to know ahead of time, like, the length of the sequence.
And so then, thus to know, like... the integer version, you mean. You just said the array version.
The integer version.
It's trivial to know ahead of time the length of the sequence, and it's important to know
the length of, or not just about knowing the length of sequence.
It's trivial ahead of time to know the placement of, you know, of where every element's
going to be in the output, whereas when you've got variable size replicates, you don't necessarily,
like, it's no longer an independent problem,
you know, of where each element is in the output, and that makes it some form of scan,
some form of copy_if.
And you mentioned earlier running into perf issues with the permutation iterator.
Was that the case?
Or, I think the implementation you just described should be the optimal implementation.
Using the permutation iterator, which is a very generic utility, is a,
I have found a lot of the times comes with overhead because we're using a permutation iterator
to do something more specific.
Yeah, yeah, yeah.
What we really want is a custom iterator that doesn't incur the overhead of something
so generic.
And so a lot of the times, things like cycle iterator, replicate iterator,
a couple others, I would always implement in terms of the permutation iterator because
that was the utility there.
But one time I was profiling my work versus JAX, which I consider... or, anecdotally, in my profiling, the JAX jitting typically is the best for this kind of stuff.
And there was one example where JAX was beating the work that I was working on.
And so I started doing some tuning, and I found, like, three different times: switching the cycle iterator to basically implement a bespoke thing that didn't rely on permutation iterator, and doing the
same thing for the replicate iterator.
Each of those led to like a performance improvement.
And I was kind of surprised, but it does, like anytime you're reaching for something that
is not in terms of like a generic reduction, because you're just passing a binary operation,
but something like a permutation iterator that has to be capable of more, similar
to when we talk about using reduce_by_key if you want to do, like, a row-wise reduction,
reaching for reduce_by_key, which can handle variable length segments, whereas the rows
of a matrix always have the same length.
Anytime you're doing that,
it shouldn't be surprising that it incurs
some kind of overhead.
But it is weird because
it's just this iterator abstraction,
it should just melt away.
But I'm not shocked
to hear that you did run into
perf issues with it because
we're talking about an operation that,
we're talking about two operations.
like the cycle and replicate, both of which boil down to doing integer division
or integer modulo.
And so it's possible that the kernel ends up being like integer throughput constrained
or maybe even register constrained and that some of the abstraction cost of permutation
iterator, which again should be able to boil away, may still be present.
I guess it's, I think it's maybe also possible that the use of permutation iterator
throws off like vectorization and maybe that's causing problems.
But anyways, it would be interesting to look more into that, but I think we digress.
Yeah, and I think what will happen here, folks, is we recorded the episode in Aarhus last week.
And so I think we've figured out that you were going to be listening to that in October.
Let's check the calendar, folks.
I've got too many apps.
The calendar's way down there.
You will have already listened to the Aarhus episode on October 5th.
We are going to...
Actually, so wait, this episode is going to come out on the 12th.
So maybe I was going to say, we plan on recording at least once when we're at NDC Techtown.
We might do it on the Wednesday night.
They've got like a little social afterwards, and our workshops will be done.
Your talk will be done.
So I think probably...
It'll be interesting to see whether either of us still have a voice
after two days of eight-hour courses.
That's a good point.
Pending that we have voices, we will record probably in a couple days
on the Wednesday, the first day of NDC Techtown.
And so the seven episodes ago when I mentioned that like,
oh, remind me that we got to talk about replicate,
is that it is, I think, one of the most interesting algorithms
that I've implemented because,
Because it has such an interesting... like, going from the integer overload, which is pretty
trivially done and not only done but done lazily, to the array overload, it's almost a
completely different algorithm because you can't do it lazily or you definitely can't do all
of it lazily.
And so think about it.
You know, Bryce already gave you a hint.
A scan is involved, which it's almost become, oh, I guess it's the integer.
I think it is just, I think it is strictly just run-length decode.
I mean, but run length decoding, though, is cheating.
How's it cheating?
It's the same inputs.
It's got to be just run-length decoding.
I guess it is just run-length decoding.
Well, Bryce is currently looking up.
Does CUB have a run-length decode?
Yeah, what's the API?
It takes run value and...
It takes values as an array of N,
and it takes lengths as an array of N.
But we're still going to talk about
how would you implement this in Thrust
if you were limited to using Thrust.
So run-length decode in CUB: it decodes into, like, a window of a limited size,
it decodes into like a window of a limited size
and then it's repeated until all the runs have been decoded.
I'm almost certain that the way that CUB does it
is not going to be the way that you've implemented it
because you described implementing in terms of a scan,
which sounds right.
But I think what CUB does
is it picks a static fixed-size window,
and it starts with the sequence
and it decodes
it starts decoding what it's got
until the window is filled
and then
it must write the window to the output
and then it starts doing another window
why is this optimal
that doesn't sound very parallel
That does not sound very parallel.
That sounds very sequential.
This is for the block-level one.
I'm going to have to look at what the...
I would guess that using a scan,
you can figure out where in the data sequence
the beginning of each window decoding starts.
Because you can back that out with an operation
that I will not reveal here.
I think we're essentially...
Saying the same thing?
No, no, no.
I think we're essentially...
So essentially we've come upon
how do you implement efficient GPU decompression.
So what I just described was for the block run-length decode.
All right, you talk about things while I look at what the device-wide one does.
Bryce is deep-reading GitHub code, and it's good to be in Denmark.
Had a lovely run today.
We did a long run.
30K.
Started off a bit rough because I did not eat anything.
but then I got a Powerade at the 11K mark.
Denmark is a very beautiful country.
Copenhagen is a very beautiful city.
We'll definitely be back.
Yeah, a beautiful evening.
Good to be alive, folks.
Good to be live.
Bryce wants the mic.
So there's a device-wide run-length encode in CUB.
There's no device-wide run-length decode.
There's just this block run-length decode
that allows you
to retrieve a window from the run-length
decoded array, which is, I guess
makes sense if you want to, like, do block,
if you want to just, like, process, like, a chunk.
Like, you want to be able to work with, like, a fixed-sized chunk
when you're doing, like, CUDA programming.
But, okay, now I want to hear,
I want to hear how you implemented it.
Well, I need to, I would need to go back
and actually look, because, like I said,
the first time I implemented it,
it was incorrect, but off the top of my head, whether this is the actual implementation,
and this discussion may never see the light of day if it's not, you do a plus scan
on the integers in your integer array.
Bryce doesn't even listen anymore.
I am.
I am.
What are you doing?
He can multitask, he says.
You do a plus scan on your integer array, and that gives you the indices where each of your decoded segments start.
Yes.
That's the first step.
And then once you have that, you want to basically create an index map that corresponds to copying the first element at index zero to however many spots the first integer represents.
So if your sequence is ABC and your integer array is one, two, three, you want your resulting index map to be zero, one, one, two, two, two.
how do you do this exactly
you want to
and this is the this is the tricky part
so you have your plus scan
that is going to give you
I mean you can do it with a scan right
yeah, I mean, the plus scan gives you one
but the index map
you can make the index map with a scan
I don't
you could do a single pass
hang on
I think you just described... you need to figure out where... you're right. So you do your plus scan on your integer sequence,
and that gives you one, three, six, and those are the lengths... that's technically the ending index plus one of each of your segments. So if you prefix that plus scan...
or, technically, so, I mean, in Thrust this is an exclusive scan, but I don't
disambiguate. I only have one scan, and if you want an exclusive scan instead of an inclusive scan, you
just prefix it with a zero. So if you add a zero to your one, three, six... so what one, three, six,
to repeat and be clear, is, is the result of a one, two, three plus scan.
So the first element is just your one, your second element is one plus two,
and your third is one plus two plus three, so one, three, six. Prepend a zero, so you end up with zero,
one, three, six.
We're not going to need the six, folks.
That's a little secret.
And so now what you want to do
is you want to do a...
Oh, fuck.
This is a very weird scan
because...
Wait, wait, wait.
It's expanding.
I think we might have explained...
We might have explained this...
We might have explained the...
scatter completely wrong, because... all right, I'll just think out loud. I'm allowed to make mistakes.
So right now what I'm thinking is that with the n minus one values of your zero-prefixed plus scan, so
zero, one, three, six, if we drop the six, you want to scatter the indices, basically zero, one, and two,
those indices.
one to the one index,
and two to the three index,
which is that, those indices, those...
You're going to do a pass later that...
Then you're going to do a max scan.
So if your output sequence is going to be six values in length...
So you do two scans.
You do two scans. You do a plus scan on the integer sequence.
Then you do a scatter.
Yeah.
To get the initial value of each segment in your output sequence.
And then if you do a max scan... so your output sequence, it starts as six zeros.
Then it has zero in the zero index, one in the one index, and then two in the three index.
So it's zero, one, zero, two, zero, zero, I believe, if I got that right.
And if you do a max scan on that sequence, basically, and you know,
because the values are indices, so they're always increasing, you're going to end up with
0-1-1-2-2-2. So the 1 basically gets copied to everything up until the 2, and then the max-scan,
the max kicks in again, and the 1 resets to a 2, and then everything after that is a 2.
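Putting that whole pipeline together, here is a minimal sketch assuming Thrust and strictly positive run lengths (a zero-length run would make two segments scatter to the same offset and break this trick); the function name is illustrative:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

thrust::device_vector<char> run_length_decode(
    thrust::device_vector<char> const& values,   // e.g. {'A','B','C'}
    thrust::device_vector<int>  const& lengths)  // e.g. {1, 2, 3}
{
  int const n = static_cast<int>(values.size());

  // 1. Exclusive plus scan of the lengths gives each segment's start offset:
  //    {1,2,3} -> {0,1,3}; the total output length is the last offset + last length.
  thrust::device_vector<int> offsets(n);
  thrust::exclusive_scan(lengths.begin(), lengths.end(), offsets.begin());
  int const total = offsets.back() + lengths.back();

  // 2. Scatter the segment indices 0..n-1 to those offsets in a
  //    zero-initialized index map: {0,1,3} -> {0,1,0,2,0,0}.
  thrust::device_vector<int> map(total, 0);
  thrust::scatter(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(n),
                  offsets.begin(), map.begin());

  // 3. Max scan fills each segment with its index: {0,1,1,2,2,2}.
  thrust::inclusive_scan(map.begin(), map.end(), map.begin(),
                         thrust::maximum<int>());

  // 4. Read the values through that index map (a gather): "ABBCCC".
  thrust::device_vector<char> out(total);
  thrust::gather(map.begin(), map.end(), values.begin(), out.begin());
  return out;
}
```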
The thing that I've realized, though, is that I'm pretty sure, well, actually, I'm now
I'm confused, a scatter might not scatter indices to a destination.
it might
scatter the elements
to that index.
We have to look that up.
Actually, by look up, I'm just going to ask Gemini.
Actually, we'll get Gemini on the pod right now.
Write me a kernel
that does device-wide run-length decoding, using CUB.
What does the algorithm scatter
in the Thrust Parallel Algorithms Library do?
we're just
we're just two men sitting here
talking into a ChatGPT
I'm a little bit surprised
because in the past it usually reads it to me
and I guess I click some button that
turned off the live reading
but it says the thrust scatter algorithm
in the thrust parallel algorithm library
is a powerful and efficient parallel primitive
that reorders a sequence of elements
it copies elements from a source sequence
to a destination sequence
but instead of a one-to-one copy
it uses a map to determine the destination
index for each element.
So, yeah, so I...
It's doing the same thing as you.
I did...
No, no, no.
I messed it up.
So it says the essence
is that thrust scatter
performs the following.
The destination at the map
of the i-th index
is equal to
the source
at the i-th index;
that is, destination[map[i]] = source[i].
Which means...
Write it again,
but no comment
and make it as concise
as possible.
Remove all of the error handling.
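For reference, a minimal side-by-side sketch of the two directions, assuming Thrust and a toy output buffer (the function name is illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/gather.h>

void scatter_vs_gather(thrust::device_vector<char> const& src,
                       thrust::device_vector<int>  const& map,
                       thrust::device_vector<char>&       out)
{
  // scatter: the map indexes the *destination* -> out[map[i]] = src[i]
  thrust::scatter(src.begin(), src.end(), map.begin(), out.begin());

  // gather: the map indexes the *source* -> out[i] = src[map[i]]
  thrust::gather(map.begin(), map.end(), src.begin(), out.begin());
}
```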
So what does this mean?
Do we have to cancel the whole scatter conversation?
I don't know.
So I think in most cases,
you will do a first pass anyways
to determine the size of the decompressed sequence
because you need to allocate memory for it.
Like I struggle to imagine a use case
I struggle to imagine a use case where you want to do
the decoding, but where you don't need to know
ahead of time how large the thing is going to be when
decoded.
I mean, of course, but that's just a reduction.
Right, right, but I'm saying if you have
that in for, or if you're going to do that reduction...
And also, it's not even a reduction.
It's just the last element of the scan that you have to do on the...
Right, that's what I'm getting...
That's the point.
...getting to, is that if you're going to do the scan anyways... like, the way that,
when I asked ChatGPT, it did like a reduce first so that it could allocate the temporary storage,
and then it did a scan to figure out, you know, where shit's going to go, and then it did
a third pass to actually move stuff, and three passes seems like overkill. You don't need that
first reduce. You can just do the scan.
But I think that this may be a case where you can't get it down to one pass.
And the reason you can't get it down to one pass is because I can't imagine it being useful to be like, oh, I want to decode this thing.
And here's like, you know, a gigabyte of storage decode it into.
But if it's bigger than that, I don't care.
Although now that I say it, now that I say it, if I'm, maybe actually I'm wrong about that.
maybe like if I've got like some things, some image I'm decoding or something where like I want to decode it into like a fixed size buffer and then I'm going to write it to disk, right? Because like I don't know how large this thing is going to be when I decompress it. But I want to like decompress it in stages and then write it out to disk. And so once again, I shall tilt at windmills on my quest to figure out whether you can do this with a single-pass algorithm. I'm almost certain that there's a way to do this with a single-pass algorithm. However, it might require the use of some things that, um,
that we've never talked about before,
in particular a completely different model of GPU computing,
which we don't typically think about,
which I think may be required in this case,
and that would be a cooperative launch,
because if we have a bunch of groups of threads
that are working on this decoding problem,
like at the same time,
we either need to be able to have an unknown,
like an arbitrarily growing number of blocks,
like of threads, right?
Like each block of thread decodes a known amount.
Okay, maybe, okay, all right, all right, all right.
Let me, let me, I'm cooking here.
So when you launch a kuda kernel,
you have to specify how many blocks you're running with.
This microphone's a little bit waterproof, right?
I don't know.
Okay.
It's starting to rain now, folks.
We might need to kill this stuff.
I just have to say, I don't know how I'm going to edit this because I'm pretty sure
every time I said scatter earlier, what I should have been saying is gather,
but I'm not even confident about that because my confidence is shot because I totally
messed up the description of an algorithm.
Get ahold of yourself, man, we're cooking here.
I'm not... so, I can imagine two different types of implementations.
So when you launch a CUDA kernel, you need to, like, you need to specify statically
how many blocks of threads are going to execute.
And the best, there's two different ways
that we could approach this problem.
Way number one is the way that we typically think about
CUDA computing where each block of threads...
Look at this ice cream.
This is...
Yes, we're going to go in there.
Let me finish this spiel.
Each block of threads decodes a fixed chunk of work.
Typically when we're talking about these problems,
we imagine a world where there's a fixed number of blocks of threads,
and each block
does a fixed amount of work
and to do that you need to be able to know
how much work there is in total
in this case we don't know how much work there is in total
however we could imagine
that we could launch a kernel
that
where we assume
we're trying to get under cover
or we're going to the ice cream place
all right here
we're doing both but it's too loud now
it's all right
So we could imagine, one approach I could imagine would be to say we're going to do a fixed amount of work per block, we're going to launch a fixed number of thread blocks, and then if that is not sufficient, we launch a second kernel that continues the decoding.
Now that might mean that we end up, like, wildly... we might end up using a lot more compute than we planned,
because maybe the, maybe the decoding takes a lot less work than we planned. But the second approach,
the second approach is this cooperative approach. In this second approach, we launch a much smaller
number of blocks of threads, but the amount of work that each block of threads does is
variable. That is, we don't know
ahead of time how much work
they're going to do. They're each going to do
roughly the same amount of work, but it's going to be what
we sometimes call like a persistent
kernel or a cooperative kernel, where it's going
to continue, it's going to work
on a chunk of input, it's going
to do that chunk of input,
and then it's going to
load the next chunk of input and start doing
that next chunk of input. However,
to write that type of kernel,
you need to
be able to have all of the
blocks communicate with each other and synchronize with each other because in this case
once you're done with your chunk of work to be able to start in the next chunk of work
you need to know where everybody else wrote their chunk of work which would require you to do
a scan. But in this case we would maybe want to do a cooperative scan, and the advantage of doing that
is that we can do the cooperative scan
at the same time that we're loading
the next chunk of input
because while we don't know
where the next chunk of output will go,
we know we can statically assign
where the next chunk of input
will come from,
which means that we can be loading it
concurrently with figuring out that scan.
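A bare-bones sketch of that persistent-kernel shape, assuming CUDA cooperative groups; the load, scan, and decode bodies are left as comments, and the kernel must be launched with cudaLaunchCooperativeKernel for grid.sync() to be legal:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistent_decode(int num_chunks /*, runs, lengths, out... */)
{
  cg::grid_group grid = cg::this_grid();
  // Every block executes the same number of iterations so that all blocks
  // always arrive at grid.sync() together (no divergent grid-wide barriers).
  int const iters = (num_chunks + gridDim.x - 1) / gridDim.x;
  for (int it = 0; it < iters; ++it) {
    int const chunk = it * gridDim.x + blockIdx.x;
    // 1. Load this block's statically assigned input chunk (if chunk is in
    //    range); this load can overlap with the scan below.
    // 2. Cooperative scan across all blocks to agree on output offsets.
    grid.sync();
    // 3. Decode the chunk into the output at the agreed offset.
  }
}
```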
Okay, I think I may have lost some people there,
but we're going to go get ice cream.
Yeah, he lost me,
but it's hard to focus after the devastating
mess up. I'm just really hoping that gather
is, it's possible that everything I said is true for gather
and then I can go in and every time I say scatter,
I'm just going to cut in, I'm going to cut out scatter and cut in gather.
I believe that the thing that you described
is an algorithm within Thrust and CUB.
Also, you got to, wait, don't move, don't move.
You're going to get this thing off my head?
No, I'm just going to take pictures.
What is it?
It was a little fly, I think.
Don't worry.
It's okay.
It wasn't anything scary.
All right, folks.
Hopefully that was audible because we're standing next to a bar with like a dance party.
And we're ending this.
Goodbye from Copenhagen.
We're not done yet.
We're not done.
Yeah, yeah.
We're turning it.
No, we're getting ice cream.
Then we're going to continue.
We're getting ice cream.
We're not continuing.
And we will pick this up again.
That's actually a good question.
If we're recording and we might have to record again before Wednesday.
We're going to pick this up again in five minutes.
We will not. We will not. This is going on my backpack. It's raining.
And a good night from Copenhagen. Maybe we'll record next time.
We're not going to have the energy after Monday and Tuesday.
This has become very ADSP. No, I'll have energy for this.
For this? All right. You'll hear from us. Today is Saturday.
Just to be clear, this entire problem we've been talking about is essentially GPU decompression.
Yes. Yes. Yes. And it is, am I not correct? It is a very interesting algorithm.
It is a very interesting algorithm.
Oh, we didn't, okay, we need to mention the final thing,
which is why we were talking about replicate in the first place,
which is because of array broadcasting.
So imagine that you've got like a vector and a matrix,
and you want to, like, apply some operation in,
like you want to apply some element-wise operation,
like adding or, you know...
You're explaining this incorrectly.
It's not when you want to do an element-wise operation,
It's when you want to do some kind of binary operation on a matrix and a vector.
So the example that we were talking about earlier: Ashwin, who we're going to be meeting up with
tomorrow in Norway, posted internally something about a row-wise softmax, which is a computation
in, I don't know, machine learning stuff.
But essentially, you're performing a bunch of binary...
The fundamental operation to flash attention, which is fundamental to GPU inference,
which pays all of our bills.
It's just so sad because they give it a name like softmax and you're just like,
like all these things, softmax, ReLU.
It's just a combination of a couple of binary operations.
Like it's nothing fancy.
We will do a whole episode about softmax and why it's magical and we'll talk about Argmax too.
We'll do that.
The three binary, it's three, two binary operations and a unary operation.
There's max.
There's summation.
I guess there's division.
And then there's also exponentiation.
But it's like you do those things in order.
You're done.
It's like, it's so boring.
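For the record, the standard numerically stable row-wise formulation, which is not spelled out in the episode but is exactly those pieces (a max, a subtraction, an exponential, a sum, and a division), is:

$$\operatorname{softmax}(x)_i = \frac{e^{\,x_i - \max_j x_j}}{\sum_k e^{\,x_k - \max_j x_j}}$$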
And like they're like, softmax.
You hear softmax,
and you're like, ooh, like, how complicated is that?
Do you know that there are entire teams that exist at NVIDIA,
whose job it is to make softmax run faster?
I'm just, I'm just, it's, anyway, I just, like, these names, like, and also ReLU,
like, I can't remember what ReLU is, and there's a couple different versions of it,
but, like, I'm pretty sure it's just, like, one over a number,
or something like that, or it's, like, capped a...
Anyways, I just, like, as someone who studied actual mathematics,
and I suffered through actually, like, difficult stuff,
they give all these fancy names, and it's just, like, it's just basic math.
Anyways, row-wise softmax involves basically needing to do an operation where you calculate the maximum of each row,
and then you divide each of the values in the matrix by the maximum of the corresponding row.
Bryce is dancing now.
And so what that means is that if I've got a matrix, I compute however many row maximums,
and then I want to divide that maximum into each
of the values in each of those rows.
So it's not an element-wise operation.
It's an element-wise operation
on a row-wise basis,
which means, like, if I've got a three-by-three matrix
and three values for the rows,
I want to divide the first row by the first max,
the second row. Anyways, I think people get it.
The point being, though, is you can use replicate
to turn that vector into...
Not a matrix, but, like,
or I guess basically a matrix.
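A hedged sketch of that pattern, assuming Thrust, row-major storage, and the simplified divide-by-row-max operation described here; all names are illustrative, not the parrot API:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Key functor: flat element i belongs to row i / cols.
struct row_of {
  int cols;
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

// Divide every element of a row-major rows x cols matrix by its row's maximum.
void divide_rows_by_max(thrust::device_vector<float>& m, int rows, int cols)
{
  auto keys = thrust::make_transform_iterator(
      thrust::counting_iterator<int>(0), row_of{cols});

  // Row-wise max via reduce_by_key (the variable-length-segment machinery
  // discussed earlier, used here on fixed-length rows).
  thrust::device_vector<float> row_max(rows);
  thrust::reduce_by_key(keys, keys + rows * cols, m.begin(),
                        thrust::make_discard_iterator(), row_max.begin(),
                        thrust::equal_to<int>(), thrust::maximum<float>());

  // "Replicate" each row max across its row lazily: the same i / cols index
  // map fed to a permutation iterator, i.e. the integer-division replicate.
  auto maxes = thrust::make_permutation_iterator(row_max.begin(), keys);
  thrust::transform(m.begin(), m.end(), maxes, m.begin(),
                    thrust::divides<float>());
}
```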
For anybody who's familiar with, like,
NumPy-style broadcasting...
Hopefully we've not lost the...
We might have blown out the gain a little.
That might have been my bad.
But for anybody who's familiar with NumPy-style broadcasting,
where when you're working with two arrays of different arity,
the smaller array can be stretched.
It's arity, not arty?
Arty sounds like Ariana Grande's cousin.
All right.
We're cooking here.
Where it can be stretched to the rank of the larger array.
And you can use replicate to do that stretching.
Anyways, we're going to go get ice cream now, but that's why we were interested in talking about replicate.
It's actually not why we were interested in it. I only realized when I was trying to implement the row-wise softmax that, like, oh, replicate's a very convenient thing if you don't have broadcasting in your library or language.
I had implemented replicate a month before because I was trying to solve some problem that clearly just needed to replicate.
And then when it came to implementing, it was like literally some Perl Weekly Challenge or LeetCode problem where it was just like, oh, like, clearly what I'm
missing here is replicate, which is, it's just an APL primitive, and I hadn't added it.
And this library that I've been working on, it's heavily inspired by all the array operations.
We may need to talk to the, like, the, we may need to talk to some of, like, the, like, nvJPEG and, like, the nvCOMP people, the compression library people.
Because they know about compression?
Well, yeah, I mean, I think the people who make the compression library know about compression, yeah.
Do we need to talk to them? No. Will we talk to them? Maybe.
I do believe that a single-pass algorithm exists here.
I mean, I always believe a single-pass algorithm exists here,
but, all right, Connor's fixing his sweater.
His tan invidia sweater.
It's gold.
There is no world in which this is gold.
All right, we're going to go to ice cream.
And I am dancing now.
Very poorly.
But I did learn some dance moves
When I was in Ethiopia
They taught me how to do the shoulder thing
Which I kind of remember how to do
Yeah
There we go folks
She fixed the sweater
My only sweater and my only shorts
That I brought to Norway at the middle
That's true I do have a
He brought shorts to Norway
He's going to be cold
Have I been cold this whole time
But right now we're in Denmark
not Norway. It's going to be colder in Norway.
I have been in Norway in September.
Trust me. It's going to be cold.
I think we're going to be just fine.
Was I slightly underdressed for a couple of the restaurants we went to?
It's not a maybe.
It's a most definitely yes.
I took this man to the nicest restaurant in Aarhus, or however you pronounce it,
wearing his Lululemon shorts and a running t-shirt.
Yeah.
Was it a running t-shirt?
I know. I kept the hoodie on.
I kept the hoodie.
Yeah, yeah.
Listen, you know what?
I would have brought nicer clothes if I had thought about the fact that Bryce wanted to eat at fancy places.
I need you to say that again with a straight face.
Is this what, so that we can put this on the podcast?
I don't know what I'm doing with this, but...
So, Bryce asked, was it ChatGPT or Gemini?
ChatGPT.
For some humorous names for scatter and gather, because of my misunderstanding earlier.
One of the initial unhumorous ones was scatter_to and gather_from, which I really like. I was already
thinking about scatter_to. But anyways, the top humorous one was yeet and yoink, and then Bryce thought
that was hilarious and that's what we should call it. But my response was, well, that doesn't make
any sense, because yeet means to, like, delete, to get rid of, to toss it. You know, when they say
yeet, it means get rid of it, and that's not what scatter's doing. So semantically it has nothing
to do with the actual semantics. So I believe you said, specifically, yeet
semantically has nothing to do with the actual operation.
Yeah, yeah, something like that.
Oh, God, I'm dying.
I'm going to read some of the other, some of the other ones.
Be sure to check these show notes, either in your podcast app
or at adspthepodcast.com for links to anything we mentioned in today's episode,
as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.
Thanks for listening.
We hope you enjoyed and have a great day.
Low quality, high quantity.
That is the tagline of our podcast.
It's not the tagline.
Our tagline is chaos with sprinkles of information.