Algorithms + Data Structures = Programs - Episode 256: 🇩🇰 Algorithms: Replicate, Scatter, Gather & RLD (Part 2)
Episode Date: October 17, 2025
In this episode, Conor and Bryce record live from Denmark! They talk about the replicate, scatter, gather and run length decode algorithms!
Link to Episode 256 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter
Date Recorded: 2025-09-20
Date Released: 2025-10-17
thrust::gather
thrust::scatter
thrust::permutation_iterator
thrust::counting_iterator
thrust::sequence
thrust::transform_iterator
thrust::copy_if (stencil overload)
parrot::replicate Implementation
JAX
thrust::reduce_by_key
cub::RunLengthDecode
APL Wiki Replicate
ArrayCast Episode 110: Implementing Replicate
Row-wise Softmax in Triton
Row-wise Softmax in Parrot
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
For anybody who's familiar with numpy style broadcasting,
where when you're working with two arrays of different arity...
It's arity, not arty?
Oh, screw.
Arty sounds like Ariana Grande's cousin.
All right, we're cooking here.
We might have explained the scatter completely wrong.
We're just two men sitting here talking into ChatGPT.
What I should have been saying is gather,
but I'm not even confident about that
because my confidence is shot because I totally messed up the description of an algorithm.
Get ahold of yourself, man.
We're cooking here. And so once again, I shall tilt at windmills on my quest to figure out
whether you can do this with a single-pass algorithm. Is it poorly named? Maybe it's actually,
it's actually well-named, and scatter is poorly named. The point being is I don't like the asymmetry
between scatter and permutation iterator. When you launch a CUDA kernel, you have to specify how many
blocks you're running with. This, this microphone's a little bit waterproof, right? I don't know.
Okay.
It's starting to rain now, folks. We might need to kill it.
Welcome to ADSP, the podcast, Episode 256 recorded on September 20th, 2025.
My name is Conor, and today with my co-host, Bryce, we continue to record live from the streets of Copenhagen.
We continue our conversation from last week about the replicate algorithm and two other algorithms, scatter
and gather.
All right, folks, so right off the bat, I do need to apologize.
In this video, for the majority of it, I refer to the scatter algorithm, when really what I meant
was gather, and I got incredibly confused.
We rectify this at some point in the conversation.
Scatter is an algorithm that exists in thrust, doesn't exist in the C++ standard library,
but I'm sure it exists in other libraries and languages.
You're given a sequence and, like, your data sequence,
which could be, you know, numbers, whatever, any type.
And then you're given, I believe it's indices.
And I'm going to feel bad if this is incorrect.
But I'm almost positive.
It's a sequence of integers that correspond to indices that are between zero
and the length of your data sequence minus one.
And, or that's what the values of the indices have to be.
Obviously, it has to be a valid index.
and the length of your
I believe it's what they call
an index map
the length of that
all right perfect
let us sit
is this going to be wet though
although this is a
it's a theater
which is not open
it's clearly a restaurant
but hopefully no one will come in to serve us
although this is wet as well
these ones are dry
well these can't possibly be wet
but these tables are not clean
so I think this is totally fine
no one's going to come and serve us here
a little dark
but that's okay
we got a nice view
so the length of your
so you've got your data
and you've got an index map
that can be any length you want
so if the length of our data is three
ABC
we can have
an index map that consists of just a single
one and what it does
is it scatters
the element at index one
to your output sequence
so if you scatter
if you
so scatter
you've got your data
and you've got an index map, which is a sequence of indices.
And scatter basically goes, it iterates through the indices and then copies them to...
So it's like a copy_if, except that a copy_if just takes a predicate,
and then it's a stable copy where the relative orders are the same, obviously,
because you're just getting rid of some of the values, and then anything that corresponds...
This is not how permutation iterator works.
I'm getting to permutation iterator.
I'm trying to explain... for those that aren't familiar with thrust::sequence:
thrust::sequence is the equivalent of std::iota, the algorithm in C++.
It generates incrementing integers starting from a value, typically zero.
And in C++20, we added views::iota, which does the exact same thing, just lazily.
In Thrust, the two names for these things are thrust::sequence and thrust::counting_iterator.
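As a quick illustration of that relationship, here is a minimal sketch, assuming Thrust is available; the function and variable names are just for this example:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>

void iota_two_ways()
{
  // Eager, like std::iota: fills v with {0, 1, 2, 3, 4}.
  thrust::device_vector<int> v(5);
  thrust::sequence(v.begin(), v.end());

  // Lazy, like std::views::iota: *(c + i) == i, no storage touched.
  thrust::counting_iterator<int> c(0);
  (void)c;  // silence unused-variable warnings in this sketch
}
```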
The same way that there's a relationship between sequence and counting iterator,
there's a relationship between scatter and the permutation iterator.
So scatter takes your data and an index map and it copies from the indices and the values
that correspond to those to an output sequence.
So if you've got ABC and 1, you just get the character B.
If you've got ABC and 2-1-0, you end up reversing your sequence.
If you have ABC and 0-1-2-2-1-0, your output sequence is ABCCBA.
You're basically, you know, it's pretty straightforward.
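A toy sketch of those examples, assuming Thrust, and, per the correction in the intro, spelled with thrust::gather, since the operation described reads data through the index map (the function name is illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>

thrust::device_vector<char> pick(thrust::device_vector<char> const& data,
                                 thrust::device_vector<int>  const& map)
{
  thrust::device_vector<char> out(map.size());
  // out[i] = data[map[i]]
  thrust::gather(map.begin(), map.end(), data.begin(), out.begin());
  return out;
}
// With data = {A,B,C} (0-based indices): map {1} -> {B};
// map {2,1,0} -> {C,B,A}; map {0,1,2,2,1,0} -> {A,B,C,C,B,A}.
```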
A permutation iterator, potentially better named as the scatter iterator, is the same thing, but you're just bundling an iterator that points to your data and an iterator that points to this index map. That's it. And so if you're trying to do any kind of operation lazily, that is a specialization or some form of permutation, you can do it with a permutation iterator. You need to reverse a sequence.
you can do that with a permutation iterator.
You need to cycle a sequence.
You can do that with a permutation iterator.
You need to replicate a sequence.
You can do that with a permutation iterator.
A lot of the times, it's not the most performant way to do this stuff
because of the fact that a permutation iterator is so generic.
There's some overhead there.
But the point being is that a permutation iterator, is it poorly named?
Maybe it's actually well-named and Scatter is poorly named.
The point being is I don't like the asymmetry between scatter and permutation iterator.
And anyways, this is all to say that the very first implementation of the replicate integer overload
just made a call to the permutation iterator where the index map was just cycling indices,
which raises the question: how do you do cycling indices in Thrust?
You can just do a counting iterator composed with a transform iterator that does modulus
over the length of your sequence.
So you generate the numbers.
You know, if you want to replicate it, well, I guess, well, actually, that's a good point.
Oh, no, right, because N is the same for everything, right?
So if you're trying to replicate everything three times, you do a counting iterator, zero to nine, or zero to eight inclusive, and then you do modulus three, and then you're going to end up with zero one two, zero one two, zero one two as your index map, and that is the equivalent of a replicate when you apply, you know, a copy_if using that map.
Actually, that's a great way of framing it.
Scatter can be implemented and probably is implemented.
Actually, I don't know.
I don't know that.
As thrust::copy_if with a permutation iterator.
That's just what scatter is.
Anyways, I'm going to stop talking.
You're doing a good job of explaining it.
I would, yeah, for the integer version, I mean, I guess you can just think of it as just being that you're copying from the original
array. But, like, you've... I have an array of length, you know, n, and then I want to replicate
each element m times. Then I've got an array of length n times m. And when I index into that
array of length n times m, I'm indexing into the original array of length n, and I'm just moduloing
so that I wrap around, and that's, that's, that's it. And that's probably the fastest way to do it,
except I totally messed up, as some of you are screaming in your heads right now: that doesn't make
any sense, because what I just described is cycling. If you're replicating,
you want to be doing the opposite of modulo, which is integer division, right? So you want to go
from zero to eight inclusive, not to zero one two, zero one two, zero one two, because that
results in ABC ABC ABC, which is cycling. You want to have zero zero zero, one one one, two two two, which is
integer division, and then you end up with
AAA, BBB, CCC.
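A minimal sketch of that integer-division replicate, assuming Thrust; the function and functor names here are illustrative, not the parrot API:

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Index map functor: 0,0,0,1,1,1,2,2,2,... for n = 3.
struct div_by {
  int n;
  __host__ __device__ int operator()(int i) const { return i / n; }
};

// Replicate each element n times: {A,B,C}, n = 3 -> {A,A,A,B,B,B,C,C,C}.
// Swapping i / n for i % data.size() would give cycle ({A,B,C,A,B,C,...}) instead.
thrust::device_vector<char> replicate(thrust::device_vector<char> const& data, int n)
{
  int const len = static_cast<int>(data.size()) * n;
  auto map  = thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                              div_by{n});
  auto perm = thrust::make_permutation_iterator(data.begin(), map);
  thrust::device_vector<char> out(len);
  thrust::copy(perm, perm + len, out.begin());
  return out;
}
```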
Yeah, yeah, yeah, sorry, I should have also caught that.
You only messed up because I've been.
Yeah, but, but, um, but, and permutation iterator
should boil down to just being the integer division.
Although you were saying earlier that you, wait, hit some perf issues, right?
Permutation iterator doesn't do any of that work for you.
You have to give it the index map.
So it can be, it can be completely a random sequence, right?
If you want to do a shuffle, uh, so, so for the index map,
you're passing in some sort of transform iterator that
does the division.
Exactly.
So the transform, it's a counting iterator with a transform.
So it's counting iterator, piped or whatever, to a transform iterator that does
the integer division.
And that iterator, that is the composition of those two iterators, you pass to a
permutation iterator.
And so if you want, you could just do like a random shuffle by generating a random sequence
of indices and then that, you know, call a copy if with that permutation iterator,
and that would be a shuffle, right?
You can do anything almost with a permutation iterator.
You can do a copy with a permutation iterator where your index map is just a counting iterator.
You wouldn't want to do it that way because that's ridiculous.
But the point being is that the permutation iterator is like the most powerful of the iterators.
Like if you think we've got constant iterator, we've got counting iterator, we've got transform iterator.
An important clarification I just realized: we mentioned a replicate iterator before, and we mentioned two overloads of replicate.
The replicate iterator only exists for the integer version, right?
Wait, wait, wait.
There's no replicate iterator.
There's two overloads of the replicate function.
One, they both take the data sequence.
One takes an integer, and the other one takes an array.
But, sorry, to clarify further, one could imagine writing a replicate iterator for the integer version, because you can do that lazily.
I don't think you can do it for the array version.
Yes, if I understood what you just said, the integer version is lazy.
The array version is not.
And the reason for that is because for the array version,
it's trivial a priori to know ahead of time, like, the length of the sequence.
And so then, thus to know, like... the integer version, you mean. You just said the array version.
The integer version.
It's trivial to know ahead of time the length of the sequence, and it's important to know
the length of, or not just about knowing the length of sequence.
It's trivial ahead of time to know the placement of, you know, of where every element's
going to be in the output, whereas when you've got variable size replicates, you don't necessarily,
like, it's no longer an independent problem,
you know, of where each element is in the output, and that makes it some form of scan,
some form of copy_if.
And you mentioned earlier running into perf issues with the permutation iterator.
Was that the case?
Or, I think the implementation you just described should be the optimal implementation.
Using the permutation iterator, which is a very generic utility, is a,
I have found a lot of the times comes with overhead because we're using a permutation iterator
to do something more specific.
Yeah, yeah, yeah.
What we really want is a custom iterator that doesn't incur the overhead of something
so generic.
And so a lot of the times, things like cycle iterator, replicate iterator,
a couple others, I would always implement in terms of the permutation iterator because
that was the utility there.
But one time I was profiling my work versus JAX, which I consider... or, anecdotally, in my profiling, the JAX jitting typically is the best for this kind of stuff.
And there was one example where JAX was beating the work that I was working on.
And so I started doing some tuning, and I found, like, three different times: switching the cycle iterator to basically implement a bespoke thing that didn't rely on permutation iterator, and doing the
same thing for the replicate iterator.
Each of those led to like a performance improvement.
And I was kind of surprised, but it does, like anytime you're reaching for something that
is not in terms of like a generic reduction, because you're just passing a binary operation,
but something like a permutation iterator that has to be capable of more, similar
to when we talk about using reduce_by_key if you want to do, like, a row-wise reduction,
reaching for reduce_by_key, which can handle variable length segments, whereas the rows
of a matrix always have the same length.
Anytime you're doing that,
it shouldn't be surprising that it incurs
some kind of overhead.
But it is weird because
it's just this iterator abstraction,
it should just melt away.
But I'm not shocked
to hear that you did run into
perf issues with it because
we're talking about an operation that,
we're talking about two operations.
like the cycle and replicate, both of which boil down to doing integer division
or integer modulo.
And so it's possible that the kernel ends up being like integer throughput constrained
or maybe even register constrained and that some of the abstraction cost of permutation
iterator, which again should be able to boil away, may still be present.
I guess it's, I think it's maybe also possible that the use of permutation iterator
throws off like vectorization and maybe that's causing problems.
But anyways, it would be interesting to look more into that, but I think we digress.
Yeah, and I think what will happen here, folks, is we recorded the episode in Aarhus last week.
And so I think we've figured out that you were going to be listening to that in October.
Let's check the calendar, folks.
I've got too many apps.
The calendar's way down there.
You will have already listened to the Aarhus episode on October 5th.
We are going to...
Actually, so wait, this episode is going to come out on the 12th.
So maybe I was going to say, we plan on recording at least once when we're at NDC Techtown.
We might do it on the Wednesday night.
They've got like a little social afterwards, and our workshops will be done.
Your talk will be done.
So I think probably...
It'll be interesting to see whether either of us still have a voice
after two days of eight-hour courses.
That's a good point.
Pending that we have voices, we will record probably in a couple days
on the Wednesday, the first day of NDC Techtown.
And so the seven episodes ago when I mentioned that like,
oh, remind me that we got to talk about replicate,
is that it is, I think, one of the most interesting algorithms
that I've implemented because,
Because it has such an interesting... like, going from the integer overload, which is pretty
trivially done and not only done but done lazily, to the array overload, it's almost a
completely different algorithm because you can't do it lazily or you definitely can't do all
of it lazily.
And so think about it.
You know, Bryce already gave you a hint.
A scan is involved, which it's almost become, oh, I guess it's the integer.
I think it is just, I think it is strictly just run-length decode.
I mean, but run length decoding, though, is cheating.
How's it cheating?
It's the same inputs.
It's got to be just run-length decoding.
I guess it is just run-length decoding.
Well, Bryce is currently looking up.
Does CUB have a run-length decode?
Yeah, what's the API?
It takes run value and...
It takes values as an array of N,
and it takes lengths as an array of N.
But we're still going to talk about
how would you implement this in Thrust
if you were limited to using Thrust.
So run-length decode in CUB: it decodes into, like, a window of a limited size,
it decodes into like a window of a limited size
and then it's repeated until all the runs have been decoded.
I'm almost certain that the way that CUB does it
is not going to be the way that you've implemented it
because you described implementing in terms of a scan,
which sounds right.
But I think what CUB does
is it picks a static fixed-size window,
and it starts with the sequence
and it decodes
it starts decoding what it's got
until the window is filled
and then
it must write the window to the output
and then it starts doing another window
why is this optimal
that doesn't sound very parallel
That does not sound very parallel.
That sounds very sequential.
This is for the block-level one.
I'm going to have to look at what the...
I would guess that using a scan,
you can figure out where in the data sequence
the beginning of each window decoding starts.
Because you can back that out with an operation
that I will not reveal here.
I think we're essentially...
Saying the same thing?
No, no, no.
I think we're essentially...
So essentially we've come upon
how do you implement efficient GPU decompression.
So what I just described was for the block run-length decode.
All right, you talk about things while I look at what the device-wide one does.
Bryce is deep-reading GitHub code, and it's good to be in Denmark.
Had a lovely run today.
We did a long run.
30K.
Started off a bit rough because I did not eat anything.
but then I got a Powerade at the 11K mark.
Denmark is a very beautiful country.
Copenhagen is a very beautiful city.
We'll definitely be back.
Yeah, a beautiful evening.
Good to be alive, folks.
Good to be live.
Bryce wants the mic.
So there's a device-wide run-length encode in CUB.
There's no device-wide run-length decode.
There's just this block run-length decode
that allows you
to retrieve a window from the run-length
decoded array, which is, I guess
makes sense if you want to, like, do block,
if you want to just, like, process, like, a chunk.
Like, you want to be able to work with, like, a fixed-sized chunk
when you're doing, like, CUDA programming.
But, okay, now I want to hear,
I want to hear how you implemented it.
Well, I need to, I would need to go back
and actually look, because, like I said,
the first time I implemented it,
it was incorrect, but off the top of my head, whether this is the actual implementation,
and this discussion may never see the light of day if it's not, you do a plus scan
on the integers in your integer array.
Bryce doesn't even listen anymore.
I am.
I am.
What are you doing?
He can multitask, he says.
You do a plus scan on your integer array, and that gives you the indices where each of your decoded segments start.
Yes.
That's the first step.
And then once you have that, you want to basically create an index map that corresponds to copying the first element at index zero to however many spots the first integer represents.
So if your sequence is ABC and your integer array is one, two, three, you want your resulting index map to be zero, one, one, two, two, two.
how do you do this exactly
you want to
and this is the this is the tricky part
so you have your plus scan
that is going to give you
I mean you can do it with a scan right
yeah, I mean, the plus scan gives you one
but the index map
you can make the index map with a scan
I don't
you could do a single pass
hang on
I think you just described... you need to figure out where... you're right. So you do your plus scan on your integer sequence,
and that gives you one, three, six, and those are the lengths... that's technically the ending index plus one of each of your segments. So if you prefix that plus scan...
or, technically, so, I mean, in Thrust this is an exclusive scan, but I don't
disambiguate. I only have one scan, and if you want an exclusive scan instead of an inclusive scan, you
just prefix it with a zero. So if you add a zero to your one, three, six... so what one, three, six,
to repeat and be clear, is, is the result of a one, two, three plus scan.
So the first element is just your one, your second element is one plus two,
and your third is one plus two plus three, so one, three, six. Prepend a zero, so you end up with zero,
one, three, six.
We're not going to need the six, folks.
That's a little secret.
And so now what you want to do
is you want to do a...
Oh, fuck.
This is a very weird scan
because...
Wait, wait, wait.
It's expanding.
I think we might have explained...
We might have explained this...
We might have explained the...
scatter completely wrong, because... all right, I'll just think out loud. I'm allowed to make mistakes.
So right now what I'm thinking is that with the n minus one values of your zero-prefixed plus scan, so
zero, one, three, six, if we drop the six, you want to scatter the indices, basically zero, one, and two,
those indices.
one to the one index,
and two to the three index,
which is that, those indices, those...
You're going to do a pass later that...
Then you're going to do a max scan.
So if your output sequence is going to be six values in length...
So you do two scans.
You do two scans. You do a plus scan on the integer sequence.
Then you do a scatter.
Yeah.
To get the initial value of each segment in your output sequence.
And then if you do a max scan... so your output sequence, it starts as six zeros.
Then it has zero in the zero index, one in the one index, and then two in the three index.
So it's zero, one, zero, two, zero, zero, I believe, if I got that right.
And if you do a max scan on that sequence, basically, and you know,
because the values are indices, so they're always increasing, you're going to end up with
0-1-1-2-2-2. So the 1 basically gets copied to everything up until the 2, and then the max-scan,
the max kicks in again, and the 1 resets to a 2, and then everything after that is a 2.
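Putting that whole pipeline together, here is a minimal sketch assuming Thrust and strictly positive run lengths (a zero-length run would make two segments scatter to the same offset and break this trick); the function name is illustrative:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

thrust::device_vector<char> run_length_decode(
    thrust::device_vector<char> const& values,   // e.g. {'A','B','C'}
    thrust::device_vector<int>  const& lengths)  // e.g. {1, 2, 3}
{
  int const n = static_cast<int>(values.size());

  // 1. Exclusive plus scan of the lengths gives each segment's start offset:
  //    {1,2,3} -> {0,1,3}; the total output length is the last offset + last length.
  thrust::device_vector<int> offsets(n);
  thrust::exclusive_scan(lengths.begin(), lengths.end(), offsets.begin());
  int const total = offsets.back() + lengths.back();

  // 2. Scatter the segment indices 0..n-1 to those offsets in a
  //    zero-initialized index map: {0,1,3} -> {0,1,0,2,0,0}.
  thrust::device_vector<int> map(total, 0);
  thrust::scatter(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(n),
                  offsets.begin(), map.begin());

  // 3. Max scan fills each segment with its index: {0,1,1,2,2,2}.
  thrust::inclusive_scan(map.begin(), map.end(), map.begin(),
                         thrust::maximum<int>());

  // 4. Read the values through that index map (a gather): "ABBCCC".
  thrust::device_vector<char> out(total);
  thrust::gather(map.begin(), map.end(), values.begin(), out.begin());
  return out;
}
```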
The thing that I've realized, though, is that I'm pretty sure, well, actually, I'm now
I'm confused, a scatter might not scatter indices to a destination.
it might
scatter the elements
to that index.
We have to look that up.
Actually, by look up, I'm just going to ask Gemini.
Actually, we'll get Gemini on the pod right now.
Write me a kernel
that does device-wide run-length decoding, using CUB.
What does the algorithm scatter
in the Thrust Parallel Algorithms Library do?
we're just
we're just two men sitting here
talking into a ChatGPT
I'm a little bit surprised
because in the past it usually reads it to me
and I guess I click some button that
turned off the live reading
but it says the thrust scatter algorithm
in the thrust parallel algorithm library
is a powerful and efficient parallel primitive
that reorders a sequence of elements
it copies elements from a source sequence
to a destination sequence
but instead of a one-to-one copy
it uses a map to determine the destination
index for each element.
So, yeah, so I...
It's doing the same thing as you.
I did...
No, no, no.
I messed it up.
So it says the essence
is that thrust scatter
performs the following.
The destination at the map
of the i-th index
is equal to
the source
at the i-th index;
that is, destination[map[i]] = source[i].
Which means...
Write it again,
but no comment
and make it as concise
as possible.
Remove all of the error handling.
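For reference, a minimal side-by-side sketch of the two directions, assuming Thrust and a toy output buffer (the function name is illustrative):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/gather.h>

void scatter_vs_gather(thrust::device_vector<char> const& src,
                       thrust::device_vector<int>  const& map,
                       thrust::device_vector<char>&       out)
{
  // scatter: the map indexes the *destination* -> out[map[i]] = src[i]
  thrust::scatter(src.begin(), src.end(), map.begin(), out.begin());

  // gather: the map indexes the *source* -> out[i] = src[map[i]]
  thrust::gather(map.begin(), map.end(), src.begin(), out.begin());
}
```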
So what does this mean?
Do we have to cancel the whole scatter conversation?
I don't know.
So I think in most cases,
you will do a first pass anyways
to determine the size of the decompressed sequence
because you need to allocate memory for it.
Like I struggle to imagine a use case
I struggle to imagine a use case where you want to do
the decoding, but where you don't need to know
ahead of time how large the thing is going to be when
decoded.
I mean, of course, but that's just a reduction.
Right, right, but I'm saying if you have
that in for, or if you're going to do that reduction...
And also, it's not even a reduction.
It's just the last element of the scan that you have to do on the...
Right, that's what I'm getting...
That's the point.
...getting to, is that if you're going to do the scan anyways... like, the way that,
when I asked ChatGPT, it did like a reduce first so that it could allocate the temporary storage,
and then it did a scan to figure out, you know, where shit's going to go, and then it did
a third pass to actually move stuff, and three passes seems like overkill. You don't need that
first reduce. You can just do the scan.
But I think that this may be a case where you can't get it down to one pass.
And the reason you can't get it down to one pass is because I can't imagine it being useful to be like, oh, I want to decode this thing.
And here's like, you know, a gigabyte of storage decode it into.
But if it's bigger than that, I don't care.
Although now that I say it, now that I say it, if I'm, maybe actually I'm wrong about that.
maybe like if I've got like some things, some image I'm decoding or something where like I want to decode it into like a fixed size buffer and then I'm going to write it to disk, right? Because like I don't know how large this thing is going to be when I decompress it. But I want to like decompress it in stages and then write it out to disk. And so once again, I shall tilt at windmills on my quest to figure out whether you can do this with a single-pass algorithm. I'm almost certain that there's a way to do this with a single-pass algorithm. However, it might require the use of some things that, um,
that we've never talked about before,
in particular a completely different model of GPU computing,
which we don't typically think about,
which I think may be required in this case,
and that would be a cooperative launch,
because if we have a bunch of groups of threads
that are working on this decoding problem,
like at the same time,
we either need to be able to have an unknown,
like an arbitrarily growing number of blocks,
like of threads, right?
Like each block of thread decodes a known amount.
Okay, maybe, okay, all right, all right, all right.
Let me, let me, I'm cooking here.
So when you launch a kuda kernel,
you have to specify how many blocks you're running with.
This microphone's a little bit waterproof, right?
I don't know.
Okay.
It's starting to rain now, folks.
We might need to kill this stuff.
I just have to say, I don't know how I'm going to edit this because I'm pretty sure
every time I said scatter earlier, what I should have been saying is gather,
but I'm not even confident about that because my confidence is shot because I totally
messed up the description of an algorithm.
Get ahold of yourself, man, we're cooking here.
I'm not... so, I can imagine two different types of implementations.
So when you launch a CUDA kernel, you need to, like, you need to specify statically
how many blocks of threads are going to execute.
And the best, there's two different ways
that we could approach this problem.
Way number one is the way that we typically think about
CUDA computing where each block of threads...
Look at this ice cream.
This is...
Yes, we're going to go in there.
Let me finish this spiel.
Each block of threads decodes a fixed chunk of work.
Typically when we're talking about these problems,
we imagine a world where there's a fixed number of blocks of threads,
and each block
does a fixed amount of work
and to do that you need to be able to know
how much work there is in total
in this case we don't know how much work there is in total
however we could imagine
that we could launch a kernel
that
where we assume
we're trying to get under cover
or we're going to the ice cream place
all right here
we're doing both but it's too loud now
it's all right
So we could imagine, one approach I could imagine would be to say we're going to do a fixed amount of work per block, we're going to launch a fixed number of thread blocks, and then if that is not sufficient, we launch a second kernel that continues the decoding.
Now that might mean that we end up, like, wildly... we might end up using a lot more compute than we planned,
because maybe the, maybe the decoding takes a lot less work than we planned. But the second approach,
the second approach is this cooperative approach. In this second approach, we launch a much smaller
number of blocks of threads, but the amount of work that each block of threads does is
variable. That is, we don't know
ahead of time how much work
they're going to do. They're each going to do
roughly the same amount of work, but it's going to be what
we sometimes call like a persistent
kernel or a cooperative kernel, where it's going
to continue, it's going to work
on a chunk of input, it's going
to do that chunk of input,
and then it's going to
load the next chunk of input and start doing
that next chunk of input. However,
to write that type of kernel,
you need to
be able to have all of the
blocks communicate with each other and synchronize with each other because in this case
once you're done with your chunk of work to be able to start in the next chunk of work
you need to know where everybody else wrote their chunk of work which would require you to do
a scan. But in this case we would maybe want to do a cooperative scan, and the advantage of doing that
is that we can do the cooperative scan
at the same time that we're loading
the next chunk of input
because while we don't know
where the next chunk of output will go,
we know we can statically assign
where the next chunk of input
will come from,
which means that we can be loading it
concurrently with figuring out that scan.
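A bare-bones sketch of that persistent-kernel shape, assuming CUDA cooperative groups; the load, scan, and decode bodies are left as comments, and the kernel must be launched with cudaLaunchCooperativeKernel for grid.sync() to be legal:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistent_decode(int num_chunks /*, runs, lengths, out... */)
{
  cg::grid_group grid = cg::this_grid();
  // Every block executes the same number of iterations so that all blocks
  // always arrive at grid.sync() together (no divergent grid-wide barriers).
  int const iters = (num_chunks + gridDim.x - 1) / gridDim.x;
  for (int it = 0; it < iters; ++it) {
    int const chunk = it * gridDim.x + blockIdx.x;
    // 1. Load this block's statically assigned input chunk (if chunk is in
    //    range); this load can overlap with the scan below.
    // 2. Cooperative scan across all blocks to agree on output offsets.
    grid.sync();
    // 3. Decode the chunk into the output at the agreed offset.
  }
}
```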
Okay, I think I may have lost some people there,
but we're going to go get ice cream.
Yeah, he lost me,
but it's hard to focus after the devastating
mess up. I'm just really hoping that gather
is, it's possible that everything I said is true for gather
and then I can go in and every time I say scatter,
I'm just going to cut in, I'm going to cut out scatter and cut in gather.
I believe that the thing that you described
is an algorithm within Thrust and CUB.
Also, you got to, wait, don't move, don't move.
You're going to get this thing off my head?
No, I'm just going to take pictures.
What is it?
It was a little fly, I think.
Don't worry.
It's okay.
It wasn't anything scary.
All right, folks.
Hopefully that was audible because we're standing next to a bar with like a dance party.
And we're ending this.
Goodbye from Copenhagen.
We're not done yet.
We're not done.
Yeah, yeah.
We're turning it.
No, we're getting ice cream.
Then we're going to continue.
We're getting ice cream.
We're not continuing.
And we will pick this up again.
That's actually a good question.
If we're recording and we might have to record again before Wednesday.
We're going to pick this up again in five minutes.
We will not. We will not. This is going on my backpack. It's raining.
And a good night from Copenhagen. Maybe we'll record next time.
We're not going to have the energy after Monday and Tuesday.
This has become very ADSP. No, I'll have energy for this.
For this? All right. You'll hear from us. Today is Saturday.
Just to be clear, this entire problem we've been talking about is essentially GPU decompression.
Yes. Yes. Yes. And it is, am I not correct? It is a very interesting algorithm.
It is a very interesting algorithm.
Oh, we didn't, okay, we need to mention the final thing,
which is why we were talking about replicate in the first place,
which is because of array broadcasting.
So imagine that you've got like a vector and a matrix,
and you want to, like, apply some operation in,
like you want to apply some element-wise operation,
like adding or, you know...
You're explaining this incorrectly.
It's not when you want to do an element-wise operation,
It's when you want to do some kind of binary operation on a matrix and a vector.
So the example that we were talking about earlier: Ashwin, who we're going to be meeting up with
tomorrow in Norway, posted internally something about a row-wise softmax, which is a computation
in, I don't know, machine learning stuff.
But essentially, you're performing a bunch of binary...
The fundamental operation to flash attention, which is fundamental to GPU inference,
which pays all of our bills.
It's just so sad because they give it a name like softmax and you're just like,
like all these things, softmax, ReLU.
It's just a combination of a couple of binary operations.
Like it's nothing fancy.
We will do a whole episode about softmax and why it's magical and we'll talk about Argmax too.
We'll do that.
The three binary, it's three, two binary operations and a unary operation.
There's max.
There's summation.
I guess there's division.
And then there's also exponentiation.
But it's like you do those things in order.
You're done.
It's like, it's so boring.
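For the record, the standard numerically stable row-wise formulation, which is not spelled out in the episode but is exactly those pieces (a max, a subtraction, an exponential, a sum, and a division), is:

$$\operatorname{softmax}(x)_i = \frac{e^{\,x_i - \max_j x_j}}{\sum_k e^{\,x_k - \max_j x_j}}$$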
And like they're like, softmax.
You hear softmax,
and you're like, ooh, like, how complicated is that?
Do you know that there are entire teams that exist at NVIDIA,
whose job it is to make softmax run faster?
I'm just, I'm just, it's, anyway, I just, like, these names, like, and also ReLU,
like, I can't remember what ReLU is, and there's a couple different versions of it,
but, like, I'm pretty sure it's just, like, one over a number,
or something like that, or it's, like, capped a...
Anyways, I just, like, as someone who studied actual mathematics,
and I suffered through actually, like, difficult stuff,
they give all these fancy names, and it's just, like, it's just basic math.
Anyways, row-wise softmax involves basically needing to do an operation where you calculate the maximum of each row,
and then you divide each of the values in the matrix by the maximum of the corresponding row.
Bryce is dancing now.
And so what that means is that if I've got a matrix, I compute however many row maximums,
and then I want to divide that maximum into each
of the values in each of those rows.
So it's not an element-wise operation.
It's an element-wise operation
on a row-wise basis,
which means, like, if I've got a three-by-three matrix
and three values for the rows,
I want to divide the first row by the first max,
the second row. Anyways, I think people get it.
The point being, though, is you can use replicate
to turn that vector into...
Not a matrix, but, like,
or I guess basically a matrix.
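A hedged sketch of that pattern, assuming Thrust, row-major storage, and the simplified divide-by-row-max operation described here; all names are illustrative, not the parrot API:

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Key functor: flat element i belongs to row i / cols.
struct row_of {
  int cols;
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

// Divide every element of a row-major rows x cols matrix by its row's maximum.
void divide_rows_by_max(thrust::device_vector<float>& m, int rows, int cols)
{
  auto keys = thrust::make_transform_iterator(
      thrust::counting_iterator<int>(0), row_of{cols});

  // Row-wise max via reduce_by_key (the variable-length-segment machinery
  // discussed earlier, used here on fixed-length rows).
  thrust::device_vector<float> row_max(rows);
  thrust::reduce_by_key(keys, keys + rows * cols, m.begin(),
                        thrust::make_discard_iterator(), row_max.begin(),
                        thrust::equal_to<int>(), thrust::maximum<float>());

  // "Replicate" each row max across its row lazily: the same i / cols index
  // map fed to a permutation iterator, i.e. the integer-division replicate.
  auto maxes = thrust::make_permutation_iterator(row_max.begin(), keys);
  thrust::transform(m.begin(), m.end(), maxes, m.begin(),
                    thrust::divides<float>());
}
```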
For anybody who's familiar with, like,
NumPy-style broadcasting...
Hopefully we've not lost the...
We might have blown out the gain a little.
That might have been my bad.
But for anybody who's familiar with NumPy-style broadcasting,
where when you're working with two arrays of different arity,
the smaller array can be stretched.
It's arity, not arty?
Arty sounds like Ariana Grande's cousin.
All right.
We're cooking here.
Where it can be stretched to the rank of the larger array.
And you can use replicate to do that stretching.
Anyways, we're going to go get ice cream now, but that's why we were interested in talking about replicate.
It's actually not why we were interested in it. I only realized when I was trying to implement the row-wise softmax that, like, oh, replicate's a very convenient thing if you don't have broadcasting in your library or language.
I had implemented replicate a month before because I was trying to solve some problem that clearly just needed to replicate.
And then when it came to implementing, it was like literally some Perl Weekly Challenge or LeetCode problem where it was just like, oh, like, clearly what I'm
missing here is replicate, which is, it's just an APL primitive, and I hadn't added it.
And this library that I've been working on, it's heavily inspired by all the array operations.
We may need to talk to the, like, the, we may need to talk to some of, like, the, like, nvJPEG and, like, the nvCOMP people, the compression library people.
Because they know about compression?
Well, yeah, I mean, I think the people who make the compression library know about compression, yeah.
Do we need to talk to them? No. Will we talk to them? Maybe.
I do believe that a single-pass algorithm exists here.
I mean, I always believe a single-pass algorithm exists here,
but, all right, Connor's fixing his sweater.
His tan invidia sweater.
It's gold.
There is no world in which this is gold.
All right, we're going to go to ice cream.
And I am dancing now.
Very poorly.
But I did learn some dance moves
When I was in Ethiopia
They taught me how to do the shoulder thing
Which I kind of remember how to do
Yeah
There we go folks
She fixed the sweater
My only sweater and my only shorts
That I brought to Norway at the middle
That's true I do have a
He brought shorts to Norway
He's going to be cold
Have I been cold this whole time
But right now we're in Denmark
not Norway. It's going to be colder in Norway.
I have been in Norway in September.
Trust me. It's going to be cold.
I think we're going to be just fine.
Was I slightly underdressed for a couple of the restaurants we went to?
It's not a maybe.
It's a most definitely yes.
I took this man to the nicest restaurant in Aarhus, or however you pronounce it,
wearing his Lululemon shorts and a running t-shirt.
Yeah.
Was it a running t-shirt?
I know. I kept the hoodie on.
I kept the hoodie.
Yeah, yeah.
Listen, you know what?
I would have brought nicer clothes if I had thought about the fact that Bryce wanted to eat at fancy places.
I need you to say that again with a straight face.
Is this what, so that we can put this on the podcast?
I don't know what I'm doing with this, but...
So, Bryce asked, was it ChatGPT or Gemini?
ChatGPT.
For some humorous names for scatter and gather, because of my misunderstanding earlier.
One of the initial unhumorous ones was scatter_to and gather_from, which I really like. I was already
thinking about scatter_to. But anyways, the top humorous one was yeet and yoink, and then Bryce thought
that was hilarious and that's what we should call it. But my response was, well, that doesn't make
any sense, because yeet means to, like, delete, to get rid of, to toss it. You know, when they say
yeet, it means get rid of it, and that's not what scatter's doing. So semantically it has nothing
to do with the actual semantics. So I believe you said, specifically, yeet
semantically has nothing to do with the actual operation.
Yeah, yeah, something like that.
Oh, God, I'm dying.
I'm going to read some of the other, some of the other ones.
Be sure to check these show notes, either in your podcast app
or at adspthepodcast.com for links to anything we mentioned in today's episode,
as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.
Thanks for listening.
We hope you enjoyed and have a great day.
Low quality, high quantity.
That is the tagline of our podcast.
It's not the tagline.
Our tagline is chaos with sprinkles of information.