Algorithms + Data Structures = Programs - Episode 232: Algorithms! Live from New York!
Episode Date: May 2, 2025

In this episode, Conor and Bryce chat about algorithms, overload sets, libraries and more, live from New York!

Link to Episode 232 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)

Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach

Show Notes
Date Generated: 2025-04-14
Date Released: 2025-05-02
Thrust Library
CUB Library
thrust::reduce_by_key
thrust::permutation_iterator
Clojure partition
thrust::transform_reduce
Haskell divvy
"Algorithm Selection" Blog (std::mismatch)
thrust::discard_iterator
C++ std::partition_copy
thrust::unique_count
thrust::tabulate
Hoogle Translate mapAdjacent
Hoogle Translate iota

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
Live from New York, it's Monday night.
A pine cone?
Is that a special thing in New York?
Oh, I was going to say, we got lots of pine cones in Canada.
Why are we talking about pine cones?
You know, I know the comedian who bombed tonight
tried to make a callback that failed,
but I've got a callback in this episode
that I think is going to be very successful,
which is the callback to the reduce by key
example that we gave earlier,
where you want to reduce each row in a matrix. How'd you feel about the squid fries? I expected potato
fries with the squid-like essence. Instead they were pieces of squid that
were fried. So I was very shocked. If I could get it french fries at this hour I would.
Instead of the big algorithm library, which is what we have today, we should have gone with a small, composable algorithm library.
We're okay with an if-then situation. We're if dog, then cat.
There, I tried to loop it back into technical things, because you guys refuse to talk technical.
Whoa, whoa, whoa, whoa, whoa, listen, folks. Welcome to ADSP the podcast episode 232 recorded on April 14th, 2025.
My name is Connor and today with my co-host, Bryce, we record live from New York, we chat about algorithms, overload sets, libraries, and more.
All right, go.
Are we recording now?
Yeah, you're not exactly walking in a straight line.
What?
This line looks entirely straight.
Should we do a sobriety test right now?
Yes, we definitely should.
Boom, boom.
Actually, I think I would pass to be honest.
We're here, potentially episode 232 on the streets of New York.
Do you actually know what street we're on?
Here, you hold the mic.
You'll be doing most of the talking.
I think we're on like 23rd and something.
We just left the New York Comedy Club after seeing, I'm going to say, two out of the five
comedians were really, really good.
No, I'm sorry, Shima. I'm sorry.
The guy who bombed...
He was awful.
Yeah, but him bombing was in and of itself amusing.
Just stop talking about pedophilia, folks.
If the crowd's not into it, change topics.
Don't keep on coming back to pedophilia.
And like, I've seen people bomb at that club before, and when they get desperate, they always turn to lewdness.
It's like anyways, why are we in New York, Connor?
We're here for, actually, I don't actually know the official name.
I'm going to do my best and then Bryce will correct me.
It is for a modern CUDA C++ training, a recently developed state of the art training
for targeting Nvidia GPUs.
How close was the title of, quote, modern CUDA C++ training, end quote?
I think that is what I would call it
because the actual official name,
which is like fundamentals of accelerated computing
and modern CUDA C++ is way too long.
That's a mouthful. But anyways, Connor is here in New York because, well, you weren't planning on being here.
When did you even, when did this even materialize?
So yes, to our global listeners, I was asked, today is April 14th, I believe.
Apologies to my meetup attendees
that I had to move a meetup
that we were supposed to have tonight.
I was asked initially one week ago on Monday
if I was available to help with the training.
And I said, if you really need me,
the answer is yes.
What is this?
USPS? Is that what you pointed at?
A pine cone. Is that a special thing in New York?
I was gonna say, we got lots of pine cones in Canada. Why are we talking about pine cones?
Anyways, I was asked a week ago. I said, if you need me, yes. But, once again, I have to say, folks, I'm sorry to disappoint some of our listeners, but if you can tolerate an opinion you might potentially disagree
with, we are temporarily trying to boycott the states. It's not going very well because
I'm going to be having three different trips down here over the next two months. This is
the first of the three, but yeah, things are a bit crazy. Anyway, so a week ago I said yes if you need me, but preferably no. It turns out they did need me. You notified me on
Wednesday, I got approval on Thursday, and here we are on Monday. And yeah, we're here with my
beautiful fiancée. Do you want to say hi, sweetie? Hello. What a lovely voice. And so yeah, we
were here for the training, and the training I think went phenomenally.
Well, and I should explain the like, I got asked to do this training like 10 days ago,
which is a little bit short notice.
And it's not, this is not like a training I've done before.
This was like an eight hour training, but I, you know, like I give a lot of talks.
If someone asks me to give a talk at work, I'm like, sure. But I was initially told there
would be like 20 people, and I was like, all right, that's fine, 20 people. We like to have a 10 to 1
ratio of instructors to students, so I'm like, I'll find one person in the New York
area. And so then I ask, there's a lot of people on my team who live
in New York, there's maybe four or five people, and I ask all of them,
and every single person, like, one person's got to take their dog somewhere,
one person's getting work done on the house, renovations, everybody had something. I was
like, uh-oh. And then we have some planning meetings, and somebody's like, oh yeah, there'll be 80 to 100 people. And I'm like, wait, that is a lot more people than I expected.
And so now like my search expands beyond New York and I'm asking like everybody
who I even remotely work within Boston.
And then I'm like, oh, Toronto's nearby.
So maybe I'll ask Connor.
Long story short, we ended up flying two people from Toronto in and then two
people from Boston came and then one person came
all the way out from California for like less than 24 hours to do this training.
But it turned out great.
We had a great time.
We taught people lots of things about CUDA C++.
Great questions were asked.
A fun time was had by all, and I have not lost my voice yet.
Professor Bryce is, maybe that'll be a thing.
Maybe that'll be a thing.
We'll see.
And I will say, they were phenomenal students,
if you will, I mean, they're all professionals,
so maybe calling them students isn't the right word,
but yeah, great questions.
And there was one really interesting question.
So one of the examples in the talk was,
this is a problem that, in a recent talk I gave on CUDA C++, I started talking about. I gave
an example of using some of the reduce by key algorithms.
We've talked about them on the podcast before, but the canonical example of
reduce by key is you want to reduce the rows of a row-major matrix.
Like you want to do one reduction for each row. So reduce by key, what it does
is it takes a sequence of keys as an input and then a sequence of values as an input.
And for each run of consecutive keys, so for each sequence of consecutive keys that are
all the same, it'll reduce that set of values. And so if you've got a row-major matrix and you want to reduce each row,
each row is contiguous. So the key that you use
is just the row that that particular element is in.
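(For readers following along, here is a minimal sketch of what is being described. It is not code from the episode, and the `row_of` helper name is made up: one sum per row of a row-major matrix with thrust::reduce_by_key, using each element's row index as its key, so every row is one run of equal, consecutive keys.)

```cpp
// Minimal sketch, not the episode's code; `row_of` is a made-up helper.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

struct row_of {
  int cols;
  __host__ __device__ int operator()(int linear_index) const {
    return linear_index / cols;  // row-major: consecutive elements share a row
  }
};

void row_sums(const thrust::device_vector<float>& matrix, int rows, int cols,
              thrust::device_vector<int>& row_keys,      // pre-sized to rows
              thrust::device_vector<float>& row_totals)  // pre-sized to rows
{
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), row_of{cols});
  thrust::reduce_by_key(keys, keys + rows * cols,  // one key per element
                        matrix.begin(),            // values
                        row_keys.begin(),          // reduced keys out (one per row)
                        row_totals.begin());       // one sum per row out
}
```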
But what if you want to do a reduce by key
of the columns? Well,
the columns are not contiguous, so the key sequence for the
column of each cell would be like 0, 1, 2, 3, 4, 5, 6, 7, 8. And reduce by key
only combines consecutive keys that are the same.
So if you have a bunch of keys that have the same, you know, column ID but they're
not next to each other in the key sequence,
it doesn't matter, they're going to be separate groups. And so someone
was asking, well how would you do reduce by key on that? And just there's, you
know, you could have a new algorithm like an unordered reduce by key or one that
like does a sort or something. But we were sort of theory crafting there and
you know, Connor, I was saying I think you can do it with reduce by key, you just need the right type of iterator, and
Connor, you suggested how it can be done. I mean, it's quite obvious, and also
it's obvious for me because in a research project I implemented an
algorithm that was called chunk fold, where chunk is the equivalent of the C++23
chunk that takes an integer, which is basically this, right?
You're doing a chunking by an integer size, and then folding afterwards is the equivalent of doing a
row-wise summation on a 2D matrix. And the way that I do that is exactly what you did. So the column-wise
one is pretty intuitive, and we have a fancy iterator in thrust called the permutation iterator, and so you could just build up that index permutation sequence
that gives you the columns.
Would it be efficient?
Definitely not.
But it would get the job done.
And so the idea here, I think, is that you would have the key sequence where you'd have all
the column IDs contiguous, and then the value sequence would be the thing that would be
the permutation iterator.
So in effect you're doing like a reduce that also does a transpose of the data.
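(A rough sketch of that idea, again not the episode's code and with hypothetical functor names: gather the elements in column-major order with a thrust::permutation_iterator so that equal column keys become consecutive, then apply the same reduce_by_key as in the row-wise case.)

```cpp
// Rough sketch of the column-wise variant; functor names are hypothetical.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

struct transpose_index {
  int rows, cols;
  __host__ __device__ int operator()(int i) const {
    // i walks the matrix column by column; map it back to row-major storage
    int col = i / rows, row = i % rows;
    return row * cols + col;
  }
};

struct col_of {
  int rows;
  __host__ __device__ int operator()(int i) const { return i / rows; }
};

void col_sums(const thrust::device_vector<float>& matrix, int rows, int cols,
              thrust::device_vector<int>& col_keys,      // pre-sized to cols
              thrust::device_vector<float>& col_totals)  // pre-sized to cols
{
  auto gather_map = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), transpose_index{rows, cols});
  auto values = thrust::make_permutation_iterator(matrix.begin(), gather_map);
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), col_of{rows});
  thrust::reduce_by_key(keys, keys + rows * cols, values,
                        col_keys.begin(), col_totals.begin());
}
```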
Like a reduce by key with a transpose built in there. And
it's actually interesting because there are more efficient ways to do a
transpose than with the permutation
iterator and so like maybe we need to have some sort of transpose iterator
that under the hood the library could recognize that it's a transpose and use
some efficient transpose mechanism, because the problem with doing the reduce, like the column summation, is that you're doing a gather.
So you're striding through memory with these big strides and that can be inefficient.
And so there may be a slightly more efficient way to do this.
I don't know whether there's like some way to write some transpose iterator that's going to be better or that would be recognized by the
underlying algorithms.
But it was a really good question that someone asked and now I want to go and I want to add it as another example to one of my talks.
Did you have any interesting encounters with people today? Did you learn anything new?
Did I personally learn anything new? I don't think so.
But I think my overall takeaway is that the folks
at this corporate entity that-
I don't believe you.
There's no way you didn't learn one new thing.
At least in the training.
I didn't learn one.
Oh, yes.
I mean, actually I did learn,
I thought you meant from the questions
and the interaction with the students.
But from the trainings, yes.
I mean, similar to one of the other trainee individuals, I did not know what the third
integer parameter of the triple chevron kernel launch is, and that was for dynamic shared
memory.
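(For reference, a generic illustration of that launch parameter, not one of the training's exercises: the third value inside the triple chevrons is the number of bytes of dynamic shared memory per block, which is what sizes the `extern __shared__` array inside the kernel. The kernel and names here are my own example.)

```cpp
// Generic example: the third chevron argument is dynamic shared memory bytes.
#include <cstddef>

__global__ void block_sum(const float* in, float* out) {
  extern __shared__ float tile[];  // sized by the launch, not here
  tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
  __syncthreads();
  // simple tree reduction within the block (assumes blockDim.x is a power of two)
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

void launch_block_sum(const float* in, float* out, int blocks, int threads) {
  std::size_t shmem_bytes = threads * sizeof(float);   // <-- the third parameter
  block_sum<<<blocks, threads, shmem_bytes>>>(in, out);
}
```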
Admittedly, I spend very little of my time writing raw kernels most of the time.
Well, you should not.
That's a bad...
You should not do that.
You should not do that because it's very hard to do that correctly, and there's libraries,
and you should just use libraries. Which is 67% of the point of the training: target Thrust, target CUB.
We have many libraries, very optimized by CUDA ninjas, that you can use so that you do not have to write your own kernels.
And so I don't want to throw our other TA under the bus, but someone asked, what is that third
triple chevron parameter?
And in my head I was like, yeah, I didn't get asked that question, but I would have
also had to ask and it was for dynamic shared memory.
So yeah, I definitely did learn.
That TA knew the answer.
He just thought that everybody in the room should know the answer.
And so definitely I learned a few things.
There were a few of the times I was actually
working through the training, because I had not
seen the training before.
I always forget that with NVCC, generic lambdas
are not supported.
So I naturally always reach for auto type
in the parameter list of a host and device annotated lambda
with NVCC, and that'll fail to compile.
It's a simple fix, but almost like every single time
for the first couple of examples,
I would write that and then be corrected.
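(A small illustration of that pitfall. This is my own example, and nvcc's exact restrictions vary by version and flags; the point is the shape of the fix, spelling the parameter type instead of `auto`.)

```cpp
// My own example (compiled with nvcc --extended-lambda): an `auto` parameter
// in a __host__ __device__ lambda may be rejected; a concrete type compiles.
#include <thrust/device_vector.h>
#include <thrust/transform.h>

void scale(thrust::device_vector<float>& v, float k) {
  // thrust::transform(v.begin(), v.end(), v.begin(),
  //                   [=] __host__ __device__ (auto x) { return k * x; });  // may not compile
  thrust::transform(v.begin(), v.end(), v.begin(),
                    [=] __host__ __device__ (float x) { return k * x; });    // the simple fix
}
```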
There was a few other things too.
I can't remember them all off the top of my head,
but yes, from the training itself,
like not in the first half,
definitely all of that stuff is my expertise.
But in the second half of the training, definitely there was some stuff I picked up.
I feel like there were other things we were supposed to talk about.
God, I have a list somewhere.
But after eight hours of talking and then a comedy show,
my brain is not sufficiently there to remember things.
But what else were we supposed to talk about?
I mean, we went to dinner. I mean, Shima was there.
We can get Shima's review on the comedy club.
Uh, you said what?
Two of the?
I think two of the five were good.
The there was, uh, the first and the last, the first, the first and the second.
The first guy who, who he had, he had a joke.
His last joke was about that or his last sequence of jokes.
There was this reference to these
tortoises being intimate, and the guy was wearing two or three jackets, and he kind
of looked like a tortoise, and I swear it was intentional.
I swear that man intended to look like a tortoise.
He was funny.
And then there was this tall Jewish fellow who had a very, very long face who was talking
about his crazy neighbor, and I just thought I thought he was the best I thought it was
hilarious. What say you, Shima? We're giving the mic over to my fiancée here. I don't know, not the first time, but you've been on the podcast a few times before. Here we go, we're sitting down, we'll try not to upset the other patrons.
Sharing that comedy club is crazy.
Nobody cares about what I think of a comedy.
There are many people that care.
Okay, let me see. So my review of the comedy show is that admittedly it's better than Toronto comedy.
We're very familiar with the Toronto comedy scene.
And I have to admit that the New York comedy is better.
So I'll give you that admission.
Other than that, there were five comedians.
Two women, three men.
Four out of five, I thought, were very funny.
And I think objectively, people would think it was funny.
There's just one guy that bombed.
Which is okay, that happens.
Yeah, nobody really related with his jokes.
And that's okay. But the other four were very funny.
So it was a good night.
Was it the other four? I mean, I know the last one, she was pretty good too.
Was the other one pretty good?
That's actually, the other one was very funny, yeah.
She was the one that was like,
yeah, like, people say like, don't you pray?
She's like, I meditate.
Oh yeah, yeah, yeah.
Just in the same direction five times.
They were all pretty good.
That woman's one good joke.
One good joke.
Wait, before I move over, because we're separated by this big white table here.
And we've got to record for at least like 50 more minutes to get a full episode out of this.
What else would you like to say?
We went to a Korean restaurant tonight with some other Nvidia folks.
A lovely dinner.
We talked shop.
A lot of talk about the future,
the plans for the training, the company.
The singularity.
The singularity, yeah.
What was your review of the dinner?
It was a great dinner.
What else is there to say?
Lovely food, lovely company.
How'd you feel about the squid fries?
I expected potato fries with a squid-like essence.
Instead, they were pieces of squid that were fried.
So I was very shocked.
And if I could get french fries at this hour, I would.
And this is totally off-topic for this podcast,
but we had two different individuals
talk to us about our potential future acquisitions
of pets, specifically dogs, and a little bit cats.
Have we, has our, as a, you know, partnership, has our view evolved of what we will be acquiring?
We're okay with an if-then situation. We're if dog, then cat.
There, I tried to loop it back into technical things because you guys refused
to talk technical.
Whoa, whoa, whoa, whoa, whoa. Listen, listen folks. That's a very technical podcast.
I did my part. I did my part.
Yeah, also, you know, Bryce is known for, you know, we are Mr. and Mr. Tangent, you know. How many episodes have we had, Bryce? All right, say goodbye, sweetie, I'm gonna go over there
and we're gonna continue recording.
Goodbye.
That's such a lovely voice.
We're meanderers, folks, we're meanderers.
How many times, Bryce, have we started recording an episode
and like two minutes in,
we went in a completely different direction
and the first two minutes was,
this is what we're gonna talk about,
and then the remaining 28 minutes is we talked about absolutely nothing to do
with what we talked about in the first two minutes. How many times? I think it's been a lot of times. I think that's my fault, I think that's on me for the most part. Oh gosh, I'm trying to think about what else we're supposed to talk about. Oh yeah, we can spin this into the topics that we had before, that we never got to on the previous recording. So anyways, this is going to be one episode, folks.
We're going to, we're going to call this, I don't know, New York, New York plus something
else.
The training was great.
Hope to do this again in the future.
And I think the, yeah, the folks at the training, they, they loved it.
I think they got a lot out of it, and they were great. Spinning back to a couple episodes
ago, the last time we recorded, we ended up doing the Tesla,
or not the Tesla, and talking about AI a bunch,
and we didn't get to overload sets.
Do you wanna talk about overload sets right now?
Or do you wanna talk about the stuff that we were
supposed to bring up from earlier today?
Because we can take this in two different directions.
I mean, that was one of the topics that you want.
What?
I don't know, I think there was a few other things
that you wanted to talk about from earlier today but I'm not sure which one...
Let's talk about the overload sets. I feel like I'm slowly... oh god, I'm turning into such a C++
personality. I feel like I'm slowly, over the course of many years, assembling
content for a book, which I know, like every person once they reach a certain age in the C++
community starts writing a book. But I've been thinking more about, like, design.
Oh, excuse me, excuse me, did you guys forget a phone? Oh, Connor just noticed
that somebody left a phone. Connor the hero, the hero. So, alright, I will... I hate to do this to my co-workers.
I hate to do this to my co-workers. And I'll preface this by saying I don't mean to shame
anyone, okay? I don't mean to shame anyone. I'm not, This is not a tactic to push forward any changes. However, I'm
going to explain a situation. So we have an interface. There's this method on a...
Should we move to these couches over here? We can move to those couches. We'll tell the
story about the cookies. All right, I'll tell the story about the cookies now. So we were
at this training all day. I was doing the lecture part, and then there
were these instructors who were going out
for the interactive part.
So I was talking a lot.
And I was told at lunch, well, first of all, at lunch,
I picked a box lunch that did not have a cookie in it.
And I was sad, so then I picked a different box lunch
that had a cookie in it, but that cookie was not
particularly good.
And then I was promised later in the day
that there would be a cookie bar.
And you know what happened?
Connor, was there a cookie bar?
There was not.
And the patrons that were attending the training did get live updates
when we came back from break and there was no cookie bar.
But then one of my co-workers who had come in from Toronto, Ashwin,
he asked where I would get cookies from.
And I mentioned, well, there's, you know from and I mentioned well there's you
know Levine, Levine's, there's Mamon and a couple other places and then he offered
to order cookies for the whole team and I was a little hesitant but I was like you know what
I could use a cookie so we got these Levine cookies and these Levine
cookies they are they're not really cookies, they're like cakes. They're like big, thick things.
It's like a thousand calories.
But I ate one of them nice and warm, and it was just perfect.
Anyways, back to Overload Sense.
So yeah, so this, again, apologies to my coworkers.
I don't mean to make anyone feel bad.
But there's this method on
a class and it's called wait.
And there's two different overloads of wait.
One overload takes no argument and the other overload takes an object.
And the first overload of wait, when you call it, it blocks the caller.
The second overload of wait does not block the caller.
It tells the object, you object should wait upon this thing that I'm passing to you.
So like the object that you're calling it on is this thing that we call a stream.
And so the first overload of wait that takes no argument, it says,
hey, I want to block until this stream of work is finished.
And then the second overload of wait says, tell this stream of work
to wait on another stream of work. But it does not block the caller.
And I think that this is not the correct design,
because I could understand if you just had the second semantics.
If you had the function, you called it wait,
but it didn't block, it just told the thing to wait upon some other thing.
But I cannot abide having two overloads of the same thing,
where one of them will block the caller,
but the other one will not block the caller.
I think that this is confusing, and I
think that this is not good interface design.
And I feel that a good overload resolution set
should have the same semantic behavior,
and for anything that has a subtlety, like for example, whether
a parameter is modified, whether or not the function blocks, things like that, anything
that's got caveats that requires nuance to understand should be the same across all of
the members of that overload set. And if two different overloads in this overload set
have some subtle nuanced differences,
like one of them synchronizes and one of them does not,
then those should be two separate functions.
Or if one of them modifies a parameter
and one of them does not modify that parameter,
those should be two separate functions. And I feel like there is some fundamental design
principle here that belongs in the Book of Bryce API design.
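(The library in question is deliberately left unnamed above, so here is a purely hypothetical sketch of the shape being criticized and of the split being suggested; none of these names are a real API.)

```cpp
// Purely hypothetical sketch; not the real interface.
struct event;

struct stream {
  // Problematic: same name, different semantics.
  void wait();                // blocks the *caller* until this stream drains
  void wait(const event& e);  // does NOT block; enqueues "this stream waits on e"

  // Suggested instead: different semantics get different names.
  void sync();                        // blocks the caller
  void enqueue_wait(const event& e);  // non-blocking dependency
};
```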
The Book of Bryce. I look forward to reading that.
Honestly, that is what I would call it, I'm sure, because, you know, my ego is very large.
Yeah, I mean, I like it. I like it.
I mean, I don't think anyone would expect anything less.
Honestly, I have nothing to add.
I completely agree.
And when I was trying to brainstorm in my head,
what are the good overloaded functions or APIs
that I can think of?
Like, the one that came to mind is from Clojure.
It's a partition function that is kind of a sliding window or chunking function where it takes a sequence and chunks it into
different sequences. I don't remember what the default behavior is. I think
it's a chunking behavior, but it has overloads of it which lets you set the
stepping and like the windowing size, you know, and you know, it's basically that
the API that has the fewest parameters just has defaults basically for the others. So
the behavior is achievable by the one that has an overload that has more parameters.
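(A hypothetical C++ rendering of that shape, not an existing library: the shorter overload is literally the general one with a default filled in, so its behavior is always reachable from the general spelling. Assumes size and step are positive.)

```cpp
// Hypothetical names; a sketch of "shorter overloads are just defaults".
#include <cstddef>
#include <vector>

// General form: windows of `size`, advancing by `step`.
std::vector<std::vector<int>> partition(const std::vector<int>& xs, int size, int step) {
  std::vector<std::vector<int>> out;
  for (std::size_t i = 0; i + static_cast<std::size_t>(size) <= xs.size();
       i += static_cast<std::size_t>(step))
    out.emplace_back(xs.begin() + i, xs.begin() + i + size);
  return out;
}

// Chunking is just step == size; a sliding window would be step == 1.
std::vector<std::vector<int>> partition(const std::vector<int>& xs, int size) {
  return partition(xs, size, size);
}
```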
And so definitely, what you're talking about here, it's like you can't get to the
first wait function from the second one, because it's just completely different semantics. So yeah, I mean, you're
telling me this and it just sounds like... so I like that rule,
that if the overload has additional parameters, and some of them are
defaulted, you should be able to get to
the most defaulted version from the more explicit versions. I think that that is another good rule
to go into the book of design. Yeah, I like that rule a lot. Well, there's another rule that I
remember when we were designing my first committee paper, the binary-binary transform reduce. Or maybe it was one of the scan algorithms,
but it was one of these algorithms that takes a bunch of
different parameters, like it takes two operators and then an initial value. And there was some question about the ordering,
whether it should be like operator one, init, operator two, or like init, operator, operator, and
I thought that init, operator, operator made more sense. But STL, not the library, but the person who maintains the Microsoft standard library.
He made this argument that when you have these additional
parameters for new overloads, that they should be additive,
that the order should be the same.
If you have a three argument version and a four argument
version, that the position of the arguments should not move.
Like if the third argument should not become the fourth argument in the next overload.
That if it was the third argument in the version that had just three arguments, that it should
remain the third argument and that the fourth argument should be added.
And that was why you should have this somewhat unintuitive ordering where it's like operator, init,
operator, instead of having the init come before both of
the operators.
And I thought that that made sense, too.
And I guess it helps with overload resolution to avoid
ambiguities in some cases, too.
But his argument was not about the disambiguating.
His argument was just that this is the more natural way
to have it be. And that was compelling to me.
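(A small illustration of the rule using C++17 std::transform_reduce, whose signatures do follow it: the first arguments keep their positions, and the operators are appended after the init rather than the init moving around.)

```cpp
// Illustration with the C++17 std::transform_reduce overloads.
#include <functional>
#include <numeric>
#include <vector>

int demo(const std::vector<int>& a, const std::vector<int>& b) {
  // 4-argument form: defaults to plus/multiplies (an inner product).
  int dot = std::transform_reduce(a.begin(), a.end(), b.begin(), 0);
  // 6-argument form: the first four arguments are unchanged, operators appended.
  int dot2 = std::transform_reduce(a.begin(), a.end(), b.begin(), 0,
                                   std::plus<>{}, std::multiplies<>{});
  return dot + dot2;
}
```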
What's interesting about that is, I agree with what you're saying, but I
know the exact thing you're talking about, and that is that the thrust
transform reduce and the C++17 std transform reduce have a different
parameter order.
And I'm almost positive that thrust is the one that does binary operator, init, binary
operator, but the std transform reduce from C++17 does init, operator, operator.
Which, it is possible that younger Bryce was stubborn and at the time fought for and successfully advocated
for the order that made more sense to him, and only later, retroactively in my mind,
I came around to agree with STL.
And it's also possible that we only realized the mistake too late and that we weren't able
to fix it.
Well, because, I mean, an init's always necessary.
So I thought the one that was...
I think this may have been in the order of maybe the parameters for the scan because
inclusive scan and exclusive scan, because in exclusive scan, you always have to have
the initial value, whereas in an inclusive scan, it's optional. So in an exclusive scan, the order is inputs, outputs,
initial value, operator.
And then for inclusive scan, it's operator, initial value.
Because you can have the operator without an initial value with inclusive scan,
but for exclusive scan you always have to have the initial value.
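(Checking that recollection against the C++17 <numeric> signatures: exclusive_scan's mandatory init comes before the operator, and inclusive_scan's optional init comes after it. The output vector in this generic example is assumed to be at least as large as the input.)

```cpp
// The C++17 <numeric> signatures reflect exactly that ordering.
#include <functional>
#include <numeric>
#include <vector>

void scans(const std::vector<int>& in, std::vector<int>& out) {
  std::exclusive_scan(in.begin(), in.end(), out.begin(), 0, std::plus<>{});  // init, op
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{});     // op only
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, 0);  // op, init
}
```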
So it might have been something around one of the scans or maybe around the transform scan algorithms,
the compound transform scan algorithms, that may have been it,
that the transform inclusive scan and transform exclusive scan.
I remember it was something with two operators, and the transform inclusive scan and transform
exclusive scan have two operators, because they have the scan operator and they have
the transform operator.
Okay, yeah, that makes sense.
If you're talking about the scans, I was thinking about the transform reduces.
Somebody could go to like the meeting minutes from like 2017 C++ committee meetings and find when STL made these points.
It would have been in like the library minutes from very early committee days.
But I think that the key point is that this argument is not the specific history here,
but this argument that STL made, that if you have an overload that takes two
arguments, and an overload that takes three arguments,
and an overload that takes four arguments,
that it's additive,
that the first two arguments are the same
in all of these overloads.
That if it's the same argument,
you should not move around its position
across different overloads.
And I think that was compelling to me.
Yeah, I mean, that completely syncs with my sense
of, like, you should be able to replicate... for instance, the same kind of partition function
in Clojure that leads to a couple of these sliding chunking functions. Haskell only provides you the
most generic version, where you provide a step value and a window value,
and you can spell the stride, chunk, and slide versions by specifying one, or a value of n, for either of the parameters.
But the point being is, at the generic level, you can spell everything.
You know, it's like when I posted that blog post once that talks about how std mismatch is
the most generic version of a bunch of, you know, is sorted and whatnot.
What?
Like sometimes you don't need two sequences.
You just need one sequence.
And so you can ignore the second iterator. You know, you're kind of bending, what would we call it, you're bending the algorithms in a perverse way.
That's not how they're intended to be used,
but you can spell a single iterator reduction
with a two iterator algorithm
by just ignoring one of them or discarding one of them.
And it's this idea that at the top level,
you wanna be able to spell everything.
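(A sketch of that idea, reconstructed here rather than quoted from the blog post: a single-range question like is_sorted_until spelled with the two-range std::mismatch by comparing the range against itself shifted by one.)

```cpp
// My reconstruction, not the blog post's code: is_sorted_until via std::mismatch.
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

std::vector<int>::const_iterator sorted_until(const std::vector<int>& v) {
  if (v.size() < 2) return v.end();
  auto mm = std::mismatch(v.begin(), std::prev(v.end()),  // a[i]
                          std::next(v.begin()),           // a[i+1]
                          std::less_equal<>{});           // stop where a[i] > a[i+1]
  return mm.first == std::prev(v.end()) ? v.end() : std::next(mm.first);
}
```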
And so anyways, getting back to the wait API,
if you've got different semantics
and it's doing different things, yeah,
I have nothing to add to that other than that.
I agree.
If it's got different semantics, that
sounds like it should be two different APIs, right?
And what you were talking about about discarding there
made me think of, we have this special iterator in Thrust called a discard
iterator.
And it's something that I don't think we ever really see.
We don't have in the standard library.
And it's because there's not really algorithms that need it.
And the reason is, I don't think there's
any algorithms really in the standard library that
take two outputs, are there?
There is only one, I think.
Actually, let's play a game.
It came up in a past episode, probably the listeners might know.
There's only one algorithm that outputs to two different output iterators.
Can you name it?
Probably not.
Is it one of those set ones or something?
Or like the partition or something?
It is partition copy. So partition takes a unary predicate, and everything that passes it writes to the beginning, and then
everything that fails writes to the end, so that you end up with a partitioned, kind of predicate-sorted
sequence. Partition copy does the same thing, but it's just a linear pass. Anything that passes writes to the first output.
Anything that fails writes to the second output.
Other than that, though, so yeah, potentially,
if you only care about the first predicate-passing
elements, that's just a copy_if.
You don't need to use a partition copy
with a discard iterator on the second output iterator.
We have an algorithm for that, which is why it probably doesn't exist.
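(For completeness, a generic example of that one two-output standard algorithm, std::partition_copy; this is my own example, not one from the episode.)

```cpp
// Generic example: elements that pass the predicate go to the first output,
// the rest to the second.
#include <algorithm>
#include <iterator>
#include <vector>

void split_even_odd(const std::vector<int>& in,
                    std::vector<int>& evens, std::vector<int>& odds) {
  std::partition_copy(in.begin(), in.end(),
                      std::back_inserter(evens),  // passes the predicate
                      std::back_inserter(odds),   // fails the predicate
                      [](int x) { return x % 2 == 0; });
}
```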
And, you know, I know the comedian who bombed tonight tried to make a callback that failed,
but I've got a callback in this episode that I think is going to be very successful,
which is the callback to the reduce by key example that we gave earlier, where you want to
reduce each row in a matrix.
And these by-key algorithms in Thrust that take a keys
input and a values input, they take two outputs, a keys
output and a values output.
And this is useful because in some cases, you want to keep the association of this
was the key for this group, this group of values
that I reduced, this value that's the aggregate of this group.
I want to know what the key is.
In the case of the row-wise reduction,
if our key is what row am I on, then in this output vector,
the index of each element in the output vector
corresponds to what the row was.
And so you don't need the key's output here.
And you don't want to write the key output to somewhere in memory if
you don't need it, because that's gonna waste memory bandwidth and it's
gonna waste storage. And so thrust has this discard iterator, and what a
discard iterator does is, when it's dereferenced, it returns this proxy
object that, when you assign to it, just does nothing.
So it just ignores the write.
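(The callback, completed: the earlier row-wise reduce_by_key sketch, now with thrust::discard_iterator so the keys output is never written to memory. As before, this is my own sketch and the `row_of` helper is made up.)

```cpp
// Same row-wise sketch as earlier, with the keys output discarded.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

struct row_of {
  int cols;
  __host__ __device__ int operator()(int i) const { return i / cols; }
};

void row_sums_no_keys(const thrust::device_vector<float>& matrix,
                      int rows, int cols,
                      thrust::device_vector<float>& row_totals)  // pre-sized to rows
{
  auto keys = thrust::make_transform_iterator(
      thrust::make_counting_iterator(0), row_of{cols});
  thrust::reduce_by_key(keys, keys + rows * cols,
                        matrix.begin(),
                        thrust::make_discard_iterator(),  // keys output: dropped
                        row_totals.begin());
}
```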
And because we have these by-key algorithms in Thrust,
and we also have a couple other types of algorithms
that take two outputs, and because of these,
we have this discard iterator in Thrust.
And I think it's most commonly used with the by-key algorithm.
So there's a few other places where it's used.
But until we introduce by-key algorithms
to the standard library, it's not something
that we would need.
But this is one of the fancy iterators
that we have in thrust that we showed in the course today.
But that's just not something that's
in the standard library yet.
Because there's a bunch of algorithms in thrust
that aren't in the standard library that really ought to be like these by-key algorithms
that let you do these batched operations
and are really quite powerful.
Well, I mean, we're gonna have to revisit this topic
because, I mean, it's been too long
that I've been working on the stuff
that I've been working on that hasn't been published,
but it is.
Imminently.
Imminently this year, especially after the most recent stuff
that I've been working on.
It's gonna be out in the wild.
People are gonna be able to use it,
but two of the motivational examples that I talk about
are Unique Count, which we spent much time in Venice
many years ago.
Was it many years ago?
We don't know, but it feels like a long time ago.
But Unique Count, which is really just a zip iterator with a not equal to passed to a
reduce, which ultimately I'm not sure if we actually ever revisited that topic.
But anyways, unique count is a hand-rolled specialization of fancy
iterator plus algorithm. Transform reduce, a hand-rolled, similarly spelled version of a fancy iterator,
transform iterator, plus an algorithm reduce.
And I realized there's a third one today in Thrust
and it is tabulate, which is just a counting iterator,
AKA IOTA for my array language fans out there,
plus a transform.
And it's one of my big motivating points
is that we have these hand-rolled bespoke algorithms
that are a combination of a fancy iterator and algorithm.
But the algorithms that we provide you that do these
are limited.
It's tabulate.
It's unique count.
It's transform reduce.
And then outside of that, you have to go and compose them
yourself.
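(A paraphrase of the three equivalences just listed, not Thrust's actual implementations: each named algorithm is an ordinary algorithm fed a fancy iterator. The functor names are hypothetical, and the unique-count line assumes a non-empty input.)

```cpp
// My paraphrase of the three equivalences; not Thrust's implementation.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>

struct neighbors_differ {
  template <class Tuple>
  __host__ __device__ int operator()(const Tuple& t) const {
    return thrust::get<0>(t) != thrust::get<1>(t) ? 1 : 0;
  }
};
struct square { __host__ __device__ float operator()(float x) const { return x * x; } };
struct twice  { __host__ __device__ int   operator()(int i)   const { return 2 * i; } };

void compositions(const thrust::device_vector<int>& keys,
                  const thrust::device_vector<float>& vals,
                  thrust::device_vector<int>& evens) {
  // unique_count ~ reduce over a zip of (element, next element) compared for inequality
  auto pairs = thrust::make_zip_iterator(
      thrust::make_tuple(keys.begin(), keys.begin() + 1));
  auto differs = thrust::make_transform_iterator(pairs, neighbors_differ{});
  int unique = 1 + thrust::reduce(differs, differs + (keys.size() - 1), 0);

  // transform_reduce ~ reduce over a transform_iterator
  float sum_sq = thrust::reduce(
      thrust::make_transform_iterator(vals.begin(), square{}),
      thrust::make_transform_iterator(vals.end(), square{}), 0.0f);

  // tabulate ~ transform over a counting_iterator
  thrust::transform(thrust::make_counting_iterator(0),
                    thrust::make_counting_iterator(static_cast<int>(evens.size())),
                    evens.begin(), twice{});
  (void)unique; (void)sum_sq;
}
```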
Whereas we could be providing you with an API that does this all behind the scenes and
makes it a lot easier.
Like it's still a fantastic resource.
Don't go write your kernels by yourself.
Reach for this stuff.
But I think in the future we could be providing you a more ergonomic way to spell this stuff
that does it automatically behind the scenes.
Stay tuned 2025.
Hopefully going to get published.
So one of the reasons why we do the composed operations, one of the arguments
for things like transform reduce and transform inclusive
scan and transform exclusive scan, is that it's more efficient on some
platforms, some backends, for the parallel version to have these fused algorithms
instead of being able to... well, okay, the argument at the time, this is before we
had ranges. So the reason that we had like transform reduce was because it's more efficient
than doing a transform followed by a reduce and we didn't have a transform iterator at
the time. I think that was the argument. And there's also this claim that maybe thrust
should have a tabulate because maybe there's
some faster way under the hood that we could do transform
plus counting iterator.
I think in practice, it's not.
I think in practice, it just maps down to that.
But certainly, there were some cases
where there's something that could be expressed
as composition of other algorithms
where there's a more efficient way to do it.
However, and there's an argument that we
don't know what all of those cases are today,
and that for common use cases, it
makes sense to have a named algorithm so that maybe we come
up with some faster way of implementing it.
And there's a bunch of cases in thrust
where we know that there's a faster way of implementing
the algorithm, but we implement it through composition today
just because it's not been an important enough use case
to implement the faster path.
Even despite all of that, I think
that there's an argument to be made that instead of the big
algorithm library, which is what we have today,
we should have gone with a small, composable algorithm library.
And one of the reasons I feel this way
is that if you think about the surface
area of the algorithm library as being just
like a couple of key entry points that are in composition
on top of them, one, I think it makes it easier
to port to interesting new targets and back-ends,
especially for parallelism. But also, I think it would give people a stronger intuition about how
to compose. Like, if instead of having count, you just had... and instead of having max element,
you just had reduce with max.
I think that if people had to write reduce with max,
instead of writing max element,
people might think more about composition
in customizing algorithms
with composable iterators and ranges.
And that might mean that they would be more likely
to use reduce with some compositional thing
instead of writing their own function.
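(A tiny generic example of that suggestion: max_element spelled as a reduce with a max operator. Note it yields the value rather than an iterator to the position.)

```cpp
// Tiny generic example of "reduce with max" instead of max_element.
#include <limits>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

float max_value(const thrust::device_vector<float>& v) {
  return thrust::reduce(v.begin(), v.end(),
                        std::numeric_limits<float>::lowest(),
                        thrust::maximum<float>{});
}
```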
And so maybe we made a mistake in going with
the big algorithm library and that a smaller set
of composable primitives would have taught people
a different programming model that would have made them
think more about how to compose things and that people would spend less time writing raw for loops and reinventing
the world.
I 1000% agree with this, and I have thought a lot about, like, introducing this kind
of library, the pedagogy around it. And you know, there are these high-level generic algorithms, scan, reduce, map.
But there are other algorithms in that category.
And literally in the docs, I'm not
sure if you noticed on the hyperlinks
on the side of that screenshot I sent you,
there's a star next to, like in each category, one index maps.
There's a star at the top with map.
And then everything else is just a specialization of that.
And there's certain algorithms that like fall by the wayside
that, you know, are on my Hoogle Translate website:
map adjacent, which is what it's called in Haskell.
It was initially called adjacent difference in C++11.
Then it's called adjacent transform in C++23.
It's called prior in q.
It's called zipWith in Haskell.
It's this algorithm that falls by the wayside, and so people don't even recognize it by name.
But the point being is, teach people the generic version, give them specializations,
but let them know that if we're not providing you the specialization, it's just a custom
binary operation that you can pass to one of these things.
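(A rough C++23 illustration of that family; this is my own example and it needs a C++23 compiler: the named specialization, views::adjacent_transform, is exactly "pair each element with its neighbour and apply a custom binary operation".)

```cpp
// My own example; requires C++23 (std::views::adjacent_transform).
#include <ranges>
#include <vector>

std::vector<int> deltas(const std::vector<int>& v) {
  auto diffs = v | std::views::adjacent_transform<2>(
                       [](int a, int b) { return b - a; });
  std::vector<int> out;
  for (int d : diffs) out.push_back(d);  // adjacent differences of v
  return out;
}
```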
I see, she must turn and leave.
Do not leave.
I'm gonna put this in the episode.
Do not leave me.
Do not leave me.
I thought she was happy, folks, with her head on my chest
and getting head pets, but she could not suffer through one more word uttered about binary operations and algorithms.
So we're gonna wind this down in the next...
We said two, but I'm gonna extend it to three.
I mean, I think there's maybe a middle ground
where we could have just explicitly defined the algorithms that
are compositions as saying, these are compositions. And in fact, in thrust,
that's often how we define things. When we're teaching
thrust, or in the documentation, we explicitly say,
in fact, this is equivalent to that, and in fact it is implemented that way.
Like I commonly, when I'm talking about transform reduce in thrust, I often say,
yeah, in thrust, the speed-of-light implementation of transform reduce is
just reduce called with a transform iterator.
We do not customize the transform reduce, you know, algorithm for the CUDA backend.
We just rely on the generic one that just calls reduce with a transform iterator.
And I think telling people that, they can still
call the easy transform reduce version,
but being able to tell them that, I
think it makes people really get the compositional aspect.
It's the most important algorithm,
and we rely on the composition for that speed of light code
path.
Something that's crazy, though, and hopefully this
doesn't go to four minutes when I, we initially said two,
but then I extended it to three, but it's that,
I see, the love of my life is nicking my leg,
telling me that she wants to go brush her teeth.
We're gonna leave it all on the pod, folks.
is that tabulate is the name for counting iterator,
which is a poorly named iota, so like, the name in C++.
You think, wait, you think the counting iterator is a poorly named iota?
I think iota is a poorly named counting iterator.
The point is there's disagreement.
So here's what's crazy, folks.
In C++11, it's called iota.
In thrust, it's called sequence.
As an iterator in thrust, it's called counting iterator.
So we've got sequence, iota, counting iterator.
And then if you add the counting iterator to the transform algorithm you get tabulate. So
now we've got iota, we've got sequence, we've got counting, plus transform, equals
tabulate, and on top of that in rapids the counting plus transform iterator
was such a common pattern we created a new iterator called
the counting transform iterator.
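(A quick demonstration of the naming zoo just listed, as my own sketch: four spellings of "fill with 0, 1, 2, ..." that all do the same thing.)

```cpp
// My own sketch of the naming zoo: sequence, tabulate, counting_iterator, iota.
#include <numeric>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/tabulate.h>
#include <thrust/iterator/counting_iterator.h>

struct identity_int { __host__ __device__ int operator()(int i) const { return i; } };

void four_iotas(thrust::device_vector<int>& a, thrust::device_vector<int>& b,
                thrust::device_vector<int>& c, std::vector<int>& d) {
  thrust::sequence(a.begin(), a.end());                  // thrust's name for it
  thrust::tabulate(b.begin(), b.end(), identity_int{});  // counting + transform
  thrust::copy(thrust::make_counting_iterator(0),        // the fancy-iterator spelling
               thrust::make_counting_iterator(static_cast<int>(c.size())),
               c.begin());
  std::iota(d.begin(), d.end(), 0);                      // the standard's name
}
```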
And all of this is to say is that like,
yes, I agree that hard coding some of this,
if you got a happy path, whatever,
but the point being is you wanna do
all the different variations.
So just build the composable version.
Chef's kiss.
Final words, we're signing off.
We're in New York, we're at the Evelyn Hotel.
I'm not sure how Bryce is getting home. What are your final thoughts Bryce? Tell the people what they need to hear.
Well, I was just wondering about like, you know, okay,
what's the archaeology of why do we have sequence and tabulate and counting iterator?
And I realized, and this is a great cliffhanger, I realized we could get answers to all these questions,
because our co-worker Jared created
the library that we talk about all the time.
We could talk to Jared.
We need to talk to Jared, and we're also going to bring on Georgie, who gave, in my opinion,
what might have been the best talk (it's up there with Bryce's) from the GTC 2025 conference. We've got to get Georgie on, because of this new
hands-on: start with Thrust, go to CUB, write your own kernel, and how do you write your
own kernel that matches the speed of light that's offered by Thrust and CUB. We're going
to have them on. We're going to talk about it. Links in the show notes. Final thoughts
once again, signing off.
I was able to hold my sneeze in until you finished. All right, Bryce needs to sneeze. We're going to see you. Live from New York, it's Monday night. And she's asleep now. She has been a super trooper. Bye.
Be sure to check these show notes either in your podcast app or at adspthepodcast.com for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.