Algorithms + Data Structures = Programs - Episode 258: 🇳🇴 An Algorithm Taxonomy (Serial, Parallel, Cooperative)
Episode Date: October 31, 2025
In this episode, Conor and Bryce record live from Norway! Bryce explains the taxonomy of algorithms: serial, parallel, and cooperative!
Link to Episode 258 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter
Date Recorded: 2025-09-23
Date Released: 2025-10-31
MPI
IPC
Row-wise Softmax in Triton
Row-wise Softmax in Parrot
CCCL - Parallel and Cooperative Algorithms
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
There are three types of algorithms. There are serial algorithms, and there are parallel algorithms. The third category, the third thing, is a cooperative algorithm.
I guess I was in the nascency of my parallel thinking, because before I joined NVIDIA, I was a serial thinker. I only knew about category one, and then I joined NVIDIA. And it probably took me a year or two, really, to convert to parallel algorithm thinking. But I still have not developed cooperative algorithm thinking.
The beautiful thing about the parallel algorithms level is the seemingly sequential nature.
The seemingly sequential nature is key.
When I do reach for CUB, I reach for device algorithms.
Welcome to ADSP: The Podcast, episode 258, recorded on September 23rd, 2025. My name is Conor, and today, with my co-host Bryce, we record live from Norway right before NDC TechTown. In this conversation, Bryce explains his taxonomy of algorithms: parallel, serial, and cooperative.
What's your mental model for the taxonomy of algorithms? Because, I guarantee you, because of Thrust, I feel like our listeners can relate, even if they've never touched Thrust, because a lot of our listeners are C++ people. And even if they're not C++ people, they're Rust people, they're Python people. They know the equivalents of a rotate, a copy, or whatever. But then, as soon as you drop down into block-aware algorithms... So what's the taxonomy? And we've got to start there, because you lost me, and probably some of our listeners.
Bryce looks excited, folks.
There are three types of algorithms. There are serial algorithms, in which a single thread calls the algorithm, a single thread performs the algorithm, and the result is returned to a single thread. That's what we're all familiar with.
Yes.
Then there are parallel algorithms, in which...
Let's start with cooperative algorithms.
No, we'll start with parallel algorithms. It'll make more sense to people here. There are parallel algorithms, where you have the appearance of serial execution in a parallel algorithm. In our nomenclature, in, like, Thrust and in the C++ standard parallel algorithms, you have a single caller, a single thread, that invokes the algorithm. The algorithm then creates or acquires parallel execution resources, and then runs the computations on those parallel resources, that collection of threads. And then that produces a result, and that result is returned to the caller. In Thrust and in the C++ standard algorithms, it is returned synchronously. It's just the return value of the invocation, and so it's completely seemingly sequential. But you could also imagine that this model could also apply to an asynchronous API, something that returns a future or a sender. But the idea is that there's a single caller; a single thread initiates the operation.
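For listeners who want to see this in code: a minimal sketch of the "single caller, seemingly sequential" parallel category, assuming CUDA and Thrust are available (the example is illustrative, not from the episode):

```cuda
// One calling thread invokes the algorithm; Thrust acquires parallel
// execution resources internally and returns the result as an ordinary
// return value. To the caller this looks exactly like serial code.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
  thrust::device_vector<int> v(1000, 1);
  // Single caller, parallel execution, synchronous result.
  int sum = thrust::reduce(v.begin(), v.end(), 0);
  return sum == 1000 ? 0 : 1;
}
```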
The third category, the third thing, is a cooperative algorithm. In a cooperative algorithm, multiple threads decide to participate. So multiple threads call the primitive at the same time. And no execution resources are created or acquired by a cooperative algorithm. Instead, the calling threads that have called into this primitive, they donate their resources.
So, like, examples of this: if people are familiar with MPI programming, like MPI collectives, like an MPI reduce or, you know, an MPI barrier, those are collective algorithms where all of the different ranks call it together. All of the different processes in a distributed computation, which are called ranks in the distributed world. Or, you know, if you're doing, like, shared-memory IPC, like you've got interprocess communication, like you're using shared memory to have multiple processes communicate, and they're all invoking the same thing, they're all participating in some collective operation together. That's a cooperative algorithm.
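A minimal sketch of such a collective, assuming an MPI installation (the program is illustrative, not from the episode):

```cuda
// Every rank calls MPI_Reduce together and donates its value to the
// reduction. No rank creates new execution resources; the callers
// themselves are the resources.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int local = rank;  // each rank contributes one piece of the input
  int sum = 0;
  // All ranks must make this call; the result lands on rank 0.
  MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) std::printf("sum of ranks = %d\n", sum);
  MPI_Finalize();
}
```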
In the case of CUDA, you launch these kernels, which are parallel functions that are executed by all the threads at once. Within those kernels, you can invoke a cooperative algorithm, like a reduction, for example, where a bunch of threads all decide to participate in the reduction. Instead of having an input array, the inputs are distributed among the threads, and each thread contributes a value into the reduction, or maybe it has reduced some values locally. But each thread gives one piece of the input to the reduction, and all of the threads have to contribute their piece of the input and then contribute to the calculation. And these cooperative algorithms are the building blocks that we use to implement the parallel algorithms.
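A minimal sketch of a cooperative algorithm inside a CUDA kernel, using CUB's BlockReduce (the kernel and its block size are illustrative assumptions, not from the episode):

```cuda
// Every thread in the block calls the primitive together and donates
// its value; no new execution resources are created by the algorithm.
#include <cub/block/block_reduce.cuh>

constexpr int BLOCK_THREADS = 128;

__global__ void block_sum(const int* in, int* out) {
  using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // Each thread contributes one piece of the input.
  int thread_value = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

  // All BLOCK_THREADS threads must call Sum() together; the calling
  // threads themselves perform the reduction cooperatively.
  int block_total = BlockReduce(temp_storage).Sum(thread_value);

  // Only thread 0 of the block holds the valid result.
  if (threadIdx.x == 0) out[blockIdx.x] = block_total;
}
```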
And I was speaking in terms of the cooperative level, because that's how I often think about algorithms these days. I suppose that's always been how I've thought about algorithms; I think that's how I default to thinking about it. You default to thinking about it at the serial or the parallel level. I always think about it at this level, but then, when we converse, I often raise it up to the level that you're used to thinking about it at. And you're usually thinking about it at a higher level, a functional level, and then lowering it down for me.
Well, so what I'll say, folks, is this is an inflection point in the podcast. When you... I mean, actually, I don't know when you started listening, but back when we started this podcast in 2020... Was it '20? Was it '19? Right, '19, I think. It was definitely... Well, actually, I think we said this before, and then we looked it up, and it was like, oh no, it was actually in the midst of the pandemic. But I thought we decided we were going to start it right after we went to Belfast, and that was pre-pandemic. That was 2019. But we're about to find out, folks. 252 episodes...
2020. I was wrong. I knew I was wrong, because we've done this before. November 2020, which was post-pandemic. I think we talked about it at Belfast, but then I didn't start it until a year later. Anyways, when we started recording this, I guess I was in the nascency of my parallel thinking, because before I joined NVIDIA, I was a serial thinker. I only knew about category one, and then I joined NVIDIA, and it probably took me a year or two, really, to convert to parallel algorithm thinking. But I still have not developed cooperative algorithm thinking, because I primarily work in Thrust, and when I do work in CUB, I feel like it's the, uh, Captain Morgan... you know, "when I do drink, I drink..." whatever. "He is the life of parties he has never attended. If he were to punch you in the face, you would have to fight off the strong urge to thank him. Sharks have a week dedicated to him. He is the most interesting man in the world. I don't always drink beer, but when I do, I prefer Dos Equis."
Stay thirsty, my friends.
When I do reach for CUB, I reach for the device algorithms, which are the equivalent of Thrust. The only algorithms I reach for are the device segmented reduce and device fixed-segment reduce, et cetera.
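A minimal sketch of the kind of device-level CUB call mentioned here, cub::DeviceSegmentedReduce::Sum (the wrapper function and buffer names are illustrative; error handling is omitted):

```cuda
// Like Thrust, CUB's device algorithms are invoked from a single
// calling thread. They use a two-phase pattern: first query the
// temporary storage size, then run the algorithm for real.
#include <cub/device/device_segmented_reduce.cuh>
#include <cuda_runtime.h>

void segmented_sum(const int* d_in, int* d_out, int num_segments,
                   const int* d_begin_offsets, const int* d_end_offsets) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  // Phase 1: with d_temp == nullptr, only temp_bytes is computed.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                  num_segments, d_begin_offsets,
                                  d_end_offsets);
  cudaMalloc(&d_temp, temp_bytes);
  // Phase 2: the actual segmented reduction.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                  num_segments, d_begin_offsets,
                                  d_end_offsets);
  cudaFree(d_temp);
}
```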
Bryce has something to add.
When we started the podcast, I was at the place that you are currently at. I was a few years into NVIDIA, but I was still thinking at the parallel algorithm level, not at the cooperative algorithm level. Or maybe, to some degree, I was starting to think at both. But now, five years later, I think about things first from the cooperative algorithm level and then from the parallel algorithm level. And I think you are very close to jumping off the deep end because of the project that you're working on. I think that a year or two from now, you're going to be where I am now, because right now, you are where I was a couple of years ago.
I can almost guarantee you that's true, because the CUB block primitives are what you use when you're writing kernels. And I am about to start doing that, folks. We've only got segmented scans and reductions, and we need more. And also, we don't need the segments. We need fixed segments. We need row-wise and column-wise, everything.
The way we want people to program is at the parallel algorithm level, because that is the level that's purest and closest to functional programming and array programming. At the cooperative level, you have to think about threads and synchronization. At the cooperative level, you're programming threads. And so it's always going to be an implementation detail for the work that we do. It's never going to be the programming model that we expose up to people. And so what you just said, that you think I'm right, that you'll have to do that because you need X, you need Y and Z: that's because you know that you need those things to be able to implement the backend, like, to be able to implement the algorithms that you want, not as a programming model that you'd expose to users.
So I think that, in terms of this podcast, we will continue to primarily talk about things in terms of the parallel algorithms level, because the beautiful thing about the parallel algorithms level is the seemingly sequential nature.
The seemingly sequential nature is key.
I mean, and this is... Bryce just admitted something that he may or may not have wanted to admit, which is that he said, you know, we want people to program at the functional or array level. And he just said it, of course, off mic. And this is the point. You know, we are C++, we are CUDA C++ programmers, developers. But the array programming model, the functional programming model, it's just easier to reason about. Everybody that hates on Haskell, OCaml, guess what? It's a better model. It's just not performant.
But guess what? We here promise to the listener that we're going to make the easy way be the efficient way. And that's the dream. That's the dream. And yes, there are trade-offs, but we're going to make it happen.
Bryce has got something to add.
It's not always the case that it's less efficient to write the functional or the array or just the higher-level code. If you write the lower-level code, you are baking in certain details, and you're making certain assumptions about, like, the nature of what you're executing on. Like, if you're programming to threads, as you do in a cooperative model, then you're making certain assumptions about quantities of threads and how those threads are going to communicate with each other. Whereas if you express things at the array slash functional level, you're expressing things purely in terms of, like, the requirements. You're not saying anything about how it needs to be implemented.
Right. Like, it's just... it's descriptive versus prescriptive, right? Array and functional is descriptive: you're describing what it does. Whereas cooperative is prescriptive: you're saying it must be done this way.
And I think the history of parallel programming has actually shown that descriptive models tend to win over prescriptive models.
There it is, folks. Array, functional: it's the future. And we'll probably talk about cooperative and block algorithms, but I currently don't understand them, so I'm here to hold your hand.
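As a concrete illustration of the descriptive style being praised here, a sketch assuming Thrust (the function is illustrative, not from the episode):

```cuda
// Descriptive: a fused "square then sum" expressed purely in terms of
// what is required — the transform and the reduction — saying nothing
// about threads, blocks, or synchronization. Contrast with the
// prescriptive BlockReduce kernel sketched earlier, which bakes in a
// block size and a communication pattern.
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

int sum_of_squares(const thrust::device_vector<int>& v) {
  return thrust::transform_reduce(v.begin(), v.end(),
                                  thrust::square<int>(),  // what to compute
                                  0,
                                  thrust::plus<int>());   // how to combine
}
```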
You've got to talk about your softmax.
I know. So, Bryce just said off mic, we've got to talk about your softmax. Guess what? We're at the 47-minute mark. And the next time Bryce and I are one-on-one recording... Tomorrow night... today is Tuesday, September 23rd. Tomorrow will be September 24th. Actually, wait, I'm making a promise I can't keep. The next time we're going to record is actually probably going to be in person at, uh, C++ Under the Sea in the Netherlands. So once we finish recording there... I mean, maybe we might steal away and talk about softmax, but it might be after C++ Under the Sea. Maybe we'll steal away, find some time to talk about softmax. But I've been working on... Ashwin posted internally about a row-wise softmax in NumPy and what his aspirations for that were. I tried to do it in... I think I can mention it now. It's...
You just heard a record scratch. Bryce made me cut everything I just said. We'll be talking about softmax and a special project that we will be announcing on the pod.
Bryce has something to say.
Avid listeners will know that I have been hinting and suggesting, at least two or three times, that eventually on this podcast, where we are clearly obsessed with AI, we are going to have to train some sort of machine learning model. We're going to have to go and build something and get under the hood. And Conor doesn't even realize that he has started down this exploration, because he's gotten nerd-sniped into implementing this softmax. Softmax, which is one of the key ML algorithms at the heart of many attention mechanisms, which is the...
I'm not saying I nerd-sniped you. I'm saying Ashwin nerd-sniped you. But softmax is at the key of many attention algorithms, such as flash attention.
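For reference, the numerically stable row-wise softmax being discussed computes softmax(x)_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x)). A minimal CUDA sketch, assuming one block per row with the row length equal to the block size (the layout and names are illustrative assumptions, not Conor's implementation), which also happens to show two cooperative reductions in a row:

```cuda
#include <cub/block/block_reduce.cuh>

constexpr int ROW_LEN = 128;  // assumption: each row has 128 elements

__global__ void rowwise_softmax(const float* in, float* out) {
  using BlockReduce = cub::BlockReduce<float, ROW_LEN>;
  __shared__ typename BlockReduce::TempStorage temp;
  __shared__ float row_max, row_sum;

  float x = in[blockIdx.x * ROW_LEN + threadIdx.x];

  // Cooperative step 1: all threads find the row maximum (stability).
  float m = BlockReduce(temp).Reduce(x, cub::Max());
  if (threadIdx.x == 0) row_max = m;
  __syncthreads();  // broadcast row_max; also allows reusing temp

  float e = expf(x - row_max);

  // Cooperative step 2: all threads sum the exponentials.
  float s = BlockReduce(temp).Sum(e);
  if (threadIdx.x == 0) row_sum = s;
  __syncthreads();  // broadcast row_sum

  out[blockIdx.x * ROW_LEN + threadIdx.x] = e / row_sum;
}
```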
Four operations. Exponential, division... I understand. But, and I told Conor this: this is cool. This will get people very interested in your project if you show them an efficient softmax. But what will get people even more interested is if you can show them how to implement flash attention in your thing. And then Conor's going to go and do that. And then, next after that, we're going to have to go and implement some sort of open-source model... inference for some sort of open-source model, in his thing. And then we're going to end up talking more and more about this. And before you know it, folks, this is going to be a machine learning podcast, and Conor's going to have implemented some machine learning models in his thing. I tell you, I predict right now: one to two years from now, that's where we'll be.
Training will come after the inference, though. Training will come after the inference.
I got nothing to say, folks. We got a conference to go to tomorrow.
Oh man, do you think you have... do you think you have the energy for it?
Starting to rally. I'm starting to rally. I was feeling drained over the weekend, but now I'm starting to rally. Today I slept until 7:30. The day before that, I was up at 5:45.
Luckily, my plan tomorrow is also to sleep in, until 7:45.
So I should be good.
I do plan to attend at least the start of the keynote.
It's not Hana.
You predicted it correctly.
At one point, was it yesterday, two days ago?
You saw that there was a second keynote that had been added.
And I kid you not, Bryce said,
hmm, I wonder if Hana can't make it anymore.
And sure enough, a day later, Hana's keynote was removed
and it was replaced by another keynote.
Anyways, how are you feeling about the conference tomorrow?
You got the energy?
I'm rallyin'. I'm rallyin'.
All right, folks.
You will hear from us in a week on Wednesday, September 24th,
live from the NDC Social.
Until then.
Be sure to check these show notes, either in your podcast app or at ADSPthePodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.
Thanks for listening. We hope you enjoyed, and have a great day.
Low quality, high quantity.
That is the tagline of our podcast.
That's not the tagline.
Our tagline is chaos with sprinkles of information.
