Algorithms + Data Structures = Programs - Episode 258: 🇳🇴 An Algorithm Taxonomy (Serial, Parallel, Cooperative)
Episode Date: October 31, 2025
In this episode, Conor and Bryce record live from Norway! Bryce explains the taxonomy of algorithms: serial, parallel, and cooperative!
Link to Episode 258 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
ADSP: The Podcast: Twitter
Conor Hoekstra: Twitter | BlueSky | Mastodon
Bryce Adelstein Lelbach: Twitter
Date Recorded: 2025-09-23
Date Released: 2025-10-31
MPI
IPC
Row-wise Softmax in Triton
Row-wise Softmax in Parrot
CCCL - Parallel and Cooperative Algorithms
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
There are three types of algorithms. There are serial algorithms, and there are parallel algorithms. The third category, the third thing, is a cooperative algorithm.
I guess I was in the nascency of my parallel thinking, because before I joined NVIDIA, I was a serial thinker. I only knew about category one, and then I joined NVIDIA. And it probably took me a year or two, really, to convert to parallel algorithm thinking. But I still have not developed cooperative algorithm thinking.
The beautiful thing about the parallel algorithms level is the seemingly sequential nature.
The seemingly sequential nature is key.
When I do reach for CUB, I reach for device algorithms.
Welcome to ADSP: The Podcast, episode 258, recorded on September 23rd, 2025. My name is Conor, and today, with my co-host Bryce, we record live from Norway right before NDC TechTown. In this conversation, Bryce explains his taxonomy of algorithms: parallel, serial, and cooperative.
What's your mental model for the taxonomy of algorithms? Because, I guarantee you, because of Thrust, I feel like our listeners can relate, even if they've never touched Thrust, because a lot of our listeners are C++ people. And even if they're not C++ people, they're Rust people, they're Python people. They know the equivalents of a rotate, a copy, or whatever. But then, as soon as you drop down into block-aware algorithms... So what's the taxonomy? And we've got to start there, because you lost me, and probably some of our listeners.
Bryce looks excited, folks.
There are three types of algorithms. There are serial algorithms, in which a single thread calls the algorithm, a single thread performs the algorithm, and the result is returned to a single thread. That's what we're all familiar with.
Yes.
Then there are parallel algorithms, in which...
Let's start with cooperative algorithms.
No, we'll start with parallel algorithms. It'll make more sense to people here. There are parallel algorithms, where you have the appearance of serial execution in a parallel algorithm. In our nomenclature, in, like, Thrust and in the C++ standard parallel algorithms, you have a single caller, a single thread, that invokes the algorithm. The algorithm then creates or acquires parallel execution resources, and then runs the computations on those parallel resources, that collection of threads. And then that produces a result, and that result is returned to the caller. In Thrust and in the C++ standard algorithms, it is returned synchronously. It's just the return value of the invocation, and so it's completely seemingly sequential. But you could also imagine that this model could also apply to an asynchronous API, something that returns a future or a sender. But the idea is that there's a single caller; a single thread initiates the operation.
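For listeners who want to see this in code: a minimal sketch of the "single caller, seemingly sequential" parallel category, assuming CUDA and Thrust are available (the example is illustrative, not from the episode):

```cuda
// One calling thread invokes the algorithm; Thrust acquires parallel
// execution resources internally and returns the result as an ordinary
// return value. To the caller this looks exactly like serial code.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
  thrust::device_vector<int> v(1000, 1);
  // Single caller, parallel execution, synchronous result.
  int sum = thrust::reduce(v.begin(), v.end(), 0);
  return sum == 1000 ? 0 : 1;
}
```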
The third category, the third thing, is a cooperative algorithm. In a cooperative algorithm, multiple threads decide to participate. So multiple threads call the primitive at the same time. And no execution resources are created or acquired by a cooperative algorithm. Instead, the calling threads that have called into this primitive, they donate their resources.
So, like, examples of this: if people are familiar with MPI programming, like MPI collectives, like an MPI reduce or, you know, an MPI barrier, those are collective algorithms where all of the different ranks call it together. All of the different processes in a distributed computation, which are called ranks in the distributed world. Or, you know, if you're doing, like, shared-memory IPC, like you've got interprocess communication, like you're using shared memory to have multiple processes communicate, and they're all invoking the same thing, they're all participating in some collective operation together. That's a cooperative algorithm.
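A minimal sketch of such a collective, assuming an MPI installation (the program is illustrative, not from the episode):

```cuda
// Every rank calls MPI_Reduce together and donates its value to the
// reduction. No rank creates new execution resources; the callers
// themselves are the resources.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int local = rank;  // each rank contributes one piece of the input
  int sum = 0;
  // All ranks must make this call; the result lands on rank 0.
  MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) std::printf("sum of ranks = %d\n", sum);
  MPI_Finalize();
}
```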
In the case of CUDA, you launch these kernels, which are parallel functions that are executed by all the threads at once. Within those kernels, you can invoke a cooperative algorithm, like a reduction, for example, where a bunch of threads all decide to participate in the reduction. Instead of having an input array, the inputs are distributed among the threads, and each thread contributes a value into the reduction, or maybe it has reduced some values locally. But each thread gives one piece of the input to the reduction, and all of the threads have to contribute their piece of the input and then contribute to the calculation. And these cooperative algorithms are the building blocks that we use to implement the parallel algorithms.
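A minimal sketch of a cooperative algorithm inside a CUDA kernel, using CUB's BlockReduce (the kernel and its block size are illustrative assumptions, not from the episode):

```cuda
// Every thread in the block calls the primitive together and donates
// its value; no new execution resources are created by the algorithm.
#include <cub/block/block_reduce.cuh>

constexpr int BLOCK_THREADS = 128;

__global__ void block_sum(const int* in, int* out) {
  using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // Each thread contributes one piece of the input.
  int thread_value = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

  // All BLOCK_THREADS threads must call Sum() together; the calling
  // threads themselves perform the reduction cooperatively.
  int block_total = BlockReduce(temp_storage).Sum(thread_value);

  // Only thread 0 of the block holds the valid result.
  if (threadIdx.x == 0) out[blockIdx.x] = block_total;
}
```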
And I was speaking in terms of the cooperative level, because that's how I often think about algorithms these days. I suppose that's always been how I've thought about algorithms; I think that's how I default to thinking about it. You default to thinking about it at the serial or the parallel level. I always think about it at this level, but then, when we converse, I often raise it up to the level that you're used to thinking about it at. And you're usually thinking about it at a higher level, a functional level, and then lowering it down for me.
Well, so what I'll say, folks, is this is an inflection point in the podcast. When you... I mean, actually, I don't know when you started listening, but back when we started this podcast in 2020... Was it '20? Was it '19? Right, '19, I think. It was definitely... Well, actually, I think we said this before, and then we looked it up, and it was like, oh no, it was actually in the midst of the pandemic. But I thought we decided we were going to start it right after we went to Belfast, and that was pre-pandemic. That was 2019. But we're about to find out, folks. 252 episodes...
2020. I was wrong. I knew I was wrong, because we've done this before. November 2020, which was post-pandemic. I think we talked about it at Belfast, but then I didn't start it until a year later. Anyways, when we started recording this, I guess I was in the nascency of my parallel thinking, because before I joined NVIDIA, I was a serial thinker. I only knew about category one, and then I joined NVIDIA, and it probably took me a year or two, really, to convert to parallel algorithm thinking. But I still have not developed cooperative algorithm thinking, because I primarily work in Thrust, and when I do work in CUB, I feel like it's the, uh, Captain Morgan... you know, "when I do drink, I drink..." whatever. "He is the life of parties he has never attended. If he were to punch you in the face, you would have to fight off the strong urge to thank him. Sharks have a week dedicated to him. He is the most interesting man in the world. I don't always drink beer, but when I do, I prefer Dos Equis."
Stay thirsty, my friends.
When I do reach for CUB, I reach for the device algorithms, which are the equivalent of Thrust. The only algorithms I reach for are the device segmented reduce and device fixed-segment reduce, et cetera.
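A minimal sketch of the kind of device-level CUB call mentioned here, cub::DeviceSegmentedReduce::Sum (the wrapper function and buffer names are illustrative; error handling is omitted):

```cuda
// Like Thrust, CUB's device algorithms are invoked from a single
// calling thread. They use a two-phase pattern: first query the
// temporary storage size, then run the algorithm for real.
#include <cub/device/device_segmented_reduce.cuh>
#include <cuda_runtime.h>

void segmented_sum(const int* d_in, int* d_out, int num_segments,
                   const int* d_begin_offsets, const int* d_end_offsets) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  // Phase 1: with d_temp == nullptr, only temp_bytes is computed.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                  num_segments, d_begin_offsets,
                                  d_end_offsets);
  cudaMalloc(&d_temp, temp_bytes);
  // Phase 2: the actual segmented reduction.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                  num_segments, d_begin_offsets,
                                  d_end_offsets);
  cudaFree(d_temp);
}
```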
Bryce has something to add.
When we started the podcast, I was at the place that you are currently at. I was a few years into NVIDIA, but I was still thinking at the parallel algorithm level, not at the cooperative algorithm level. Or maybe, to some degree, I was starting to think at both. But now, five years later, I think about things first from the cooperative algorithm level and then from the parallel algorithm level. And I think you are very close to jumping off the deep end because of the project that you're working on. I think that a year or two from now, you're going to be where I am now, because right now, you are where I was a couple of years ago.
I can almost guarantee you that's true, because the CUB block primitives are what you use when you're writing kernels. And I am about to start doing that, folks. We've only got segmented scans and reductions, and we need more. And also, we don't need the segments. We need fixed segments. We need row-wise and column-wise, everything.
The way we want people to program is at the parallel algorithm level, because that is the level that's purest and closest to functional programming and array programming. At the cooperative level, you have to think about threads and synchronization. At the cooperative level, you're programming threads. And so it's always going to be an implementation detail for the work that we do. It's never going to be the programming model that we expose up to people. And so what you just said, that you think I'm right, that you'll have to do that because you need X, you need Y and Z: that's because you know that you need those things to be able to implement the backend, like, to be able to implement the algorithms that you want, not as a programming model that you'd expose to users.
So I think that, in terms of this podcast, we will continue to primarily talk about things in terms of the parallel algorithms level, because the beautiful thing about the parallel algorithms level is the seemingly sequential nature.
The seemingly sequential nature is key.
I mean, and this is... Bryce just admitted something that he may or may not have wanted to admit, which is that he said, you know, we want people to program at the functional or array level. And he just said it, of course, off mic. And this is the point. You know, we are C++, we are CUDA C++ programmers, developers. But the array programming model, the functional programming model, it's just easier to reason about. Everybody that hates on Haskell, OCaml, guess what? It's a better model. It's just not performant.
But guess what? We here promise to the listener that we're going to make the easy way be the efficient way. And that's the dream. That's the dream. And yes, there are trade-offs, but we're going to make it happen.
Bryce has got something to add.
It's not always the case that it's less efficient to write the functional or the array or just the higher-level code. If you write the lower-level code, you are baking in certain details, and you're making certain assumptions about, like, the nature of what you're executing on. Like, if you're programming to threads, as you do in a cooperative model, then you're making certain assumptions about quantities of threads and how those threads are going to communicate with each other. Whereas if you express things at the array slash functional level, you're expressing things purely in terms of, like, the requirements. You're not saying anything about how it needs to be implemented.
Right. Like, it's just... it's descriptive versus prescriptive, right? Array and functional is descriptive: you're describing what it does. Whereas cooperative is prescriptive: you're saying it must be done this way.
And I think the history of parallel programming has actually shown that descriptive models tend to win over prescriptive models.
There it is, folks. Array, functional: it's the future. And we'll probably talk about cooperative and block algorithms, but I currently don't understand them, so I'm here to hold your hand.
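As a concrete illustration of the descriptive style being praised here, a sketch assuming Thrust (the function is illustrative, not from the episode):

```cuda
// Descriptive: a fused "square then sum" expressed purely in terms of
// what is required — the transform and the reduction — saying nothing
// about threads, blocks, or synchronization. Contrast with the
// prescriptive BlockReduce kernel sketched earlier, which bakes in a
// block size and a communication pattern.
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

int sum_of_squares(const thrust::device_vector<int>& v) {
  return thrust::transform_reduce(v.begin(), v.end(),
                                  thrust::square<int>(),  // what to compute
                                  0,
                                  thrust::plus<int>());   // how to combine
}
```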
You've got to talk about your softmax.
I know. So, Bryce just said off mic, we've got to talk about your softmax. Guess what? We're at the 47-minute mark. And the next time Bryce and I are one-on-one recording... Tomorrow night... today is Tuesday, September 23rd. Tomorrow will be September 24th. Actually, wait, I'm making a promise I can't keep. The next time we're going to record is actually probably going to be in person at, uh, C++ Under the Sea in the Netherlands. So once we finish recording there... I mean, maybe we might steal away and talk about softmax, but it might be after C++ Under the Sea. Maybe we'll steal away, find some time to talk about softmax. But I've been working on... Ashwin posted internally about a row-wise softmax in NumPy and what his aspirations for that were. I tried to do it in... I think I can mention it now. It's...
You just heard a record scratch. Bryce made me cut everything I just said. We'll be talking about softmax and a special project that we will be announcing on the pod.
Bryce has something to say.
Avid listeners will know that I have been hinting and suggesting, at least two or three times, that eventually on this podcast, where we are clearly obsessed with AI, we are going to have to train some sort of machine learning model. We're going to have to go and build something and get under the hood. And Conor doesn't even realize that he has started down this exploration, because he's gotten nerd-sniped into implementing this softmax. Softmax, which is one of the key ML algorithms at the heart of many attention mechanisms, which is the...
I'm not saying I nerd-sniped you. I'm saying Ashwin nerd-sniped you. But softmax is at the key of many attention algorithms, such as flash attention.
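For reference, the numerically stable row-wise softmax being discussed computes softmax(x)_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x)). A minimal CUDA sketch, assuming one block per row with the row length equal to the block size (the layout and names are illustrative assumptions, not Conor's implementation), which also happens to show two cooperative reductions in a row:

```cuda
#include <cub/block/block_reduce.cuh>

constexpr int ROW_LEN = 128;  // assumption: each row has 128 elements

__global__ void rowwise_softmax(const float* in, float* out) {
  using BlockReduce = cub::BlockReduce<float, ROW_LEN>;
  __shared__ typename BlockReduce::TempStorage temp;
  __shared__ float row_max, row_sum;

  float x = in[blockIdx.x * ROW_LEN + threadIdx.x];

  // Cooperative step 1: all threads find the row maximum (stability).
  float m = BlockReduce(temp).Reduce(x, cub::Max());
  if (threadIdx.x == 0) row_max = m;
  __syncthreads();  // broadcast row_max; also allows reusing temp

  float e = expf(x - row_max);

  // Cooperative step 2: all threads sum the exponentials.
  float s = BlockReduce(temp).Sum(e);
  if (threadIdx.x == 0) row_sum = s;
  __syncthreads();  // broadcast row_sum

  out[blockIdx.x * ROW_LEN + threadIdx.x] = e / row_sum;
}
```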
Four operations. Exponential, division... I understand. But, and I told Conor this: this is cool. This will get people very interested in your project if you show them an efficient softmax. But what will get people even more interested is if you can show them how to implement flash attention in your thing. And then Conor's going to go and do that. And then, next after that, we're going to have to go and implement some sort of open-source model... inference for some sort of open-source model, in his thing. And then we're going to end up talking more and more about this. And before you know it, folks, this is going to be a machine learning podcast, and Conor's going to have implemented some machine learning models in his thing. I tell you, I predict right now: one to two years from now, that's where we'll be.
Training will come after the inference, though. Training will come after the inference.
I got nothing to say, folks. We got a conference to go to tomorrow.
Oh man, do you think you have... do you think you have the energy for it?
Starting to rally. I'm starting to rally. I was feeling drained over the weekend, but now I'm starting to rally. Today I slept until 7:30. The day before that, I was up at 5:45.
Luckily, my plan tomorrow is also to sleep in, until 7:45.
So I should be good.
I do plan to attend at least the start of the keynote.
It's not Hana.
You predicted it correctly.
At one point, was it yesterday, two days ago?
You saw that there was a second keynote that had been added.
And I kid you not, Bryce said,
hmm, I wonder if Hana can't make it anymore.
And sure enough, a day later, Hana's keynote was removed
and it was replaced by another keynote.
Anyways, how are you feeling about the conference tomorrow?
You got the energy?
I'm rallyin'. I'm rallyin'.
All right, folks.
You will hear from us in a week on Wednesday, September 24th,
live from the NDC Social.
Until then.
Be sure to check these show notes, either in your podcast app or at ADSPthePodcast.com, for links to anything we mentioned in today's episode, as well as a link to a GitHub discussion where you can leave thoughts, comments, and questions.
Thanks for listening. We hope you enjoyed, and have a great day.
Low quality, high quantity.
That is the tagline of our podcast.
That's not the tagline.
Our tagline is chaos with sprinkles of information.
