Algorithms + Data Structures = Programs - Episode 82: GPUs - Responding To Reddit Comments
Episode Date: June 17, 2022
In this episode, Bryce and Conor respond to some reddit comments on Episode 80.
Twitter
ADSP: The Podcast
Conor Hoekstra
Bryce Adelstein Lelbach
Show Notes
Date Recorded: 2022-06-04
Date Released: 2022-06-17
ADSP Episode 80: C++ Multidimensional Arrays and GPUs
ADSP Episode 80 Reddit Post & Comments
James Berrow
cpp.chat Episode: Colour Is Not Black and White
SIMT
SIMD
C++ std::mdspan
C++ std::mdarray
NVIDIA cuBLAS
NVIDIA CUTLASS
NVIDIA cuNumeric
NVIDIA MatX
C++ std::transform
C++ std::reduce
C++ count_if
J ; (raze)
APL , (ravel)
GPU SM (Streaming Multiprocessor)
ARRAY 2022
PLDI 2022
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
We're gonna now call Bryce Dr. Doolittle here, because I legitimately would like to own a horse now and go riding on said horse.
Welcome to ADSP: The Podcast, episode 82, recorded on June 4th, 2022.
My name is Conor, and today with my co-host Bryce, we respond to Reddit comments on episode 80,
where we talked about GPUs and multidimensional iterators.
So my girlfriend's dog got spayed yesterday.
Thoughts and prayers go out to the dog. But the upside is that we got her a dog bed for my apartment,
which I'm very excited about because the spot she likes in my apartment
is right under my desk.
And that dog bed is just very comfortable for my feet.
Like I'm very, very happy with this setup.
You know, having a warm dog there to keep my feet warm is also a benefit.
Yeah, I didn't think that was where that was going.
But where did you think that was going?
I thought you were going to say that you were unhappy, because, and this is where we could lose a couple listeners here, I am not the biggest dog fan.
I do know this, based upon some past conversations you and I have had.
And to redeem myself a little bit, I actually do love dogs.
I just don't like untrained dogs, which happens to be the majority of dogs.
And if you're a dog owner and you're like, well, my dog's trained.
If, for instance, I'm in an elevator and you are also in that elevator with your dog (and you think that we're losing a couple listeners here),
and you think that, oh, everyone loves dogs,
I'll just let my dog jump all over this person and start licking his pants and blah, blah, blah.
To me, that is an untrained dog or maybe an untrained owner that thinks that like their dog is the world's greatest gift, et cetera, et cetera.
I love all dogs and I am completely fine with your dogs jumping on me in elevators and other places.
I'm also very good with dogs.
Yeah, like Olivier Giroux's dogs love me.
Those dogs love me because I give the best belly rubs.
Like we go, we go, we go months without seeing each other, but then I see his dogs again
and they remember me because they remember how much I love them.
This is extremely true, because one time the both of us went over to Olivier's house for dinner, and yeah, those dogs absolutely... they were like, oh, our kindred spirit has arrived. And, like, the dogs were just laying their heads in your lap.
Yeah, yeah. So yeah, definitely Bryce is the dog person of the two of us.
I'm good, I'm great with cats, too. Don't tell the girlfriend; she does not like cats. But yeah, I'm just an animal whisperer. I'm good with horses, too, apparently, because when we were in Utah, we did horseback riding, which I've never done before. And I thought it was going to be scary. But no, apparently I am a natural with horses, too. And I now love horses.
We're going to now call Bryce Dr. Doolittle here, because I legitimately would like to own a horse now and go riding on said horse.
I mean, I could see that for you. I could see you having, like, a ranch that you visit on weekends, or, like, live in a house on the ranch, have people to look after the horses, and then visit that part of your property on the weekends.
Yeah. Did you see those pictures of the dog mugging that I tweeted?
Oh yeah, yeah, those were great. Those were great. I thought that really added a lot of color to the story.
There were a few that were even better, which were of my girlfriend, like, trying to shoo them out, but I could not tweet those, because she was in her pajamas,
and I determined that she would murder me
if I tweeted those pictures.
Well, maybe if we ever start
like a Patreon podcast model
where we charge our listeners
for the juiciest content,
which don't worry, listener,
we're never going to do that.
You mean an OnlyFans. That's what you're describing.
I mean, that's not what all the other podcasts call it, but sure. I mean, it is kind of the same transactional model, yeah, where you're paying money for a certain type of content.
But anyways, to bring it back to where I thought your story was going,
because that's where this all started.
I'm not a big fan of dogs. So when you were talking about the dog being under your desk,
my first thought was obviously Bryce does not want that dog there. And so that he's happy that
he's getting this bed so that he can put the bed somewhere else. So the dog will no longer want to
go under the desk. You know, win-win. And then when you said, and it's fantastic because it's very fluffy and it makes my feet comfortable, I was like, damn, I got that one wrong.
Regardless of whether or not you own a dog, I highly recommend that everybody go get a dog bed to put your feet on when you're sitting at your work desk, because it's amazing. And then once you've got the dog bed, you may as well get a dog too.
So yeah. But if you thought that I was unhappy with anything that my girlfriend's dog was doing, do you think that I would mention it here, in a place that she could theoretically have access to?
I mean, you did say last episode, or was it two episodes ago, that neither she nor your mother listens.
Yes, yes, those are both true points. But, well, I guess we're on Apple Podcasts, so my mother could get access. Yeah, she doesn't... my mother does not have the attention span to put up with all of the tech nonsense and random weird tangents that we have, to get far enough to hear any of the things that she may take offense at.
It's true, it's true. Even when there's, like, non-technical episodes, sometimes I'll send them to my family and no one listens, because they just don't care.
They're like, Conor, we hear your voice enough when you come by, like, one week out of the year.
That's way too much already.
That's not actually entirely true about my mother.
Throughout my career, my mother has tried very hard to understand what it is exactly that I do.
So she might give this a listen. But I just don't... I think that if she did, she would start at episode zero. I don't think she would start at the latest episodes. Or, like, even if she did, she's in Italy right now, so I wouldn't need to worry about her listening to this particular episode. But she might get through some of them.
I do not think she would get through any episodes.
That's true.
Prove me wrong, mom.
Prove me wrong.
Maybe one day we'll release
like the most technically titled episode
and then we'll just talk about
like the juiciest gossip,
like family gossip
and hope that none of our families listen to it.
Anyways, we need to get back to these. So, you've been waiting for a week for us to talk about these Reddit comments, and I'm going to read the first one. Conor edited this together in a way that did not...
No, like, I did not, I just cut it straight in half. I read some other Reddit comments where people were talking about what podcasts would replace CppCast.
And there were some people who suggested ours.
And there were some people who were like, yeah, but, like, half the time they just go off on tangents.
That's true.
And Bryce is too.
Oh, God.
What was the word?
It wasn't demure.
It was another word.
Nonchalant.
Nonchalant. Bryce is too nonchalant on the podcast.
Nonchalant?
Yes, which is a very true assessment, I think.
We're supposed to be – I don't know what the opposite of nonchalant – I know what nonchalant means, but I don't think chalant is a word.
But like this is – listen, listen.
I'm all about constructive criticism, whoever made that comment.
But this is our podcast.
No, chalant is a word.
Is chalant a word?
But, well, no, no, no, I don't think it's a real word.
Wiktionary defines it as "not nonchalant," which is about the dumbest definition I've ever heard.
I mean, I think that's the beauty of our podcast. I mean, that's the thing: I kind of thought that it's a real shame that CppCast disappeared, because they're kind of, in my mind, like this voice, or like, the core, where, you know, a couple of news articles, that's a good C++ podcast, you know?
The one that's, like, the actual quality podcast.
Yeah, yeah. They have, like, a format, as opposed to whatever this is.
Yeah, we are not that. And that's the thing, I was like, should we start doing a couple news articles? And then my first thought was just, no, I don't want to do that. If we did, we'd have to start another podcast for that purpose.
Not this one.
I like the fact that we got three minutes into our last episode 81 and we were like,
all right, we're going to talk about Topic X.
And then I was like, oh, wait a second.
We got to do some corrections.
And now that I've said that.
Yeah, I was going to stop you.
I was like, what, what are you doing?
Like.
Technically, I won't go read the other one, but that was only one thing I needed to mention.
There's a list of them, but we'll do that on another episode.
And Bryce still has to go to the gym.
And eat food.
But anyways, go on.
Listen, the nonchalance is here to stay, even though I don't think that's the right word.
Nonchalant is kind of like adjacent to...
They didn't say you were nonchalant, they said I was nonchalant.
But doesn't nonchalant... I think it's adjacent to, like, apathetic. It's like, you know, you're not, you know...
Yeah. Don't you think you're nonchalant on this? I think you get overwhelmingly excited and blow people's eardrums out with your...
I am sufficiently seasoned that I have learned to not care about a number of things that I deem not sufficiently important.
Whereas you – maybe it's not seasoned.
I'm traumatized enough.
No, that's not the right word.
I've seen enough things.
I've seen enough things that I have learned to be like, yeah, you know, I don't really care about that particular thing. Whereas you, you're like, uh, a young schoolgirl. You're just always so excited by the world and the singing of the birds, the blossoming flowers. You've got to stop to smell every rose. And I'm just like, yeah, whatever, let's get to the important things here.
So yes, all right, it's a fair criticism. I don't know how to take being called a young schoolgirl, but sure, I'll take it as a compliment. Anyways, we're like 20 minutes into our 30-minute episode, talking about comments on episode 80 in episode 82. Here's our first
comment from James Berrow. Hopefully I'm pronouncing that correctly. The discussion around GPU iteration order feels a little like something's being lost in translation. If you have a 2D array in row-major order and you want to sum column-wise, each thread will be assigned to one element in a row and then iterate down columns. So if you consider the thread index to be your loop variable, it's not all that different. The mismatch here, I suspect, is actually in the definition of a thread. GPU threads are essentially one lane in a SIMD architecture. That's S-I-M-D.
So I think the mix-up here is essentially definitional in nature. In a CPU architecture, you'd give each thread a chunk of data to process, loading rightwards slash contiguously slash row-wise with SIMD, and then summing downwards, which isn't different to a GPU.
It's just terminological at that point.
In a language like OpenCL, running on a CPU, each thread maps to one SIMD lane, and it's very much like programming a GPU.
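Here is a minimal sketch of the iteration pattern the comment describes, assuming a row-major matrix stored contiguously in a std::vector; the function and parameter names are purely illustrative, not anything from the episode.

```cpp
// A minimal sketch of the pattern described above, assuming a row-major
// rows x cols matrix stored contiguously. On a GPU, each value of j would be
// a separate thread (a SIMT lane); on a CPU, j is just the loop variable.
#include <cstddef>
#include <vector>

std::vector<double> column_sums(const std::vector<double>& a,
                                std::size_t rows, std::size_t cols) {
    std::vector<double> sums(cols, 0.0);
    for (std::size_t j = 0; j < cols; ++j)      // one "thread" per column
        for (std::size_t i = 0; i < rows; ++i)  // walk down that column
            sums[j] += a[i * cols + j];         // row-major: element (i, j)
    return sums;
}
```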
He's absolutely right, but on a GPU, every lane is an independent thread. And now with modern NVIDIA GPUs, truly an independent thread. But even with this sort of joined, vector-style execution for operations
where all the threads are at the same place and doing the same thing,
the way that you write the code is different
from how you'd write the code on the CPU.
And as Olivier used to describe it,
you know, the GPU compiler,
in the GPU compiler, we vectorize so late,
we vectorize in the hardware.
And, you know, the difference in the programming model
does change how you write the code.
Like, under the hood, the way that you optimally process things may be the same, but the way that you write the code, and the way that you logically move through the data and the code in each thread, is a bit different.
Awesome. Well, James Berrow, you heard it here: you're 100 percent correct, but that doesn't mean programming to each of these,
a GPU and a CPU, is the same.
Moving on to comment number two.
I think James was the best comment.
But yeah, we will move on to comment number two.
Really enjoyed.
This is from Kemanachi.
Don't know who that is because it's Reddit.
Really enjoyed this episode.
A lot of the multidimensional reordering just sounded like flattening a matrix, quote unquote, to me.
I believe APL calls it razing a matrix.
I think razing is J terminology.
APL calls it raveling, but same difference.
Also, the way Bryce described how GPUs work sounded pretty intuitive.
I think a great diagram or animated visual might help clearing that up within the array
language community.
This episode has me even more interested in seeing how MD-SPAN and multidimensional iterators
will work on GPUs in the future.
Comments?
I'm not sure I understand the flattening part. I mean, it is certainly true that I think almost all ways of performing operations on multidimensional structures do tend to lead to some flattening, you know, in some way, shape, or form.
And by that, I mean that you move through the data in some sequence in time, and that is a linear thing that you do. But I'm not sure that I understand the flattening, what he's talking about, the flattening in this context.
So flattening, I think what he's probably referring to is when you described the two different ways you can index when dealing with the 2D matrix. I can't remember if it was the first or the second, but one of them was just, you know, looping from zero to your dimensions times your dimensions and then doing some modulus arithmetic. That, in a sense, is basically just looping, if you're storing your array contiguously in memory, from the beginning to the end, and then behind the scenes you're doing some modulus arithmetic to get I and J.
Yeah, exactly.
Yeah.
And like, as we discussed in that episode, that approach is usually not the right way to do things.
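A minimal sketch of that flattened indexing, assuming a row-major rows-by-cols matrix stored contiguously; the function name and the scaling operation are just illustrative. The division and modulus that recover i and j are exactly the reconstruction cost discussed next.

```cpp
// A minimal sketch of the "flattened" loop described above, assuming a
// row-major rows x cols matrix stored contiguously. The single linear index k
// runs from the beginning of storage to the end, and (i, j) is reconstructed
// with division and modulus behind the scenes.
#include <cstddef>
#include <vector>

void scale_all(std::vector<double>& a, std::size_t rows, std::size_t cols,
               double factor) {
    for (std::size_t k = 0; k < rows * cols; ++k) {
        std::size_t i = k / cols;  // row index recovered from the linear index
        std::size_t j = k % cols;  // column index recovered the same way
        a[i * cols + j] *= factor; // (i, j) maps straight back to k here
    }
}
```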
Yeah, because there's information loss when you go from a multidimensional position to a linear index into storage. You can reconstruct the multidimensional index from the linear location in storage, but that's often an expensive operation. And that's maybe something that we didn't get into last time.
And this is sort of something that's core to the mdspan model. This is core to a lot of matrix models, but mdspan illustrates it particularly well, because an mdspan is a non-owning view into something. So when you construct an mdspan, you give it a pointer to some data somewhere. In the mdspan model, that data is just linear data; it's not inherently multidimensional. The mdspan's layout is what maps multidimensional indices, from some multidimensional space, into single-dimensional indices into that storage. And that storage array, it may be contiguous, it may not be contiguous. In a lot of the simple cases you can think of it as being contiguous. Or, let me rephrase that: the backing storage is contiguous, but not every element in that contiguous block of storage may be mapped to from one of the multidimensional indices.
But so you have this operation in mdspan that's defined by the layout,
which is the operation where you give it a multidimensional index
and it gives you back a single-dimensional index,
which is the location in storage.
And so when you're iterating, your iterator can choose to store one of those two things,
either the multidimensional index, the thing that you would give to this layout function,
or the storage index. And the problem is that the layout function that does this mapping is usually pretty quick. There are some cases like sparse where it may be more expensive,
but let's just assume that it's pretty quick. Especially in dense cases, it's almost always
pretty quick. But the inverse operation of that going from storage index to multidimensional index,
that's almost always slow. And at the very least, it's almost always slower than the other way
around. And so it's almost always wrong to store the storage index and discard the multidimensional index if you ever think that you're going to need to reconstruct and reuse that multidimensional index.
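A minimal sketch of that layout mapping, using C++23 std::mdspan and assuming a toolchain that ships the <mdspan> header (the reference implementation lives under std::experimental instead). The mapping call is the cheap, forward direction; the inverse is the expensive direction being discussed.

```cpp
// A minimal sketch of the layout mapping described above, using C++23
// std::mdspan; assumes a toolchain that provides <mdspan>.
#include <cstddef>
#include <mdspan>
#include <vector>

int main() {
    std::vector<double> storage(6);
    // A 2 x 3 view over the linear storage; the default layout is
    // layout_right, i.e. row-major.
    auto m = std::mdspan(storage.data(), 2, 3);

    // Forward direction: multidimensional index -> storage index. Cheap.
    std::size_t where = m.mapping()(1, 2);  // 1 * 3 + 2 == 5 for layout_right

    // The inverse, recovering (1, 2) from 5, is the expensive direction,
    // and for some layouts it is not even uniquely defined.
    m[1, 2] = 42.0;
    return static_cast<int>(where);
}
```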
Yeah, we're probably going to end up having a bunch of conversations around this topic
based on where my career is going.
Because I've been thinking about this ever since starting to work at NVIDIA, which was now back in 2019.
We have, like... and this is getting very NVIDIA specific, so, you know, I apologize to the listener if you care less about this topic. But there's a bunch of different things here, and some of it is C++. So in C++, std::mdspan and std::mdarray are coming, and in terms of the libraries that NVIDIA offers, there's a bunch of different ones: we've got cuBLAS, we've got MatX, we've got, you know, Python front ends to C++ backends that are sort of NumPy copies. And I've always wondered what's actually the most efficient backend to a kind of array language. Because all the indexing that you're talking about, it doesn't actually matter all the time. For scalar operations, as they're called in array languages, like when you are just adding a scalar to your matrix...
Exactly. Yeah.
It makes no difference whether you're storing it linearly in memory or structurally in memory, because if it's stored linearly, you're just basically doing a transform, a C++ std::transform, on every single element.
And so there's another aspect here, which is quite significant.
When you're iterating through multidimensional data, you're usually either doing something element-wise, or the second case.
And I'm just going to call it the second case for right now.
Let's focus on that element-wise case.
That's the case of, like, you're adding a scalar to every element, or, you know, you wanted to do something like a std::reduce on every element.
That's something we talked about the last time.
Now...
Wait, pause.
What do you mean by a std::reduce?
Because assuming every element is like an integer, what does a reduction on a single integer do?
I'm like, if I have a multidimensional... if I have a 2D matrix and I want to sum up, you know, all of the elements of the matrix.
Oh, so you mean, regardless of the rank of your matrix, if you want to perform a reduction across every single rank down into a single scalar, or a count_if, if you want to count all the non-zeros in your matrix. Okay, so basically any operation where the structure, the rank, or the shape doesn't impact it, whether you're reducing down to a single element or you're doing an element-wise operation on every element. Okay, gotcha.
And importantly, I think these operations almost only make sense if they're all independent of each other. Like, you can imagine a std::reduce being a useful thing: I want to sum up all of the values in this matrix, or I want to count all the non-zeros in this matrix.
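A minimal sketch of those shape-agnostic operations, assuming the matrix is stored contiguously; the names and sizes are just for illustration. Neither the sum, the count_if, nor the element-wise scalar add ever needs to know the rank or the shape, only the flat range.

```cpp
// Shape-agnostic operations over a 2 x 3 matrix stored as a flat range.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

int main() {
    std::size_t rows = 2, cols = 3;
    std::vector<double> a(rows * cols, 1.0);  // a 2 x 3 matrix, stored flat

    double sum = std::reduce(a.begin(), a.end(), 0.0);  // sum every element
    auto nonzeros = std::count_if(a.begin(), a.end(),
                                  [](double x) { return x != 0.0; });

    // Element-wise work is the same story: add a scalar to every element
    // without ever computing an (i, j) index.
    std::transform(a.begin(), a.end(), a.begin(),
                   [](double x) { return x + 5.0; });

    return static_cast<int>(sum) + static_cast<int>(nonzeros);
}
```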
How about adjacent difference?
Yeah, basically you're saying you need a monoidal binary operation where you can do it in any order.
Now, the reason that adjacent difference isn't a particularly useful operation to apply here is because you're taking the difference between, you know, two adjacent elements, but what does adjacent mean in this context?
And so this first class of iterations: you want to do something to every one of the elements, and that thing is kind of independent of the other elements, or at the very least it doesn't care about the position of one element relative to another. The key thing here, the key insight, is that these are operations on elements.
The other class of things that you want to do to a matrix, I will posit, is where you really do need to consider multiple elements that have some relation to each other, and then you're going to care about indexes, not about the particular elements.
And so the example of this: adjacent difference, for the 1D case, is basically a simple stencil operator. But anybody who's done any amount of numerical programming knows that if I'm writing some sort of stencil loop that's doing some numeric computation, I want to be able to do, like, the value at (i, j) plus the value at (i+1, j) plus the value at (i-1, j), something like that. Now, the distinction here is, in the second case of things, you care about iterating indexes. So I will posit that the two different categories of iteration that you might want to do over a multidimensional structure are: one, element-wise, by which I mean iterating over all of the elements; or two, something where you're iterating through an index space, not particular elements.
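A minimal sketch of that second category, assuming a row-major rows-by-cols matrix and an output buffer already sized to rows * cols; the names are illustrative. The loop walks an index space, and each output needs neighbours at (i-1, j) and (i+1, j), so the indices, not just the elements, drive the computation.

```cpp
// A minimal sketch of an index-space iteration: a 1D-in-j, 3-point-in-i
// stencil over a row-major rows x cols matrix.
#include <cstddef>
#include <vector>

void stencil(const std::vector<double>& in, std::vector<double>& out,
             std::size_t rows, std::size_t cols) {
    auto at = [&](std::size_t i, std::size_t j) { return in[i * cols + j]; };
    // Interior rows only, so the (i-1, j) and (i+1, j) neighbours always exist.
    for (std::size_t i = 1; i + 1 < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[i * cols + j] = at(i, j) + at(i + 1, j) + at(i - 1, j);
}
```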
So the couple of translations that I would give for the array language listener: your element-wise operations, we would separate into two different categories, the shape-preserving ones and then basically the reduction version. For the reduction version in an array language, what you would do is, well, we heard the word raze earlier, which is the J term, or ravel, which is the APL term. When you ravel, say, a matrix that has two dimensions, so a shape of length two, you know, a two-by-three matrix, it actually, to my knowledge, doesn't change anything in memory. It just destroys the shape, because everything's stored linearly in a raveled form anyway. So a ravel, which you might think is an expensive operation, given the way these arrays are stored, you basically just change the shape to the product of all the elements of your shape, your dimensions, and then you don't actually have to change anything in memory. And then you perform a reduction. And then I think when you say indexes, I typically think more of sub-shapes, but it's the same idea, like where I'm performing something on a certain rank, which equates to just, you know, an (i, j) index.
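A minimal sketch of that point about ravel, assuming a contiguously stored, row-major array; the Array struct and the ravel function here are purely illustrative, not any particular library's API. Ravelling just replaces the shape with the product of the extents and never touches the data.

```cpp
// Ravel as a pure shape change over contiguous storage.
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

struct Array {
    std::vector<double> data;        // contiguous, row-major storage
    std::vector<std::size_t> shape;  // e.g. {2, 3} for a 2 x 3 matrix
};

Array ravel(Array a) {
    std::size_t n = std::accumulate(a.shape.begin(), a.shape.end(),
                                    std::size_t{1}, std::multiplies<>{});
    a.shape = {n};  // one axis whose length is the product of the old extents
    return a;       // a.data is untouched
}
```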
But so, was there a follow-up point to that?
The point I was getting at is that in a very concrete C++ space, I think there are two categories of useful iterators that you would have to a multidimensional structure.
The first category of iterator is an iterator that gives you values and does not tell you what
index the things live at. And there is no way to get the index out from it. But that doesn't matter because the way that you want to use this is to do something element-wise to each one of the elements of it.
And then there is a second class of iterator, which is not an iterator over elements, returning you the elements of the matrix, but an iterator that produces indexes that you would then use to index into the matrix.
And there is both a programming model difference here and also a significant performance difference, because if we only have to build efficient index iterators, that's a much more tractable problem than building efficient element iterators from which we can recover the multidimensional index, for a variety of reasons that have to do with how compiler vectorizers work, that I do not remember all the details of, and even if I could, would not get into in depth on this podcast, because that would be hairy even for us.
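A minimal sketch of what such an index iterator could look like, using C++23 std::views::cartesian_product (compiler support may vary); the function name and the diagonal-sum logic are just illustrative. The range yields (i, j) pairs, and the caller uses them to index into the matrix.

```cpp
// A range of (i, j) index pairs that drives the iteration, rather than an
// iterator over the matrix elements themselves.
#include <cstddef>
#include <ranges>
#include <vector>

double diagonal_sum(const std::vector<double>& a,
                    std::size_t rows, std::size_t cols) {
    auto indices = std::views::cartesian_product(
        std::views::iota(std::size_t{0}, rows),
        std::views::iota(std::size_t{0}, cols));
    double sum = 0.0;
    for (auto [i, j] : indices)    // the index, not the element, drives this
        if (i == j)
            sum += a[i * cols + j];
    return sum;
}
```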
Well, I have more questions, but we're going to let Bryce go exercise and eat food.
Oh, there were two more comments. There were two more comments. Oh, there was only one more comment. We were at comment number two, and very quickly, the last comment was a response to the previous one by Kemanachi. Greg CPP was responding to a sentence in the last comment, which was: I think a great diagram or animated visual might help clearing that up with the array language community.
And Greg CPP responded to that.
Yes, exclamation mark.
I'm not a GPU programmer, but I would really appreciate a good reference for how GPU programming
works at a conceptual level, not how to program in CUDA or with any other tools, but how they
work generally.
For example, I believe that most GPUs have a CPU user space,
quote unquote, driver, a CPU kernel space driver,
and code on the GPU proper,
but I have no idea what the responsibilities of each are.
For example, when you dynamically allocate GPU memory,
where does the allocation algorithm run?
Or when the vendors talk about GPU cores,
what does that really mean?
Is that answerable in a couple minutes?
Well, I'm going to work backwards. So GPU cores, that's often a marketing term. On NVIDIA GPUs, we have an actual execution unit called an SM, and I think of that as being the closest analogy to a core. On GPUs, we virtualize a whole lot of threads, and SMs can be heavily oversubscribed with threads. And in fact, that's how they like to operate, because the way that GPU SMs are designed to work, while they are waiting for data for one group of threads, they can go and execute the next group of threads. And, you know, this is very heavily pipelined. So,
you know, you want to do what's called latency hiding, where you never actually have to wait
for something because there's always more work available. So, you know, you maybe have one group
of active threads, and then you have, you know, 30 or 60, and those may be slightly excessive numbers, say 4 or 16 other groups of threads which you've initiated loads on or stores on, or some operation that needs to wait on memory traffic. And, you know, those are sort of pipelined. And so as one of those loads comes in, then that group of threads becomes available, and then you go and you execute the compute code until it hits the next memory load. And then by that time, somebody else has become available to start computing. So they are sort of designed for this barrel threading, these SM cores.
I don't remember what the rest of that question was. It seemed like it was more, you know, this plea, a desire to understand the basics of the GPU compute model, which is something that I too have a desire for. As we talked about last time, I'm by no means a GPU expert. I will say, time permitting,
it is likely that I will scrap my original plan for my Array 2022 keynote, which was just to
recycle a good chunk of my existing slide deck on C++ standard parallelism.
But I'm thinking about just scrapping that plan and instead just solely talking about
MD-SPAN and talking in greater detail about the problems related to multidimensional iteration
in C++ and in compilers.
However, that will require...
The time that I have between now and
the start of that keynote is quite limited and made more limited by the fact that I have lots
of other things going on. And I have a friend coming into town to visit. Well, I can say it: Han is coming to visit next week. And yeah, we'll see whether I have time to do that.
I do also technically have a day job where I have to do things.
I mean, you could make a strong argument that giving a talk about multidimensional iterators
and MD-SPAN to the academic community of array language and multi-dimensional array libraries could be of immense value,
especially if you're focusing on how it impacts programming to a GPU, to NVIDIA. So, you know, it could be a part of your day job.
Yeah, yeah, yeah. But it requires more work than I may have time to do. So we'll see. We'll see. If you get a bunch of meeting cancellation notices from me in the next 24 hours, it's because I've decided to execute on this and have decided that we are postponing all meetings until post...
I assume you're talking to me and not the listener.
No, I mean, some of the listeners are on, you know, meetings that I would send those cancellation notices for.
Gotcha, gotcha. And if I send those meeting cancellation notices, I'm going to put in the note: listen to ADSP for details.
I mean, this won't be out until after.
Right.
People will have to wait a week or two to find out why they didn't get to know.
Listen to episode 82, which will be released on June 17th or something like that.
All right.
We'll wrap it up here.
Thanks for listening.
We hope you enjoyed and have a great day.