Algorithms + Data Structures = Programs - Episode 1: Our Favorite Algorithms - Part II

Episode Date: November 27, 2020

In this episode, Bryce and Conor revisit std::transform_reduce and other unfinished topics from Episode 0.

Date Recorded: 2020-11-22
Date Released: 2020-11-27

Show Notes:
NVIDIA
NVIDIA GitHub
RAPIDS.ai
RAPIDS cuDF GitHub
Python pandas
std::transform_reduce
std::inner_product
thrust::transform_reduce
thrust::inner_product
Haskell zipWith
A Plan for C++23 Ranges
Conor's recent CppCast Episode

Intro Song Info:
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Transcript
Starting point is 00:00:00 Conor's too healthy for his own good. Conor is like this crazy runner. Pause. I need to go put coconut milk, or coconut milk and kale, in this curry thing. And I just want to wring his little neck. I'm bad at zero-indexing. Yeah, yeah, this is just a bit to put into the intro part. Welcome to ADSP, the podcast, episode one, recorded on November 22nd, 2020. My name is Conor, and today with my co-host Bryce, we're going to be revisiting std::transform_reduce and other unfinished topics from episode zero. The original question for episode zero was,
Starting point is 00:01:08 what is your favorite algorithm? Like, favorite algorithm singular. And we have feature creeped way past that. I think we covered like four algorithms in episode zero. So this is, I guess, part two. Yeah. So, yeah. So before we get into that, because we're going to talk more about transform_reduce, because this podcast is called Algorithms plus Data Structures equals Programs, but it's really TransformReduce equals Programs. So we're going to talk about TransformReduce,
Starting point is 00:01:33 I'm sure, a lot throughout this podcast, but today we're going to talk specifically about it a bit more. But before we do that, we should explain a little bit about our jobs and our background, because that's relevant to our passion for transform_reduce. Have we started the podcast? Yeah, this is the podcast. This is the podcast right now. All right. I had no idea we started. So what are we talking about now? We're explaining, all right, where we work. All right, you go ahead. Okay. So both of us now work at NVIDIA. That wasn't always the case. You didn't work at NVIDIA. And then I was like, Conor needs to work at NVIDIA.
Starting point is 00:02:08 But so I work at NVIDIA, and my team works on what's called our C++ core compute libraries. And one of those libraries is an algorithms library called Thrust. So in particular, it's a parallel algorithms library. And for me, the reason why transform_reduce is so exciting is because it is, I'd argue, the most powerful of the C++ parallel algorithms. And we'll talk a little bit more about that later. But you should explain...
Starting point is 00:02:42 So I'm coming from this as the implementer of the algorithm, whereas you are actually one of my users. True. So also, before I start talking, to explain to folks: I envision that at the start a high concentration or percentage of our listeners will be C++ developers, but over time that proportion will decrease as other listeners join. So I started my career as a C++ developer. I spent a year at Amazon doing a bit of C++, Java, Python, a tiny bit of Go, a tiny bit of Perl. And now I'm back to doing C++ and a little bit of Python.
Starting point is 00:03:31 So what's your background? Have you been C++ from start to finish? So in episode zero, we talked about how I flunked out of college and started teaching myself to program in C++ to create a game. And then I discovered Boost. The first algorithm I was exposed to was Boost's algorithm split. And then I quickly got distracted from the project I was working on by just contributing to Boost. And then I needed a job because I had, you know, dropped out of college. My parents were not super thrilled about me just, like, sitting in their basement. So I knew this guy on IRC, Hartmut Kaiser, and I was like, hey, I'm looking for a job. And he's like, hey,
Starting point is 00:04:19 why don't you come down to Louisiana? And then I somehow talked my parents into buying me, a 19-year-old, fairly sheltered child, a one-way ticket to Louisiana. And that worked out pretty well. So I was there for a number of years working at Louisiana State University doing high performance computing, building parallel runtimes for scientific applications that run on supercomputers. And I did that for a few years. I worked on this thing called HPX.
Starting point is 00:04:58 And then I went to Lawrence Berkeley National Lab in Berkeley, California, and I did some other HPC stuff. And then I came to work at NVIDIA. And throughout that entire time, I have almost exclusively programmed in C++. My old mentor, Hartmut, he used to say, like, there's like three acceptable programming languages: C++, JavaScript, and Python. And so that's basically all that I've ever really worked with. Maybe with the exception that I interact a bit with Fortran. I have a weird passion for COBOL, in particular the history of COBOL.
Starting point is 00:05:44 And I write a lot of Bash scripts, and I think Bash is actually, like, a pretty great programming language. So we'll save Bash for a whole episode, and we'll save the history of COBOL for another episode. But do you know why Hartmut said that those three, JavaScript, Python, and C++, are the only three? What did he say, that they're the only three that are real programming languages, or the only three you need to know? Like, in his mind, those are the three acceptable programming languages. Like, if you knew those three programming languages... Like, C++ wasn't the answer for all problems. He
Starting point is 00:06:21 admitted that. He was first and foremost a C++ person. But it wasn't always the answer for everything. And so if it wasn't the answer for something, then you should do that thing in either Python or JavaScript. And I don't particularly know the rationale. I mean, Python, it's a nice language. It's very popular in data science and in HPC, so maybe that's why. I don't know about JavaScript. I mean, I guess if you need to deal with some web technology, it's one of them. Okay. So to summarize, you're basically start to finish C++ with a little bit of Bash, Fortran, COBOL in the midst, and I'm the more of
Starting point is 00:07:08 the polyglot one, or polyglot developer, I should say. I take it back. There is one common theme between all those languages, and I suspect it's part of the reason why my mentor, Hartmut, liked them all. All of those languages, Python, C++, and JavaScript, are fundamentally multi-paradigm languages. They're not purely functional languages or purely imperative languages or purely object oriented languages. Instead, they support multiple different paradigms, which gives you, the programmer, a great deal of freedom in how you write and structure code.
Starting point is 00:07:53 And that was something that I think was always valuable to Hartmut. You know, if he goes and looks at a problem, he'll come up with a solution for it, which combines the power of functional programming, object oriented programming, and, you know, the efficiency of just writing low level imperative code. So I think that's what he probably liked about all those languages. Yeah, that's a good point. Multi-paradigm. I prefer multi-paradigm languages, because then I get to write code the way that I want. So before our tangent into the three acceptable programming languages, according to Hartmut Kaiser, we were pointing out the fact that I am an implementer of a C++ algorithms library, and you are now a consumer. So why don't you explain a little bit about where you come from?
Starting point is 00:08:53 So, yeah, in a nutshell, I work for a team called RAPIDS. That's all capital letters. At NVIDIA. Yeah, at NVIDIA. And this is sort of a newer kind of team for NVIDIA in that we're entirely open source. So Thrust is obviously an open source library. But I think that's probably the first open source one, or at least the first major one. It was the first open source library at NVIDIA. It is funny that both, you know, NVIDIA is a company not historically known
Starting point is 00:09:26 for open source, but both you and I work at NVIDIA and we both pretty much spend 100% of our time working on open source code or on open standards. Yeah. So the RAPIDS team, like, from the bottom up, was built to be, like, a remote open source team. So all of our development happens on and through GitHub. I rarely have to, like, log into VPN, because my day job is, like, working on PRs and issues that are posted on the RAPIDS project. So specifically, I work on cuDF. The cu stands for CUDA, and DF stands for DataFrame. But RAPIDS is a number of projects, not just cuDF. But basically, what RAPIDS is designed to be is an end-to-end data science pipeline that
Starting point is 00:10:16 runs on the GPU. And each of the projects or repos within RAPIDS sort of targets an existing library that exists in sort of the data science ecosystem, but that doesn't run on the GPU. So specifically, cuDF targets, or tries to sort of maintain parity with, a Python library called pandas, which is extremely popular. So if you're a Python dev, I'm sure you've heard of pandas. I believe we have cuML. The ML in that repo stands for machine learning. That targets scikit-learn, I believe. And then there's a ton of other repos and projects, cuSpatial, cuGraph. I won't go through them all, but it's a super, super exciting team.
Starting point is 00:10:58 I write in C++14. We should be upgrading to C++17 in the next, I don't know, year or so. I think it depends. And it plays well into your background too, because you actually originally were schooled in actuarial science, which is a good fit for data science. Yeah, I went to school to study actuarial math, and there are a lot of sort of finance applications, I guess, that leverage data science. I would say, though, a lot of the work that I do is sort of abstract, with no specific domain in mind, which I actually love, because I really care more about code quality and the code that I'm writing than I do, actually, about the product
Starting point is 00:11:59 that's being built at the end of the day. I think it's awesome that there's a lot of, like, national labs and a lot of, like, really, really big corporations that need to crunch a ton of data that are leveraging this library. But at the end of the day, like, I am at heart, like, a library writer. I'm not, like, an app developer. Yeah. We're both fundamentally framework people. You know, one of the analogies I like to use is that, you know, software is sort of like a library. There's people that author the books, the library authors, and then there's people who come in and read the books and use knowledge from the books to build actual concrete things. That's people who build, you know, end applications. And then, of course, there's, you know, the people who define the languages that you use
Starting point is 00:12:52 to write the books, which we also do a part of, because a large part of, you know, what we do is, or a large part of what I do, less so you, is work on programming language standards. Specifically, I spend a lot of my time on the C++ standard. But, you know, the data science world is actually sort of a good intro into why we both think transform_reduce is so awesome. So before we do that, though, I just realized I went on a ramble about, you know, RAPIDS and why I think RAPIDS is awesome, and failed to mention how that makes me a consumer of Thrust, which is the connection. So RAPIDS
Starting point is 00:13:38 is heavily built on top of Thrust, which is fantastic, because when I came to NVIDIA, and even still now, I do not consider myself, like, a GPU computing expert. I'm not anywhere near, like, the top of the distribution when it comes to writing CUDA kernels. But most of the time we're not writing CUDA kernels in RAPIDS; we're just leveraging the work that's already been done in Thrust. And so if you're familiar with the STL algorithms, you basically just, instead of writing std::, you just write thrust::, and you're programming on the GPU. Fantastic. So yes, that's how I'm a consumer of, or RAPIDS is a consumer of, Thrust. But yes, back to transform_reduce. Well, and like, let's say, here's a data science problem. Let's say that you've got census data, let's say,
Starting point is 00:14:27 US census data as one big data set. And you want to count some property of that data set. For example, you want to determine how many people are over 30 years old in this huge data set. Well, one way you can do this is with transform_reduce. So in that case, your transformation function is a function that takes, you know, a record for a person in this data set and returns true if that person is older than 30 and false otherwise. And then your reduction operation just, you know, adds up all the trues
Starting point is 00:15:20 into some integer value. And you probably are using some big int type if your data set is really truly large. And this is a good approach to implementing this, because transform_reduce is inherently something that we can easily parallelize. The transformation is embarrassingly parallel. And then the reduction part, well, we know how to parallelize reductions. That's a fairly well-known science. And it can be done fairly efficiently. And so that's one really good example of how you can use transform_reduce anytime that you need to count a property.
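A rough sketch of the counting pattern described above, using std::transform_reduce with a parallel execution policy. The Person record and its field are invented for illustration, and a parallel backend (for example TBB with some standard library implementations) may be needed for std::execution::par to actually run in parallel:

```cpp
#include <cstdint>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical census record; only the field we need is shown.
struct Person { int age; };

std::int64_t count_over_30(const std::vector<Person>& people) {
  return std::transform_reduce(
      std::execution::par,               // allow a parallel implementation
      people.begin(), people.end(),
      std::int64_t{0},                   // initial value of the reduction
      std::plus<>{},                     // reduce: add up the ones and zeros
      [](const Person& p) -> std::int64_t {
        return p.age > 30 ? 1 : 0;       // transform: record -> 0 or 1
      });
}
```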
Starting point is 00:16:16 So here's a question. While I'm listening to you explain that: you've got some record, you're turning that record into basically a one or a zero based on some property, and then you're just adding up those ones and zeros to get a total count. In my head, and I'm sure in some of the listeners' heads, they're thinking, why not just use a std::reduce or a std::accumulate, where your accumulator is the total count, and each element that's being passed into your reduction binary operation is doing the transform and the reduction at the same time? So it's taking the record and using some ternary expression or if statement: if greater than 30, or whatever your age is, add one, otherwise add zero. And this is something that I misunderstood, that there's a reason that that actually doesn't work when you are parallelizing your algorithm. You can't just take a std::accumulate that's doing that reduction and change it to a std::reduce.
Starting point is 00:17:19 Do you want to explain why that is? Well, yeah, because in that case, that sort of reduction that you're describing, that reduction depends upon things happening in a certain order, right? It depends upon a left fold, right? Correct. Yeah. And a parallel reduce cannot be constrained in that way. If you always have to fold from the left, then you can't actually really parallelize things, right? Because then, you know, the second reduction that you're doing in the set is always going to depend on the first one, and the third one will always depend on the second one, etc., etc. And so the algorithm
Starting point is 00:18:15 becomes inherently sequential. And so a parallel reduce cannot make that guarantee, and thus something that might be a very natural way of writing the code sequentially doesn't work in parallel. You have to adhere to, like, the two properties of the reduction operation for both std::reduce and std::transform_reduce, which is that it needs to be associative and commutative. And in the binary operation that was doing the reduction using a std::reduce, when your first element is the total count, which is some sort of integer or big integer type, and the second element is, like, your record type, there's no way that's going to work. So you can say, yes, you'd need a fold left for that. Another way to say it is, like, you don't have an associative and commutative operation. So you need the transform from the transform_reduce to, like, modify the data that you have, such that the reduction that you do doesn't depend on order, that is, it is commutative and associative.
Starting point is 00:19:35 So in this case, we're turning our record into just an integer or a big integer, which then when we have sort of like the running sum and a one or a zero, it doesn't matter what order you add those up, you're always going to end up with the correct answer at the end of the day, which is like, it's a super important insight, which I didn't have for, I don't know, a number of weeks. And if you don't know what, you know, associativity or commutativity means, here's a simple mental test that you can use to think about whether a reduction in your code is going to be paralyzable or not. If that reduction has heterogeneous types, so the left-hand side is a different type than the right-hand side. That probably means that
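For contrast, here is a sketch of the accumulate-based formulation from the question above, where one binary operation does the transform and the reduction at once. It is fine for std::accumulate, which is a guaranteed left fold, but the operation mixes types (a running count on the left, a record on the right), which is exactly why it cannot simply be handed to a parallel std::reduce:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Same hypothetical record as in the earlier sketch.
struct Person { int age; };

std::int64_t count_over_30_sequential(const std::vector<Person>& people) {
  return std::accumulate(
      people.begin(), people.end(), std::int64_t{0},
      [](std::int64_t count, const Person& p) {  // heterogeneous operands
        return count + (p.age > 30 ? 1 : 0);     // fold must run left to right
      });
}
```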
Starting point is 00:20:28 And if you don't know what, you know, associativity or commutativity means, here's a simple mental test that you can use to think about whether a reduction in your code is going to be parallelizable or not. If that reduction has heterogeneous types, so the left-hand side is a different type than the right-hand side, that probably means that you are doing some stateful reduction into one of those two types. You know, the left-hand side has some accumulation of the running state of the algorithm. And that means that it's not an algorithm that you can, or that's not a pattern that you can, parallelize. Yeah, that is a good way to think about it. Yeah. So another thing we should talk about is the arity of transform_reduce. So actually, the first proposal I ever wrote on the C++ committee was about this. The sequential version of transform_reduce, inner_product, takes two
Starting point is 00:21:30 input sequences, and the transformation function that it applies takes one element from each of those two input sequences. And the parallel version, we didn't have that. The parallel version, we were just going to ship transform_reduce, just one form of it that took only a single input sequence. And I wrote this paper saying, hey, no, it's actually really useful to have the binary one. You can do very cool things with it, like a word count algorithm, or something like doing a dot product with this binary transform_reduce that takes two input sequences. And something that users of the Thrust library that I work on do is something that I think you've come to call a zip reduce, where, what if you have three input sequences, or four input sequences, etc.? Yeah, zip reduce is one of the names that I mentioned in a couple of my talks. I think transform_reduce at the end of the day is the correct name, because, as you point out, there is both a version that takes a single sequence or a single range and one that takes two sequences or two ranges. And in the single range version, you're not doing any zipping operation.
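A small sketch of the binary ("zip reduce") form being described here: two input sequences, with the transform taking one element from each. With the default operations of this overload (plus for the reduction, multiplies for the transform), it is just a dot product:

```cpp
#include <numeric>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  // Assumes b has at least as many elements as a.
  return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0);
  // Equivalent to the explicit form:
  // std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0,
  //                       std::plus<>{}, std::multiplies<>{});
}
```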
Starting point is 00:22:58 But internally, I think of the two range version as an algorithm called zip reduce. And in fact, that is exactly what it's called in certain functional programming languages. Like, in Haskell, the two range zipping and sort of transforming algorithm is called zipWith, which I think is, it's great that they highlight the zip. I'm not sure about the "with," but definitely I like the zip. Yeah. And we don't, or we did not, in the original C++ algorithms have an algorithm like zip,
Starting point is 00:23:36 although now we have that with ranges, right? It is proposed in the, you know, A Plan for C++23 Ranges paper. So depending on how well you do getting stuff through LEWG, it might end up in C++23. We should explain. I am the chair of the C++ committee's library design group. So my ability to manage a standardization process can have a very real impact on whether or not we're able to get cool stuff into the next revision of the C++ standard. But no pressure or anything. Yeah, no pressure. So a question. It sounds like, actually, I always thought, this came up, and I was on
Starting point is 00:24:28 CppCast last week, and this came up that I had mentioned that TransformReduce was sort of the revamped parallel version of the STL algorithm inner product. And I pointed out the two differences, one being that, you know, there's tighter restrictions on transform reduce, and that transform reduce also takes a single range, whereas inner product, you can only take two ranges. And so I had always thought that transform reduce was actually just like an evolved version of inner product, where they added the extra single range. But it sounds like if you're writing a proposal to add the two ranges, that transform reduce started out as just like an entirely,
Starting point is 00:25:13 like it was designed as an algorithm on its own, not as an evolution of std::inner_product. Do you know if that's the case? Yeah. So the C++ parallel algorithms are based on the Thrust library that I now work on. But at the time that they were being standardized, I worked on this parallel runtime called HPX and didn't work at NVIDIA and wasn't particularly familiar with the Thrust library. And the Thrust library had introduced transform_reduce, and I think the mental model there was, it's an extended version of reduce. It's a more efficient thing
Starting point is 00:25:54 to do than to do a transform followed by a reduction. And I think that's why it takes only a single input sequence, because reduce only took a single input sequence. Whereas in HPX, we always thought of it as naturally being the parallel version of inner_product. Although I should mention, Thrust does have a parallel version of inner product under the name inner_product. And that may have been a part of it, too, that the original plan was for transform_reduce to just take a single input and for inner_product to take two inputs, but for there to both be a sequential and a parallel version.
Starting point is 00:26:50 And this is actually one of the big design questions about the parallel algorithms. Should we have parallel versions of the algorithms with the same name, even if the parallel version cannot provide the same guarantees as the sequential version? And in some cases, in most cases, we said no. If the parallel version requires fundamentally different guarantees, then we should give it a new name. And that's why in C++ we have both accumulate, a sequential algorithm that does reductions, but in a certain order, and reduce, which has both a sequential and a parallel version, but makes no guarantees about the ordering and requires commutativity and associativity. Interesting.
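A quick illustration of that naming distinction, using an order-sensitive operation. std::accumulate is specified as a left fold, so subtraction behaves predictably; std::reduce is free to regroup and reorder, so it only makes sense with an associative and commutative operation such as plus:

```cpp
#include <functional>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> v{1, 2, 3, 4};

  // Guaranteed left fold: ((((100 - 1) - 2) - 3) - 4) == 90.
  int a = std::accumulate(v.begin(), v.end(), 100, std::minus<>{});

  // Fine for std::reduce: plus is associative and commutative, result is 110.
  // Passing std::minus here instead would give an unspecified result.
  int b = std::reduce(v.begin(), v.end(), 100, std::plus<>{});

  (void)a;
  (void)b;
}
```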
Starting point is 00:27:52 So I actually did not know Thrust had, I'm staring at the docs right now, we'll put it in the show notes, but inner_product. So thrust::inner_product is a parallel version of the STL inner_product without the requirements on the binary operations. Is that, that's correct? Yeah, that's right. Yeah. Interesting. Yeah. So I would argue that that is something of a historical mistake in Thrust, that we probably didn't really mean to do that, but it's, you know, it's a little bit too late to fix it now. Do what specifically, have both inner_product and transform_reduce, or...
Starting point is 00:28:32 Oh, no, to have a parallel inner_product under that name was probably a mistake, because we don't have in Thrust an algorithm called accumulate that's parallel. We don't have in Thrust an algorithm called iota that's parallel. That's out of recognition that those named algorithms in the standard library were inherently sequential. And so when we introduced those into our parallel algorithms library, we gave them different names, so that people would not associate the semantics of the sequential one with the semantics of the parallel one. There's other algorithms, like transform, where there is no difference in guarantees, aside from guarantees around race conditions, between the sequential and the parallel version.
Starting point is 00:29:33 And those guarantees around race conditions being essentially, you know, okay, your transformation functions need to be thread safe. But there's no big difference like, hey, you know, the accumulate supports non-commutative, non-associative operations, whereas, you know, reduce doesn't. And so, you know, there's a lot of power in names, because names associate semantics with them. And I think it was probably a mistake for us to use the name inner_product, which has an association with those guarantees that std::inner_product makes.
Starting point is 00:30:21 To use that name for a parallel algorithm that doesn't make those guarantees is probably a mistake. Interesting. Yeah. Well, I feel like we've gone over time once again, and I feel like there was a third topic we were supposed to cover, because we were supposed to revisit all the stuff we said we would talk about in episode one, and then we just sort of... Talk about triangle products. Oh, yeah, yeah. I can very quickly wrap up this episode by just saying that, if you recall from episode zero what outer product was, to recap, it's basically a Cartesian product, where you take two lists and you generate all possible combinations of the elements of those two lists, except a Cartesian product just produces a flat list,
Starting point is 00:31:09 whereas outer product produces a matrix with the dimensions of the lengths of your two lists. A triangle product basically avoids the duplicates. So if you imagine yourself with a matrix, just looking at the upper triangular matrix of that sort of matrix that you started with is what triangle product does. Because a lot of times when you're using outer product to perform some binary operation on pairs of elements from two lists, you don't actually want to do the operation twice for, you know, the pair at index 2, 10. You don't want to also repeat the operation for the pair of elements at 10, 2, which is basically the same elements just reversed. Obviously, if your binary operation is something
Starting point is 00:32:02 like minus, which isn't commutative, you're going to want to perform that in both directions. But for something like multiplies or plus, you don't actually need to do that. And you can save yourself, like, half of the computations. So a triangle product is basically just an outer product that only takes the upper triangular of the matrix, which is a common pattern in array-oriented languages. But as I mentioned in episode zero, I have not found an algorithm in any language that does this kind of thing. But probably our listeners are thinking it's super niche, which it is. But when you want it, it's irritating to have to compose it yourself.
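Since no standard algorithm exists for this, here is one possible sketch of a "triangle product" over a single list: apply a binary operation to each pair of elements exactly once (every i < j), which corresponds to the upper triangle of the outer product:

```cpp
#include <cstddef>
#include <functional>
#include <type_traits>
#include <vector>

template <class T, class BinaryOp>
auto triangle_product(const std::vector<T>& xs, BinaryOp op) {
  std::vector<std::decay_t<decltype(op(xs[0], xs[0]))>> out;
  if (xs.size() > 1) out.reserve(xs.size() * (xs.size() - 1) / 2);  // n choose 2
  for (std::size_t i = 0; i + 1 < xs.size(); ++i)
    for (std::size_t j = i + 1; j < xs.size(); ++j)
      out.push_back(op(xs[i], xs[j]));  // each unordered pair visited once
  return out;
}

// Example: triangle_product(v, std::multiplies<>{}) computes v[i] * v[j]
// for every i < j, roughly half the work of the full outer product.
```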
Starting point is 00:32:44 And I've run into it many times. So I think with that, we talked about transform_reduce more comprehensively, we actually talked about where we work and what languages we've worked in, and we talked about triangle product. Anything else we need to mention before we call it? I think we should call it. All right. And are we going to be releasing these on Fridays or Tuesdays? I think that's... I'm going to delegate that decision to you. Okay.
Starting point is 00:33:15 Our first one was released on Friday. We might release this one on Tuesday, because apparently Tuesdays might be better than Fridays for some reason. Yeah, there's some data to that effect. Alrighty. With that, thanks for listening. We'll see you in the next one.
