Algorithms + Data Structures = Programs - Episode 2: Our Favorite Data Structures
Episode Date: December 4, 2020
In this episode, Bryce and Conor talk about each of their favorite data structures.
Date Recorded: 2020-11-28
Date Released: 2020-12-04
C++ | Containers
OCaml | Containers
Java | Collections
Python | Collections
Kotlin | Collections
Scala | Collections
Rust | Collections
Go | Collections
Haskell | Collections
TS | Collections
Ruby | Collections
JS | Collections
F# | Collection Types
Racket | Data Structures
Clojure | Data Structures
What do you mean by “cache friendly”? - Björn Fahller - code::dive 2019
Alan J. Perlis’ Epigrams on Programming
std::vector
P1072 basic_string::resize_default_init
std::array
std::unique_ptr (Array Specialization)
P0316 allocate_unique and allocator_delete
thrust::allocate_unique
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
I'll try saying loud things because I'm a loud person.
Stop.
Stop laughing.
It's just great because I'm the one that gets to cut this stuff.
And you know I'm going to put in saying loud things.
Saying loud things.
Oh, that's unfortunate because that would have been like perfect research for this episode, right?
Yeah, yeah.
But that's the thing.
It's like it's one search away.
Yeah, I found it.
I found it.
Oh, yeah, yeah. This is it. Exactly, yeah. And there are two, I recall, yeah, there's two, recorded on November 28th, 2020.
My name is Conor, and today with my co-host Bryce, we're going to be talking about what our favorite data structures are.
Hey, man, how's it going?
It's pretty good, how are you doing?
I'm enjoying the one true American Thanksgiving down here while you Canadian heathens are doing your thing.
We had the real Thanksgiving a month ago, so I don't know what you're talking about.
You know that there's more people in the state that I live in than in your entire country?
That, is that true?
Yeah, like, I think there's, like, 35 million Canadians. And British Columbia, our most western province, is larger than the combined geography of your three coastal western states, California, Oregon, and Washington.
So you win some, you lose some.
I'm not surprised.
I actually really like British Columbia. My mom, my stepdad, and I went up to visit, like, Victoria Island.
Vancouver Island
Yes, Vancouver Island
Vancouver
Victoria
Victoria is the name of the capital
That's on Vancouver Island, is that right?
Yes, that is true
Yeah, yeah. And we actually, we did it in October, so it was like rainy and dreary and kind of a little bit like depressing weather, but I had a nice time. I really liked it. And I really like Seattle, um, I really like that whole vicinity. Yeah, Seattle reminds me of New England where I grew up.
Yeah, I studied in Vancouver.
Oh, you did? That's right.
It is a very, very beautiful... Yeah, for those that, I guess, because we're just chatting now, but I currently live in Toronto.
I was in the Bay Area, which is where Bryce and I met for just over a year,
but I grew up in British Columbia in the middle of nowhere and then
studied in Vancouver, which is, yeah.
What was the name of your town?
You once told me it was the equivalent of the New Jersey of Canada.
It was like the armpit of Canada.
So the armpit of Canada, that's a joke that refers to Hamilton, which is in Ontario.
Which is where you currently live.
No, well, yeah, I live in Ontario, but not in Hamilton.
Hamilton has got its redeeming qualities, though.
It's also known as the waterfall capital of Canada because it's got a lot of waterfalls.
But my hometown is, I'm sure most of the listeners have no idea where this is,
but it's a small city called Prince George, which, yeah, it's a city of trees and trucks. A lot of trees and a lot of trucks.
Yeah. So I think today we were supposed to talk about our favorite data structures, right?
Yes. The follow-up, we ended up having to do two episodes on our favorite algorithms because apparently we can't stay on topic and we start topics and don't finish them. Um, but the follow-up to that is what is each of our respective favorite data structures.
I have a feeling that this is going to end up taking two episodes too, potentially. Uh, who's going first? Are you going first?
Well, so... I think first we should talk about the word data structure.
So a couple months ago, we were creating this new library at NVIDIA, which is now called cuCollections.
And we were trying to figure out what to name this library.
It was a library of containers or data structures or collections, whatever your term is for it. Um, and we wanted to pick a name that would be a good name, and we couldn't call it cuContainers because that would be too confusing and ambiguous with, um, you know, like Docker and Kubernetes, et cetera. Like there's already an NVIDIA Docker project, which is, you know, not that far away from, sorry, there's already an NVIDIA container project, which is not that far away from a cuCollections or cuContainers project.
And so we wanted something that was, like, you know, not going to be ambiguous with containerization.
And so we asked you to do some research on what languages call data structures.
And you came up, you looked at, I think, 12 or 15 languages.
And it seems like most languages call them collections, so like Java, Python, Kotlin. Only C++ and OCaml call them containers, and only Racket and Clojure call them data structures. And I'm sure there's other languages that weren't in the set that you researched, but it seems like collections is the most popular term, and that's why we called our library cuCollections.
But to me, there's always been a distinction between these terms, that they're not just the same thing.
Like mentally my model is that data structure refers
to a particular implementation strategy,
like a particular way of implementing some sort of container.
Whereas container refers to an abstract interface that has,
you know, certain guarantees. And maybe this is a very C++-y way of looking at things. But
like as an example, I think of std::map as being a container. It's like that defines an interface for something.
Whereas the way that you might implement
a std::map is with a red-black tree,
which is a particular data structure.
And this mindset, I think, is more useful if you consider things like the container adapters in C++, like std::stack, where you can plug in multiple different template parameters to change how it's implemented.
And then it exposes out this container interface. So that's always been the distinction for me,
that a container is about the interface it exposes,
whereas data structure is the term for the actual implementation of it.
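A minimal sketch of the container-adapter idea being described here (standard C++ only; the element type is just illustrative). The same std::stack interface can be backed by different underlying containers:

#include <deque>
#include <list>
#include <stack>
#include <vector>

int main() {
    // std::stack is a container adapter: one interface, pluggable implementations.
    std::stack<int> on_deque;                       // defaults to std::deque
    std::stack<int, std::vector<int>> on_vector;    // contiguous storage underneath
    std::stack<int, std::list<int>> on_list;        // node-based storage underneath

    on_deque.push(1);
    on_vector.push(2);
    on_list.push(3);
    return 0;
}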
And collections is just never a term that I've ever used,
so I've never really thought about it.
But I guess that term seems to me like it's more closely,
it just seems like an analogy for containers,
not an analogy for the meaning of data structure
that I just defined.
Yeah, I think that's,
I did not until now have that mental model,
but that makes complete sense to me.
And for those of our listeners
that aren't C++ developers,
the red-black-tree data structure that Bryce mentioned
is exactly, I think, how most of the standard libraries
implement std::map.
And if you have done any C++ development, you might have seen, when you run into template errors every once in a while, there's like underscore, underscore, Rb, colon, colon, and the Rb in that case stands for the red-black tree.
But it doesn't need to be implemented that way, so that I think makes complete sense.
Yeah. Can you think of other ways that you can implement std::map?
I know that most of the standard libraries do it with a red-black tree,
but I think there are other ways that you could do it, right?
I think any kind of self-balancing tree that keeps the elements sorted, right?
There's a bunch of them, but I...
Yeah, I think that's right.
I never took that course in school.
I only have a math degree, and as we've established in previous episodes, I was a very bad student,
so...
But yeah, I think any data structure that is sort of self-balancing, or doesn't even need to be self-balancing, as long as the elements stay sorted.
And I guess std::map does make guarantees about the time complexity of its operations.
So I guess you can't actually, if you want to be standard compliant, you can't just use any data structure.
You have to use ones that adhere to the complexity guarantees that the standard specifies.
But other than that, yeah, there's a bunch of options, I'm sure.
Well, and that gets into an interesting notion from C++ world, which is this idea that a concept or an interface isn't just about syntax,
but it's also about semantics.
You could have something that provides all of the operations that std::map does and even does the same thing in all the cases as what std::map would do, but if it has a different complexity for some of the operations, we might say that it doesn't satisfy the concepts that std::map satisfies, because those concepts have certain complexity requirements associated with them, because concepts have semantic requirements as well.
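A small illustration of that point, as a sketch using a made-up C++20 concept (MapLike is not a standard concept): the compiler can check the syntax, but the complexity and semantic guarantees cannot be expressed in code.

#include <concepts>
#include <cstddef>

// A concept can only describe syntax: it cannot say "find is O(log n)" or
// "iteration visits keys in sorted order". Those semantic and complexity
// requirements live in documentation, not in the requires-expression.
template <typename M, typename K>
concept MapLike = requires(M m, K k) {
    typename M::iterator;
    { m.find(k) } -> std::same_as<typename M::iterator>;
    { m.size() } -> std::convertible_to<std::size_t>;
    { m.count(k) } -> std::convertible_to<std::size_t>;
};

// std::map<int, int> satisfies MapLike<..., int>, but so could a type with the
// same member functions and completely different complexity guarantees.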
Yeah.
And while you were saying that, I actually did think of,
so I'm not actually sure if you can swap out this data structure
for the Red Black Tree that's currently used.
But even if it doesn't work, it relates to a great talk that I saw.
I believe it was at the 2019 Meeting C++, or sorry, the 2019 code::dive conference in Wroclaw, Poland.
I could get this wrong, but I'm almost positive it was Björn Fahller that gave a talk on sort of things he wished he had known. He takes a single example that is really inefficient and progressively, like, modifies things about it and switches the underlying data structure that he's using. And ultimately, he ends up with something that's implementing a B-tree, which is, I think, similar to a red-black tree, the difference being that it takes advantage of, you know, adjacent elements stored in memory so that you can have better cache coherency.
And it's a fantastic talk.
So even if you can't use a B-tree, I'll put it in the show notes and everyone should go
watch it, even if you're not a C++ dev.
Cool.
Okay, so back to our question.
What is your favorite data structure?
Am I going first?
Yeah, you're going first. A quote came to mind by a prolific individual from sort of the history of computer science, Alan J. Perlis.
For those of you that don't know, he was the first recipient of the Turing Award in 1966.
And he's famous for sort of a number of reasons, but one of them that a lot of people know is that he has this sort of collection of quotes called
epigrams on programming.
And one of those quotes is the following.
It's better to have 100 functions operate on one data structure than 10 functions operate
on 10 data structures.
And this quote is actually what I care more about than the particular data structure.
And so this leads me to sort of my favorite data structure in C++, or I guess in this case it's more of a container, not a data structure, but it's std::vector. Partially because, you know, of what Chandler said, you know, always reach for std::vector; basically half the time it's going to be more efficient than whatever data structure you're reaching for, unless it's like a niche case. And I think Tony Van Eerd, another individual in the C++ community, he also said, like, anytime you are not using a std::vector, you should add a comment and say why, because std::vector is so versatile. But yeah, it's my favorite data structure, mostly just because it's got, you know, constant time complexity for most things.
And then insertion is amortized constant,
which is fantastic.
And when you combine it with that quote,
it's even more than just like having 100 functions,
aka our STL algorithms that can work on it.
We almost have in C++ like 100 functions slash algorithms that operate on 10 data structures. So like in the quote, it's 100 on one or 10 on 10, and this is sort of implicitly referring to something known as the M times N versus the M plus N problem, where, due to the fact that we have iterators as this sort of layer between our algorithms and our data structures, they're generic, and so a lot of the algorithms work on different containers, or data structures if you will, because of the iterators.
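A minimal sketch of that M plus N idea (standard containers and algorithms only): one algorithm, written once against iterators, works on several different containers.

#include <algorithm>
#include <deque>
#include <list>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2};
    std::list<int>   l{3, 1, 2};
    std::deque<int>  d{3, 1, 2};

    // std::find is written once against iterators, but works over any of these.
    auto in_v = std::find(v.begin(), v.end(), 2);
    auto in_l = std::find(l.begin(), l.end(), 2);
    auto in_d = std::find(d.begin(), d.end(), 2);

    (void)in_v; (void)in_l; (void)in_d;
    return 0;
}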
So we almost have like 100 operating on 10, and of those 10, std::vector is the one I reach for the most.
Yeah. And to explain that a little bit more,
one of the reasons why std::vector is almost always the right choice is because of performance.
A std::vector gives you a contiguous block of memory. And in particular on, you know, sort of our modern contemporary hardware, that locality is almost always going to beat out any theoretical advantages that another data structure might have.
You know, there are weaknesses to std::vector. You know, it has to, every now and then,
it has to go and do these reallocations to grow. You have to have a good growing algorithm. It can
sometimes, because of that growing algorithm, it sometimes can be memory inefficient.
And like doing insertion in the middle is obviously, you know, not a very cheap operation,
but you might go and look at using a different data structure that might be better at some of
these things, like std::list, which is a doubly linked list. You might think, okay, well,
you know, I know I need to do a bunch of insertions in the middle or insertions at the front.
So I'll use list in
this algorithm instead of vector because I'm anticipating that. But with a doubly linked list,
you're doing pointer chasing, right? All the memory is disjoint. And so because of that,
you're paying this regular penalty of you're not walking through memory contiguously,
which is what modern hardware is meant to do. And so, yes, you might get this benefit with the
doubly linked list when you're, you know, inserting an element in the middle, but you're paying this
penalty in every memory access and in the memory access pattern itself. And so that often,
that overhead of that less efficient memory access pattern often makes it a wash or makes
using the list slower. So sort of unintuitively, there are some cases where, even for the things that a vector is bad at, it would be better than the competition
solely because it has this nice property that it's contiguous. And it being contiguous means
it's easier for it to be cached and it's easier for it to be prefetched. If you look at modern
CPUs, they have special hardware that's designed for streaming through memory. That if you're in a loop
and you're accessing an array of memory contiguously,
there's like special hardware
that's designed to figure out,
hey, we're in a loop
and we're sequentially accessing
this contiguous range of memory.
So like we should prefetch and cache accordingly.
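A rough sketch of the trade-off being described here (timings vary by machine; this is illustrative, not a rigorous benchmark): summing a std::vector walks one contiguous block, while summing a std::list chases a pointer per element.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<int> v(n, 1);   // one contiguous block of n ints
    std::list<int>   l(n, 1);   // one separately allocated node per element

    auto time_sum = [](auto const& c, char const* name) {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: sum=%lld, %lld us\n", name, sum, (long long)us);
    };

    time_sum(v, "vector");  // contiguous: friendly to caches and prefetchers
    time_sum(l, "list");    // pointer chasing: each step may be a cache miss
    return 0;
}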
And so, yeah, so that's why oftentimes
the sort of naive data structure of vector is the
right answer as opposed to something more complex. And I really like the first thing you mentioned, this idea that, you know, sort of fewer data structures might be better.
That oftentimes you want more algorithms to operate on fewer data structures.
I think a lot of times people question why doesn't C++'s standard library have
a richer set of data structures like some other languages. And I think it's largely about that
philosophy, that this idea that you don't necessarily need more complex data structures
and that those tend to be very specialized for particular use cases. Instead, what you need is
a core set of really solid data structures and then a set of composable algorithms that build
on top of them. One of the key things about the design of the C++ standard library is separation of concerns.
For data structures in particular, with the exception of std string, we don't have these big classes that have a billion different algorithms as member functions.
Instead, we have fairly simple data structures that expose abstractions like iterators and ranges.
And then we have a separate set of algorithms that consume iterators and ranges. Yeah, so there's a bunch of things I want
to comment on. So the first one where you were talking about pointer chasing, that ties in
perfectly to the talk that I mentioned by Björn Fahller, in that one of the very first things in his talk that he mentions is that he didn't really have a notion of, you know, the difference between data structures that do pointer chasing and, like, ones that have, you know, contiguous memory and ones that don't. And he said that like a good rule of thumb,
you know, this isn't always true, but just like his mental model that he has is whenever you are
chasing a pointer, you can basically consider that a cache miss, which
exactly ties into what you're thinking.
It leads to this counterintuitive, well, like, oh, I'm reaching for a data structure that
has constant insertion time, which brings me to the second thing.
I said a std::vector has amortized constant insertion.
I should have been more specific in saying that.
Specifically, that's referring to push_back, the method push_back that inserts at the end of the array.
An insertion into the middle is going to have linear time complexity.
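A small sketch of that distinction (standard C++ only; sizes are just illustrative): push_back at the end is amortized constant, while insert in the middle has to shift every later element.

#include <cstddef>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);   // amortized O(1): occasional reallocate-and-move, cheap on average

    // O(n): every element after the insertion point is shifted over by one.
    v.insert(v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2), -1);
    return 0;
}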
But that ties in exactly to what you said, that it leads to this counterintuitive, well, I'm reaching for a data structure that supposedly has constant insertion versus linear insertion for a std::vector. But due to the cache misses, it's going to erode some of the, you know, quote unquote, complexity gain that you're getting from switching to a data structure that doesn't have, you know, contiguous memory like a std::vector.
And yeah, well, and, you know, part of that is, I think, a lot of people and a lot of traditional literature spend a lot of time looking at time complexity of algorithms, right? Whereas in practice, you do need to think about what the constants are.
You know, okay, this thing has O(1) insertion, but what is the constant? You know, like, what is the actual constant latency?
Or, okay, this thing has linear, you know, insertion in the middle, but if I only have
100 elements, how expensive actually is that?
And when you start looking at those constants, that's when you see, oh, I only have 100 elements.
Insertion in the middle is just going to be cheap for me if this thing is always going
to be like that size.
So, whereas using the list, all of the operations are going to have this greater overhead, so I may as well just use the vector. And like, sure, it's linear insertion here, but in some cases that's fine.
Yeah, that's a fantastic point, and it applies not just to, um, data structures, it also applies to algorithms. Like, I think it was a couple months ago, a coworker came and asked me, I'm trying to do a binary search, but I need to look at two elements at a time.
My immediate thought was adjacent_find, but that's not a binary search.
And then I followed up and I said, the binary search equivalent of that is partition_point. But then my next question was, do we really need a binary search here? Like,
do you know, on average, how many elements are going to be in the list or vector that you're searching over? Because, you have to profile, but I've heard a rule of thumb that if you have less than 80 elements, a linear time algorithm is going to be faster than a binary search algorithm.
I don't know where I heard that or if that's true.
You know, profile, you know, before you're making these claims.
But his response was, oh, it's less than 10.
In which case I said, yeah, like we don't need a binary search for that.
Just use an adjacent_find and call it a day.
So, yeah, it's like, like you said, you know, in academia, you always drop the constants or the coefficients.
But in the real world, those constants matter.
And profiling is very important.
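A sketch of the two approaches from this exchange (standard algorithms; the data is illustrative, and the right choice depends on profiling): adjacent_find does a linear scan over neighbouring pairs, partition_point does a binary search for where a predicate flips.

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> sorted{1, 2, 4, 4, 7, 9};

    // Linear scan over neighbouring pairs: for ~10 elements this is typically
    // as fast as (or faster than) anything cleverer.
    auto dup = std::adjacent_find(sorted.begin(), sorted.end());

    // Binary search: first element for which the predicate stops holding
    // (here, the first element >= 5), assuming the range is partitioned by the
    // predicate, which a sorted range is.
    auto first_ge_5 = std::partition_point(sorted.begin(), sorted.end(),
                                           [](int x) { return x < 5; });

    (void)dup; (void)first_ge_5;
    return 0;
}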
So it's funny that you said std::vector, because when we talked about algorithms, we both thought that both of our answers were going to be transform_reduce.
But then it turned out that, in fact, that wasn't both of our answers for what our favorite algorithms were.
And for this, I sort of figured that both of our initial gut reactions would be that std::vector, or a dynamic array, would be our first choice.
And it was for you.
But I actually am going to argue that it's not mine, and that a dynamic array is often too heavyweight.
And I actually think that you're going to agree with me.
I don't think that a dynamic array like std::vector is your favorite data structure.
I think you've just been tricked into thinking that, because the actual container that you need the most frequently doesn't exist in an easily accessible form in C++.
So let's talk about how we use a dynamic array.
So one of the key properties of a dynamic array like std vector
is that it's growable. That is that after I've created it, I need to add more elements to it.
But I think in a lot of cases, we know ahead of time how many elements we're going to put into that vector. And there is a pattern
that is very common that I see a lot. I call it like the default construct and reserve pattern
when you're using a vector where you will default construct the vector,
and then you'll immediately call reserve on it with the capacity that you need.
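A minimal sketch of that default-construct-and-reserve pattern (standard C++ only; the function and its contents are just illustrative): reserve sets capacity without creating elements, so the later push_backs never trigger a reallocation.

#include <cstddef>
#include <vector>

std::vector<double> make_values(std::size_t n) {
    std::vector<double> v;   // default construct: size 0, capacity 0
    v.reserve(n);            // one allocation; size stays 0, no elements are created
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<double>(i));   // no reallocation until size would exceed n
    return v;
}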
And for those who may not be aware, when we talk about capacity,
that's different from size.
So capacity is how many slots for elements does
the vector or the dynamic array have. And size is how many slots are currently in use. And every time the size needs to grow beyond the capacity, you have to reallocate the entire vector's storage and move everything into the new storage. So typically capacity is grown
with, you know, some internal algorithm, like you double capacity every time so that you don't have
to constantly be reallocating. But so let me just ask you, Conor, how many times have you constructed a std::vector and then, like, either immediately called reserve or immediately called resize, and then never increased the capacity after that?
Uh, probably more times than I can count. Although I will say, as a small caveat to this, I've discovered a new pattern using Thrust fancy iterators, which also exist in Boost as Boost iterators, where you can set up, like, transform iterators and then just call the range constructor and avoid the resize or the reserve, which I think is a super nice pattern.
Because then you can also declare your data structure container const by using those fancy iterators.
But that's only been recent in the last couple of months that I've discovered that pattern.
And before that, yeah, probably more times than I can count.
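A sketch of that pattern using standard C++20 ranges instead of Thrust's or Boost's fancy iterators (same idea, different library): the vector is built in one shot from a transforming iterator pair, so it can be declared const and no reserve or resize is needed.

#include <ranges>
#include <vector>

int main() {
    auto squares = std::views::iota(0, 10)
                 | std::views::transform([](int i) { return i * i; });

    // Range constructor: sized and filled in one pass, and usable as const.
    const std::vector<int> v(squares.begin(), squares.end());
    return 0;
}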
Yeah.
And so this sort of gets to one of the defects that std::vector has in C++, which is there is not a good way. One, there's not a simple way with std::vector to reserve on construction.
There is no constructor that lets you reserve capacity without also allocating elements.
So you can construct a vector with an initial set of elements, but then you have to pay to
initialize those elements. So like if I'm creating a one gigabyte vector of doubles, I could just call the vector constructor that says I want n elements that have this initial value.
But then I'd have to pay the cost of initializing that whole gigabyte of memory with this initial
value when all I really wanted to
do was just reserve that storage, and then later I was going to push back into the vector.
And that's why you have this common pattern of default construct it, and then you call reserve.
And there's also the case where you want to get uninitialized elements.
And there's actually a C++ committee proposal regarding this.
For example, let's say you've got like a std::string and you want to call a C library function that's going to initialize the contents of that std::string.
Well, there's not really a good
way to do that. The only thing you can do is you can resize the string, which then zero initializes
all the characters. And then you can call that library function or you can copy into it from
some external API. And that's a little bit unfortunate.
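A sketch of that awkward pattern (the fill_buffer function here is a hypothetical stand-in for a C API; non-const data() requires C++17): resize value-initializes every character right before they are all overwritten, which is the redundant work facilities like P1072 aim to avoid.

#include <cstddef>
#include <string>

// Hypothetical stand-in for a C library function that writes exactly len bytes.
void fill_buffer(char* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        buf[i] = 'x';
}

std::string read_into_string(std::size_t len) {
    std::string s;
    s.resize(len);               // pays to zero-initialize len characters...
    fill_buffer(s.data(), len);  // ...which are then immediately overwritten
    return s;
}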
And so there's this proposal P1072
that tries to address this
originally for std::vector and std::string, but now just for std::string.
But it is a slightly hazardous pattern
where you want to say,
oh yeah, I want to like add elements to this container, but I don't want them to be initialized.
Like that's how you get, that's how you lead yourself towards, you know, memory safety violations and all sorts of unpleasant, uninitialized variable hazards.
But if you really care about performance, it can matter. So I'm going to argue
that dynamic array is, in many cases, more than we need because in many cases, we know what the
capacity is going to need to be up front and it's never going to change. And so what we really need is we need a container that's
just a C++ equivalent of, you know, a C style array where I want to allocate it, I want to give
it a size, and I want to have control over whether and how those elements get initialized.
Now there are a few options in the C++ standard library if you want this data structure.
So if you know the size at compile time, you can use std::array, which is a compile-time-sized array with more or less the same interface as vector.
Of course, it can't be resized, etc. But I think the more common
pattern is that I just want a runtime-sized array. And there's no specific class for this in C++,
but you can express this with unique_ptr's array specialization. Are you familiar with unique_ptr's array specialization?
Walk us through it. So unique_ptr in C++ is an RAII type that owns some pointer that you, the programmer, have allocated, and it has a customizable deleter parameter which is used to clean up
this pointer that you've given it when the unique pointer is destroyed.
And it's called unique pointer because it is uniquely owning.
Only that unique pointer can have access to it.
You can't copy the unique pointer.
You can move the unique pointer, though.
And so this gives us some guarantees about, you know,
who has access to this pointer.
And in particular, because it's uniquely owned,
you don't have to worry about who's going to clean it up.
The answer is always going to be that there's just one instance of this unique_ptr.
And that's the one that's going to clean it up.
So you don't have to do reference counting or anything like that.
But unique_ptr is usually used for managing the lifetime of, like, a single pointer. But there's this specialization where, if you write like unique_ptr, angle brackets, and then like int with empty square brackets, so unique_ptr<int[]>, that will give you this specialization that is intended for arrays. And it will have an array access operator.
And by default, it will use an array delete.
And so this is a pretty nice way to proxy or to implement just a container that is just,
I want to allocate this array that has n elements
and that number of elements is never going to change.
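A minimal sketch of that array specialization (standard C++14 and later; the sizes and values are illustrative):

#include <cstddef>
#include <memory>

int main() {
    const std::size_t n = 1000;

    // Runtime-sized array, freed with delete[] automatically when `a` goes out
    // of scope. make_unique value-initializes (zeroes) the elements.
    std::unique_ptr<double[]> a = std::make_unique<double[]>(n);
    a[0] = 3.14;   // operator[] comes from the array specialization

    // For full control over allocation and initialization, hand unique_ptr a
    // pointer you allocated yourself (default-initialized here, i.e. not
    // zeroed). C++20 also adds std::make_unique_for_overwrite for this.
    std::unique_ptr<double[]> b(new double[n]);
    b[0] = 2.71;
    return 0;
}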
And it's also useful because you fully control the allocation of these elements.
All that you do is you give a pointer to unique pointer.
It doesn't create the pointer itself.
So you can control how you allocate that memory, how you initialize it.
So you don't have to worry about this case of, oh, well, I want to use a std::string, but I want to control how these elements get initialized and created.
With unique_ptr, you have full control.
All that it provides you is the safety of the resource acquisition is initialization
pattern, where when the unique pointer goes out of scope, you know that it's going to clean up
the memory using the deleter. So that's, I think, something that I reach for very frequently.
And there's a related facility called allocate_unique that's been proposed for the past two C++ standards but hasn't been very high priority. Um, and that just gives you a convenience for saying, hey, I have this C++ allocator and I want to create a unique_ptr that uses this allocator to allocate and deallocate the memory.
Interesting.
If we want to point our listeners at more resources on this pattern,
what would you recommend?
So we'll put a link in the show notes to the allocate_unique paper.
In our Thrust library, we have an implementation of allocate_unique, and then we have a few different overloads.
We have one for, you know, arrays, and I use it all the time.
Like, that's my go-to thing.
The paper is P0316. And it doesn't seem like it's a particularly important thing. But I think if you start thinking about all the places where you can use a unique_ptr, the array version, instead of a std::vector, you'd be surprised how commonly this comes up.
Yeah, it sounds like a common pattern
that potentially a lot of people just aren't reaching for
because it's not, you know,
stated right next to std::array and std::vector
in sort of the containers, docs, slash blogs that are written.
Well, and it's not perfect
because like it doesn't give you iterators, et cetera.
Like we could build a version of this.
And at one point in C++'s history, we sort of did.
There was this thing called DynArray that we tried to do,
but it ended poorly.
That was before my time.
I don't remember all of the details,
but we could do a better job here.
Yeah, it sounds like an interesting direction
for the future of C++ once we get everything that's supposed to go into 23 and the bleed over into 26.
That, you know, this could be a future proposal for C++.
Yeah. What's the thing that you're most excited about for C++23?
I don't know what will and will not be in C++23, but assuming that everything that's being quote-unquote targeted by individuals for C++23 makes it in, I'm most excited about seeing the extension to the ranges stuff.
On top of that, I think pattern matching is going to be awesome.
Executors, you know, everything that's going in, honestly,
like library support for coroutines,
I think that's going to open up the world of coroutines to a lot more people.
But yeah, at the top of the list, it's definitely the ranges stuff.
How about yourself?
That's hard to say. I'd probably say executors, but I also think that things like allocate_unique and any_invocable, which is a better version of std::function, I think those are going to have a lot of impact on people's day-to-day lives. So I think that those will be, uh, those will be big wins too.
All righty, I think we're once again past our 30-minute, uh, target time, but, uh, it's all right.
Is that our target time?
Yeah, well, you know, 20, 25, 30 minutes.
It's never a hard limit, though.
It's our podcast, we get to do what we want.
And I think we're going to have to do another one next week, because there's a whole other set of data structures that at least I want to talk about. And yeah, so maybe we'll do part two next week.
Sounds like a plan.
All right.
Awesome.
Thanks, everyone, for listening,
and we'll catch you guys in the next one.