Algorithms + Data Structures = Programs - Episode 2: Our Favorite Data Structures
Episode Date: December 4, 2020
In this episode, Bryce and Conor talk about each of their favorite data structures.
Date Recorded: 2020-11-28
Date Released: 2020-12-04
C++ | Containers
OCaml | Containers
Java | Collections
Python | Collections
Kotlin | Collections
Scala | Collections
Rust | Collections
Go | Collections
Haskell | Collections
TS | Collections
Ruby | Collections
JS | Collections
F# | Collection Types
Racket | Data Structures
Clojure | Data Structures
What do you mean by “cache friendly”? - Björn Fahller - code::dive 2019
Alan J. Perlis’ Epigrams on Programming
std::vector
P1072 basic_string::resize_default_init
std::array
std::unique_ptr (Array Specialization)
P0316 allocate_unique and allocator_delete
thrust::allocate_unique
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
I'll try saying loud things because I'm a loud person.
Stop.
Stop laughing.
It's just great because I'm the one that gets to cut this stuff.
And you know I'm going to put in saying loud things.
Saying loud things.
Oh, that's unfortunate because that would have been like perfect research for this episode, right?
Yeah, yeah.
But that's the thing.
It's like it's one search away.
Yeah, I found it.
I found it.
Oh, yeah, yeah. This is it. Exactly, yeah. And there are two, I recall, yeah, there's two, recorded on November 28th, 2020.
My name is Conor, and today with my co-host Bryce, we're going to be talking about what our favorite data structures are.
Hey, man, how's it going?
It's pretty good, how are you doing?
I'm enjoying the one true American Thanksgiving down here while you Canadian heathens are doing your thing.
We had the real Thanksgiving a month ago, so I don't know what you're talking about.
You know that there's more people in the state that I live in than in your entire country?
That, is that true?
Yeah, like, I think there's, like, 35 million Canadians. And British Columbia, our most western province, is larger than the combined geography of your three coastal western states, California, Oregon, and Washington.
So you win some, you lose some.
I'm not surprised.
I actually really like British Columbia. My mom, my stepdad, and I went up to visit, like, Victoria Island.
Vancouver Island
Yes, Vancouver Island
Vancouver
Victoria
Victoria is the name of the capital
That's on Vancouver Island, is that right?
Yes, that is true
Yeah, yeah. And we actually, we did it in October, so it was like rainy and dreary and kind of a little bit like depressing weather, but I had a nice time. I really liked it. And I really like Seattle, um, I really like that whole vicinity. Yeah, Seattle reminds me of New England where I grew up.
Yeah, I studied in Vancouver.
Oh, you did? That's right.
It is a very, very beautiful... Yeah, for those that, I guess, because we're just chatting now, but I currently live in Toronto.
I was in the Bay Area, which is where Bryce and I met for just over a year,
but I grew up in British Columbia in the middle of nowhere and then
studied in Vancouver, which is, yeah.
What was the name of your town?
You once told me it was the equivalent of the New Jersey of Canada.
It was like the armpit of Canada.
So the armpit of Canada, that's a joke that refers to Hamilton, which is in Ontario.
Which is where you currently live.
No, well, yeah, I live in Ontario, but not in Hamilton.
Hamilton has got its redeeming qualities, though.
It's also known as the waterfall capital of Canada because it's got a lot of waterfalls.
But my hometown is, I'm sure most of the listeners have no idea where this is,
but it's a small city called Prince George, which, yeah, it's a city of trees and trucks. A lot of trees and a lot of trucks.
Yeah. So I think today we were supposed to talk about our favorite data structures, right?
Yes. The follow-up, we ended up having to do two episodes on our favorite algorithms because apparently we can't stay on topic and we start topics and don't finish them. Um, but the follow-up to that is what is each of our respective favorite data structures.
I have a feeling that this is going to end up taking two episodes too, potentially. Uh, who's going first? Are you going first?
Well, so... I think first we should talk about the word data structure.
So a couple months ago, we were creating this new library at NVIDIA, which is now called cuCollections.
And we were trying to figure out what to name this library.
It was a library of containers or data structures or collections, whatever your term is for it. Um, and we wanted to pick a name that would be a good name, and we couldn't call it cuContainers because that would be too confusing and ambiguous with, um, you know, like Docker and Kubernetes, et cetera. Like there's already an NVIDIA Docker project, which is, you know, not that far away from, sorry, there's already an NVIDIA container project, which is not that far away from a cuCollections or cuContainers project.
And so we wanted something that was, like, you know, not going to be ambiguous with containerization.
And so we asked you to do some research on what languages call data structures.
And you came up, you looked at, I think, 12 or 15 languages.
And it seems like most languages call them collections, so like Java, Python, Kotlin. Only C++ and OCaml call them containers, and only Racket and Clojure call them data structures. And I'm sure there's other languages that weren't in the set that you researched, but it seems like collections is the most popular term, and that's why we called our library cuCollections.
But to me, there's always been a distinction between these terms, that they're not just the same thing.
Like mentally my model is that data structure refers
to a particular implementation strategy,
like a particular way of implementing some sort of container.
Whereas container refers to an abstract interface that has,
you know, certain guarantees. And maybe this is a very C++-y way of looking at things. But
like as an example, I think of std::map as being a container. It's like that defines an interface for something.
Whereas the way that you might implement
a std::map is with a red-black tree,
which is a particular data structure.
And this mindset, I think, is more useful if you consider things like the container adapters in C++, like std::stack, where you can plug in multiple different template parameters to change how it's implemented.
And then it exposes out this container interface. So that's always been the distinction for me,
that a container is about the interface it exposes,
whereas data structure is the term for the actual implementation of it.
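A minimal sketch of the container-adapter idea being described here (standard C++ only; the element type is just illustrative). The same std::stack interface can be backed by different underlying containers:

#include <deque>
#include <list>
#include <stack>
#include <vector>

int main() {
    // std::stack is a container adapter: one interface, pluggable implementations.
    std::stack<int> on_deque;                       // defaults to std::deque
    std::stack<int, std::vector<int>> on_vector;    // contiguous storage underneath
    std::stack<int, std::list<int>> on_list;        // node-based storage underneath

    on_deque.push(1);
    on_vector.push(2);
    on_list.push(3);
    return 0;
}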
And collections is just never a term that I've ever used,
so I've never really thought about it.
But I guess that term seems to me like it's more closely,
it just seems like an analogy for containers,
not an analogy for the meaning of data structure
that I just defined.
Yeah, I think that's,
I did not until now have that mental model,
but that makes complete sense to me.
And for those of our listeners
that aren't C++ developers,
the red-black-tree data structure that Bryce mentioned
is exactly, I think, how most of the standard libraries
implement std::map.
And if you have done any C++ development, you might have seen, when you run into template errors every once in a while, there's like underscore, underscore, Rb, colon, colon, and the Rb in that case stands for the red-black tree.
But it doesn't need to be implemented that way, so that I think makes complete sense.
Yeah. Can you think of other ways that you can implement std::map?
I know that most of the standard libraries do it with a red-black tree,
but I think there are other ways that you could do it, right?
I think any kind of self-balancing tree that keeps the elements sorted, right?
There's a bunch of them, but I...
Yeah, I think that's right.
I never took that course in school.
I only have a math degree, and as we've established in previous episodes, I was a very bad student,
so...
But yeah, I think any data structure that is sort of self-balancing, or doesn't even need to be self-balancing, as long as the elements stay sorted.
And I guess std::map does make guarantees about the time complexity of its operations.
So I guess you can't actually, if you want to be standard compliant, you can't just use any data structure.
You have to use ones that adhere to the complexity guarantees that the standard specifies.
But other than that, yeah, there's a bunch of options, I'm sure.
Well, and that gets into an interesting notion from C++ world, which is this idea that a concept or an interface isn't just about syntax,
but it's also about semantics.
You could have something that provides all of the operations that std::map does and even does the same thing in all the cases as what std::map would do, but if it has a different complexity for some of the operations, we might say that it doesn't satisfy the concepts that std::map satisfies, because those concepts have certain complexity requirements associated with them, because concepts have semantic requirements as well.
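A small illustration of that point, as a sketch using a made-up C++20 concept (MapLike is not a standard concept): the compiler can check the syntax, but the complexity and semantic guarantees cannot be expressed in code.

#include <concepts>
#include <cstddef>

// A concept can only describe syntax: it cannot say "find is O(log n)" or
// "iteration visits keys in sorted order". Those semantic and complexity
// requirements live in documentation, not in the requires-expression.
template <typename M, typename K>
concept MapLike = requires(M m, K k) {
    typename M::iterator;
    { m.find(k) } -> std::same_as<typename M::iterator>;
    { m.size() } -> std::convertible_to<std::size_t>;
    { m.count(k) } -> std::convertible_to<std::size_t>;
};

// std::map<int, int> satisfies MapLike<..., int>, but so could a type with the
// same member functions and completely different complexity guarantees.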
Yeah.
And while you were saying that, I actually did think of,
so I'm not actually sure if you can swap out this data structure
for the Red Black Tree that's currently used.
But even if it doesn't work, it relates to a great talk that I saw.
I believe it was at the 2019 Meeting C++, or sorry, the 2019 code::dive conference in Wroclaw, Poland.
I could get this wrong, but I'm almost positive it was Björn Fahller that gave a talk on sort of things he wished he had known. He takes a single example that is really inefficient and progressively, like, modifies things about it and switches the underlying data structure that he's using. And ultimately, he ends up with something that's implementing a B-tree, which is, I think, similar to a red-black tree, the difference being that it takes advantage of, you know, adjacent elements stored in memory so that you can have better cache coherency.
And it's a fantastic talk.
So even if you can't use a B-tree, I'll put it in the show notes and everyone should go
watch it, even if you're not a C++ dev.
Cool.
Okay, so back to our question.
What is your favorite data structure?
Am I going first?
Yeah, you're going first. A quote came to mind by a prolific individual from sort of the history of computer science, Alan J. Perlis.
For those of you that don't know, he was the first recipient of the Turing Award in 1966.
And he's famous for sort of a number of reasons, but one of them that a lot of people know is that he has this sort of collection of quotes called
epigrams on programming.
And one of those quotes is the following.
It's better to have 100 functions operate on one data structure than 10 functions operate
on 10 data structures.
And this quote is actually what I care more about than the particular data structure.
And so this leads me to sort of my favorite data structure in C++, or I guess in this case it's more of a container, not a data structure, but it's std::vector. Partially because, you know, of what Chandler said, you know, always reach for std::vector; basically half the time it's going to be more efficient than whatever data structure you're reaching for, unless it's like a niche case. And I think Tony Van Eerd, another individual in the C++ community, he also said, like, anytime you are not using a std::vector, you should add a comment and say why, because std::vector is so versatile. But yeah, it's my favorite data structure, mostly just because it's got, you know, constant time complexity for most things.
And then insertion is amortized constant,
which is fantastic.
And when you combine it with that quote,
it's even more than just like having 100 functions,
aka our STL algorithms that can work on it.
We almost have in C++ like 100 functions slash algorithms that operate on 10 data structures. So like in the quote, it's 100 on one or 10 on 10, and this is sort of implicitly referring to something known as the M times N versus the M plus N problem, where, due to the fact that we have iterators as this sort of layer between our algorithms and our data structures, they're generic, and so a lot of the algorithms work on different containers, or data structures if you will, because of the iterators.
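A minimal sketch of that M plus N idea (standard containers and algorithms only): one algorithm, written once against iterators, works on several different containers.

#include <algorithm>
#include <deque>
#include <list>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2};
    std::list<int>   l{3, 1, 2};
    std::deque<int>  d{3, 1, 2};

    // std::find is written once against iterators, but works over any of these.
    auto in_v = std::find(v.begin(), v.end(), 2);
    auto in_l = std::find(l.begin(), l.end(), 2);
    auto in_d = std::find(d.begin(), d.end(), 2);

    (void)in_v; (void)in_l; (void)in_d;
    return 0;
}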
So we almost have like 100 operating on 10, and of those 10, std::vector is the one I reach for the most.
Yeah. And to explain that a little bit more,
one of the reasons why std::vector is almost always the right choice is because of performance.
A std::vector gives you a contiguous block of memory. And in particular on, you know, sort of our modern contemporary hardware, that locality is almost always going to beat out any theoretical advantages that another data structure might have.
You know, there are weaknesses to std::vector. You know, it has to, every now and then,
it has to go and do these reallocations to grow. You have to have a good growing algorithm. It can
sometimes, because of that growing algorithm, it sometimes can be memory inefficient.
And like doing insertion in the middle is obviously, you know, not a very cheap operation,
but you might go and look at using a different data structure that might be better at some of
these things, like std::list, which is a doubly linked list. You might think, okay, well,
you know, I know I need to do a bunch of insertions in the middle or insertions at the front.
So I'll use list in
this algorithm instead of vector because I'm anticipating that. But with a doubly linked list,
you're doing pointer chasing, right? All the memory is disjoint. And so because of that,
you're paying this regular penalty of you're not walking through memory contiguously,
which is what modern hardware is meant to do. And so, yes, you might get this benefit with the
doubly linked list when you're, you know, inserting an element in the middle, but you're paying this
penalty in every memory access and in the memory access pattern itself. And so that often,
that overhead of that less efficient memory access pattern often makes it a wash or makes
using the list slower. So sort of unintuitively, there are some cases where, even for the things that a vector is bad at, it would be better than the competition
solely because it has this nice property that it's contiguous. And it being contiguous means
it's easier for it to be cached and it's easier for it to be prefetched. If you look at modern
CPUs, they have special hardware that's designed for streaming through memory. That if you're in a loop
and you're accessing an array of memory contiguously,
there's like special hardware
that's designed to figure out,
hey, we're in a loop
and we're sequentially accessing
this contiguous range of memory.
So like we should prefetch and cache accordingly.
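A rough sketch of the trade-off being described here (timings vary by machine; this is illustrative, not a rigorous benchmark): summing a std::vector walks one contiguous block, while summing a std::list chases a pointer per element.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<int> v(n, 1);   // one contiguous block of n ints
    std::list<int>   l(n, 1);   // one separately allocated node per element

    auto time_sum = [](auto const& c, char const* name) {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: sum=%lld, %lld us\n", name, sum, (long long)us);
    };

    time_sum(v, "vector");  // contiguous: friendly to caches and prefetchers
    time_sum(l, "list");    // pointer chasing: each step may be a cache miss
    return 0;
}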
And so, yeah, so that's why oftentimes
the sort of naive data structure of vector is the
right answer as opposed to something more complex. And I really like the first thing you mentioned, this idea that, you know, sort of fewer data structures might be better.
That oftentimes you want more algorithms to operate on fewer data structures.
I think a lot of times people question why doesn't C++'s standard library have
a richer set of data structures like some other languages. And I think it's largely about that
philosophy, that this idea that you don't necessarily need more complex data structures
and that those tend to be very specialized for particular use cases. Instead, what you need is
a core set of really solid data structures and then a set of composable algorithms that build
on top of them. One of the key things about the design of the C++ standard library is separation of concerns.
For data structures in particular, with the exception of std string, we don't have these big classes that have a billion different algorithms as member functions.
Instead, we have fairly simple data structures that expose abstractions like iterators and ranges.
And then we have a separate set of algorithms that consume iterators and ranges. Yeah, so there's a bunch of things I want
to comment on. So the first one where you were talking about pointer chasing, that ties in
perfectly to the talk that I mentioned by Björn Fahller, in that one of the very first things in his talk that he mentions is that he didn't really have a notion of, you know, the difference between data structures that do pointer chasing and, like, ones that have, you know, contiguous memory and ones that don't. And he said that like a good rule of thumb,
you know, this isn't always true, but just like his mental model that he has is whenever you are
chasing a pointer, you can basically consider that a cache miss, which
exactly ties into what you're thinking.
It leads to this counterintuitive, well, like, oh, I'm reaching for a data structure that
has constant insertion time, which brings me to the second thing.
I said a std::vector has amortized constant insertion.
I should have been more specific in saying that.
Specifically, that's referring to push_back, the method push_back that inserts at the end of the array.
An insertion into the middle is going to have linear time complexity.
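A small sketch of that distinction (standard C++ only; sizes are just illustrative): push_back at the end is amortized constant, while insert in the middle has to shift every later element.

#include <cstddef>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);   // amortized O(1): occasional reallocate-and-move, cheap on average

    // O(n): every element after the insertion point is shifted over by one.
    v.insert(v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2), -1);
    return 0;
}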
But that ties in exactly to what you said, that it leads to this counterintuitive, well, I'm reaching for a data structure that supposedly has constant insertion versus linear insertion for a std::vector. But due to the cache misses, it's going to erode some of the, you know, quote unquote, complexity gain that you're getting from switching to a data structure that doesn't have, you know, contiguous memory like a std::vector.
And yeah, well, and, you know, part of that is, I think, a lot of people and a lot of traditional literature spend a lot of time looking at time complexity of algorithms, right? Whereas in practice, you do need to think about what the constants are.
You know, okay, this thing has O(1) insertion, but what is the constant? You know, like, what is the actual constant latency?
Or, okay, this thing has linear, you know, insertion in the middle, but if I only have
100 elements, how expensive actually is that?
And when you start looking at those constants, that's when you see, oh, I only have 100 elements.
Insertion in the middle is just going to be cheap for me if this thing is always going
to be like that size.
So, whereas using the list, all of the operations are going to have this greater overhead, so I may as well just use the vector. And like, sure, it's linear insertion here, but in some cases that's fine.
Yeah, that's a fantastic point, and it applies not just to, um, data structures, it also applies to algorithms. Like, I think it was a couple months ago, a coworker came and asked me, I'm trying to do a binary search, but I need to look at two elements at a time.
My immediate thought was adjacent_find, but that's not a binary search.
And then I followed up and I said, the binary search equivalent of that is partition_point. But then my next question was, do we really need a binary search here? Like,
do you know, on average, how many elements are going to be in the list or vector that you're searching over? Because, you have to profile, but I've heard a rule of thumb that if you have less than 80 elements, a linear time algorithm is going to be faster than a binary search algorithm.
I don't know where I heard that or if that's true.
You know, profile, you know, before you're making these claims.
But his response was, oh, it's less than 10.
In which case I said, yeah, like we don't need a binary search for that.
Just use an adjacent_find and call it a day.
So, yeah, it's like, like you said, you know, in academia, you always drop the constants or the coefficients.
But in the real world, those constants matter.
And profiling is very important.
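A sketch of the two approaches from this exchange (standard algorithms; the data is illustrative, and the right choice depends on profiling): adjacent_find does a linear scan over neighbouring pairs, partition_point does a binary search for where a predicate flips.

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> sorted{1, 2, 4, 4, 7, 9};

    // Linear scan over neighbouring pairs: for ~10 elements this is typically
    // as fast as (or faster than) anything cleverer.
    auto dup = std::adjacent_find(sorted.begin(), sorted.end());

    // Binary search: first element for which the predicate stops holding
    // (here, the first element >= 5), assuming the range is partitioned by the
    // predicate, which a sorted range is.
    auto first_ge_5 = std::partition_point(sorted.begin(), sorted.end(),
                                           [](int x) { return x < 5; });

    (void)dup; (void)first_ge_5;
    return 0;
}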
So it's funny that you said std::vector, because when we talked about algorithms, we both thought that both of our answers were going to be transform_reduce.
But then it turned out that, in fact, that wasn't both of our answers for what our favorite algorithms were.
And for this, I sort of figured that both of our initial gut reactions would be that std::vector, or a dynamic array, would be our first choice.
And it was for you.
But I actually am going to argue that it's not mine, and that a dynamic array is often too heavyweight.
And I actually think that you're going to agree with me.
I don't think that a dynamic array like std::vector is your favorite data structure.
I think you've just been tricked into thinking that, because the actual container that you need the most frequently doesn't exist in an easily accessible form in C++.
So let's talk about how we use a dynamic array.
So one of the key properties of a dynamic array like std vector
is that it's growable. That is that after I've created it, I need to add more elements to it.
But I think in a lot of cases, we know ahead of time how many elements we're going to put into that vector. And there is a pattern
that is very common that I see a lot. I call it like the default construct and reserve pattern
when you're using a vector where you will default construct the vector,
and then you'll immediately call reserve on it with the capacity that you need.
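A minimal sketch of that default-construct-and-reserve pattern (standard C++ only; the function and its contents are just illustrative): reserve sets capacity without creating elements, so the later push_backs never trigger a reallocation.

#include <cstddef>
#include <vector>

std::vector<double> make_values(std::size_t n) {
    std::vector<double> v;   // default construct: size 0, capacity 0
    v.reserve(n);            // one allocation; size stays 0, no elements are created
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<double>(i));   // no reallocation until size would exceed n
    return v;
}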
And for those who may not be aware, when we talk about capacity,
that's different from size.
So capacity is how many slots for elements does
the vector or the dynamic array have. And size is how many slots are currently in use. And every time the size needs to grow beyond the capacity, you have to reallocate the entire vector's storage and move everything into the new storage. So typically capacity is grown
with, you know, some internal algorithm, like you double capacity every time so that you don't have
to constantly be reallocating. But so let me just ask you, Conor, how many times have you constructed a std::vector and then, like, either immediately called reserve or immediately called resize, and then never increased the capacity after that?
Uh, probably more times than I can count. Although I will say, as a small caveat to this, I've discovered a new pattern using Thrust fancy iterators, which also exist in Boost as Boost iterators, where you can set up, like, transform iterators and then just call the range constructor and avoid the resize or the reserve, which I think is a super nice pattern.
Because then you can also declare your data structure container const by using those fancy iterators.
But that's only been recent in the last couple of months that I've discovered that pattern.
And before that, yeah, probably more times than I can count.
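A sketch of that pattern using standard C++20 ranges instead of Thrust's or Boost's fancy iterators (same idea, different library): the vector is built in one shot from a transforming iterator pair, so it can be declared const and no reserve or resize is needed.

#include <ranges>
#include <vector>

int main() {
    auto squares = std::views::iota(0, 10)
                 | std::views::transform([](int i) { return i * i; });

    // Range constructor: sized and filled in one pass, and usable as const.
    const std::vector<int> v(squares.begin(), squares.end());
    return 0;
}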
Yeah.
And so this sort of gets to one of the defects that std::vector has in C++, which is there is not a good way. One, there's not a simple way with std::vector to reserve on construction.
There is no constructor that lets you reserve capacity without also allocating elements.
So you can construct a vector with an initial set of elements, but then you have to pay to
initialize those elements. So like if I'm creating a one gigabyte vector of doubles, I could just call the vector constructor that says I want n elements that have this initial value.
But then I'd have to pay the cost of initializing that whole gigabyte of memory with this initial
value when all I really wanted to
do was just reserve that storage, and then later I was going to push back into the vector.
And that's why you have this common pattern of default construct it, and then you call reserve.
And there's also the case where you want to get uninitialized elements.
And there's actually a C++ committee proposal regarding this.
For example, let's say you've got like a std::string and you want to call a C library function that's going to initialize the contents of that std::string.
Well, there's not really a good
way to do that. The only thing you can do is you can resize the string, which then zero initializes
all the characters. And then you can call that library function or you can copy into it from
some external API. And that's a little bit unfortunate.
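A sketch of that awkward pattern (the fill_buffer function here is a hypothetical stand-in for a C API; non-const data() requires C++17): resize value-initializes every character right before they are all overwritten, which is the redundant work facilities like P1072 aim to avoid.

#include <cstddef>
#include <string>

// Hypothetical stand-in for a C library function that writes exactly len bytes.
void fill_buffer(char* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        buf[i] = 'x';
}

std::string read_into_string(std::size_t len) {
    std::string s;
    s.resize(len);               // pays to zero-initialize len characters...
    fill_buffer(s.data(), len);  // ...which are then immediately overwritten
    return s;
}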
And so there's this proposal P1072
that tries to address this
originally for std::vector and std::string, but now just for std::string.
But it is a slightly hazardous pattern
where you want to say,
oh yeah, I want to like add elements to this container, but I don't want them to be initialized.
Like that's how you get, that's how you lead yourself towards, you know, memory safety violations and all sorts of unpleasant, uninitialized variable hazards.
But if you really care about performance, it can matter. So I'm going to argue
that dynamic array is, in many cases, more than we need because in many cases, we know what the
capacity is going to need to be up front and it's never going to change. And so what we really need is we need a container that's
just a C++ equivalent of, you know, a C style array where I want to allocate it, I want to give
it a size, and I want to have control over whether and how those elements get initialized.
Now there are a few options in the C++ standard library if you want this data structure.
So if you know the size at compile time, you can use std::array, which is a compile-time-sized array with more or less the same interface as vector.
Of course, it can't be resized, etc. But I think the more common
pattern is that I just want a runtime-sized array. And there's no specific class for this in C++,
but you can express this with unique_ptr's array specialization. Are you familiar with unique_ptr's array specialization?
Walk us through it. So unique_ptr in C++ is an RAII type that owns some pointer that you, the programmer, have allocated, and it has a customizable deleter parameter which is used to clean up
this pointer that you've given it when the unique pointer is destroyed.
And it's called unique pointer because it is uniquely owning.
Only that unique pointer can have access to it.
You can't copy the unique pointer.
You can move the unique pointer, though.
And so this gives us some guarantees about, you know,
who has access to this pointer.
And in particular, because it's uniquely owned,
you don't have to worry about who's going to clean it up.
The answer is always going to be that there's just one instance of this unique_ptr.
And that's the one that's going to clean it up.
So you don't have to do reference counting or anything like that.
But unique_ptr is usually used for managing the lifetime of, like, a single pointer. But there's this specialization where, if you write like unique_ptr, angle brackets, and then like int with empty square brackets, so unique_ptr<int[]>, that will give you this specialization that is intended for arrays. And it will have an array access operator.
And by default, it will use an array delete.
And so this is a pretty nice way to proxy or to implement just a container that is just,
I want to allocate this array that has n elements
and that number of elements is never going to change.
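A minimal sketch of that array specialization (standard C++14 and later; the sizes and values are illustrative):

#include <cstddef>
#include <memory>

int main() {
    const std::size_t n = 1000;

    // Runtime-sized array, freed with delete[] automatically when `a` goes out
    // of scope. make_unique value-initializes (zeroes) the elements.
    std::unique_ptr<double[]> a = std::make_unique<double[]>(n);
    a[0] = 3.14;   // operator[] comes from the array specialization

    // For full control over allocation and initialization, hand unique_ptr a
    // pointer you allocated yourself (default-initialized here, i.e. not
    // zeroed). C++20 also adds std::make_unique_for_overwrite for this.
    std::unique_ptr<double[]> b(new double[n]);
    b[0] = 2.71;
    return 0;
}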
And it's also useful because you fully control the allocation of these elements.
All that you do is you give a pointer to unique pointer.
It doesn't create the pointer itself.
So you can control how you allocate that memory, how you initialize it.
So you don't have to worry about this case of, oh, well, I want to use a std::string, but I want to control how these elements get initialized and created.
With unique_ptr, you have full control.
All that it provides you is the safety of the resource acquisition is initialization
pattern, where when the unique pointer goes out of scope, you know that it's going to clean up
the memory using the deleter. So that's, I think, something that I reach for very frequently.
And there's a related facility called allocate_unique that's been proposed for the past two C++ standards but hasn't been very high priority. Um, and that just gives you a convenience for saying, hey, I have this C++ allocator and I want to create a unique_ptr that uses this allocator to allocate and deallocate the memory.
Interesting.
If we want to point our listeners at more resources on this pattern,
what would you recommend?
So we'll put a link in the show notes to the allocate_unique paper.
In our Thrust library, we have an implementation of allocate_unique, and then we have a few different overloads.
We have one for, you know, arrays, and I use it all the time.
Like, that's my go-to thing.
The paper is P0316. And it doesn't seem like it's a particularly important thing. But I think if you start thinking about all the places where you can use a unique_ptr, the array version, instead of a std::vector, you'd be surprised how commonly this comes up.
Yeah, it sounds like a common pattern
that potentially a lot of people just aren't reaching for
because it's not, you know,
stated right next to std::array and std::vector
in sort of the containers, docs, slash blogs that are written.
Well, and it's not perfect
because like it doesn't give you iterators, et cetera.
Like we could build a version of this.
And at one point in C++'s history, we sort of did.
There was this thing called DynArray that we tried to do,
but it ended poorly.
That was before my time.
I don't remember all of the details,
but we could do a better job here.
Yeah, it sounds like an interesting direction
for the future of C++ once we get everything that's supposed to go into 23 and the bleed over into 26.
That, you know, this could be a future proposal for C++.
Yeah. What's the thing that you're most excited about for C++23?
I don't know what will and will not be in C++23, but assuming that everything that's being quote-unquote targeted by individuals for C++23 makes it in, I'm most excited about seeing the extension to the ranges stuff.
On top of that, I think pattern matching is going to be awesome.
Executors, you know, everything that's going in, honestly,
like library support for coroutines,
I think that's going to open up the world of coroutines to a lot more people.
But yeah, at the top of the list, it's definitely the ranges stuff.
How about yourself?
That's hard to say. I'd probably say executors, but I also think that things like allocate_unique and any_invocable, which is a better version of std::function, I think those are going to have a lot of impact on people's day-to-day lives. So I think that those will be, uh, those will be big wins too.
All righty, I think we're once again past our 30-minute, uh, target time, but, uh, it's all right.
Is that our target time?
Yeah, well, you know, 20, 25, 30 minutes.
It's never a hard limit, though.
It's our podcast, we get to do what we want.
And I think we're going to have to do another one next week, because there's a whole other set of data structures that at least I want to talk about. And yeah, so maybe we'll do part two next week.
Sounds like a plan.
All right.
Awesome.
Thanks, everyone, for listening,
and we'll catch you guys in the next one.