Algorithms + Data Structures = Programs - Episode 3: Our Favorite Data Structures - Part II

Episode Date: December 11, 2020

In this episode, Bryce and Conor talk about Scrabble, DAWGs and TSTs.Date Recorded: 2020-12-06Date Released: 2020-12-111988 Scrabble [DAWG] Paper - The Worlds's Fastest Scrabble ProgramTI-BASICAn...drew AppelSML/NJ (Standard ML of New Jersey)SML (Standard ML)ML (Meta Language)Ternary Search TreeBoost SpiritYacc and Bison parser generatorsIntro Song InfoMiss You by Sarah Jansen https://soundcloud.com/sarahjansenmusicCreative Commons — Attribution 3.0 Unported — CC BY 3.0Free Download / Stream: http://bit.ly/l-miss-youMusic promoted by Audio Library https://youtu.be/iYYxnasvfx8

Transcript
Discussion (0)
Starting point is 00:00:00 I is not less than a half, bro. Oh, this episode is such a mess. I will find a way to edit this and make it. Probably I'll cut out a huge chunk where it says, uh, and then Connor and Bryce rambled for 20 minutes being confused about how a ternary search tree effectively is constructed. And, uh, we're leaving that out and resuming at the point where they understood. Welcome to ADSP, the podcast, episode three, recorded on December 6, 2020. My name is Connor, and today with my co-host, Bryce, we're going to be talking about Scrabble,
Starting point is 00:00:59 directed acyclic word graphs, and Ternary Search Trees. So, okay, yeah. So I promised Twitter to embarrass you about Scrabble. Yeah. So we were going to do it in the intro part, but you said that it was the perfect lead-in for the esoteric data structure that you're going to talk about, because that is today's topic. Last week, we talked about our favorite data structures, but we didn't get through all of our favorite data structures. So we decided we'd continue this week. And this week, we're going to talk
Starting point is 00:01:41 about more esoteric data structures. So tell everybody how you became a Scrabble savant. Well, so to be clear, I'm sure there's several listeners that are much, much better than me at Scrabble. Connor, do you recall at CppCon 2019, I seem to recall you playing Scrabble with a group of, say, three people. Does that seem about right? It was three, including me, yes. Yes. And the deal was that the other two people got to combine their score. So each of them, to win, they just had to get more than your score.
Starting point is 00:02:24 So they each just had to get slightly above half of your score. Who won that game of Scrabble? Nope. Nobody won. It ended up being a exact tie. That's right. But that doesn't, that there's,
Starting point is 00:02:40 there's levels. There's people that play casually. Then there's people that play competitively. And then there's people that play competitively and then there's people that like compete in tournaments I'm not like tournament level but the the short version of the story is I was not allowed any computer games or console systems when I was a kid except for one computer game and that was Scrabble. And also my father's a journalist, so he very much liked word games, so we played quite a bit, and I got pretty good, and it got to the point where my dad said I could no longer use words that I couldn't define and use in a sentence,
Starting point is 00:03:21 because anyone that's played any amount of Scrabble knows that there are cute little two-letter words that are useful like Q-I and Z-A and, you know, A-A and A-I and A-E. And so slowly but surely, I started memorizing all of the definitions for these words. So Z-A, Z-A is short for pizza. Q-I is qi, which stands for Chinese energy. A-A energy AA I believe is cindery lava and then there's one other one that's like a three-toed sloth anyways I know I have all this useless information so that I could use these words and scrabble with my dad and at some point he just stopped playing with me
Starting point is 00:03:57 but the point is I'm a big Scrabble fan and it definitely ties in well to what we're talking about today. And it's interesting. I, as a kid, also did not have any consoles. I didn't even have a computer in the house until my mom married my stepdad, which must have been when I was, I don't know, maybe 11 or 12 or so before I had a computer in the house. I was sort of computer savvy before then. But you talk to a lot of people who are programmers now and they were like, you know, a lot of them were programming from a very young age.
Starting point is 00:04:39 And I was also deprived of consoles. I never had like a console in my house. So we're both late bloomers for tech, I guess you could say. Yep. And coincidentally, if I'm not mistaken, we actually both started programming in the exact same, quote, esoteric programming language, which is in high school in TI Basic on our graphing calculators. Yeah, it is really fascinating to me how prevalent TI Basic was as a first language for programmers of our generation. Yeah. So whichever engineers at TI designed TI ti basic kudos to you yeah it was my even though it
Starting point is 00:05:29 was a waste of time that was like my favorite thing to do in math class because they would they would clear your stored programs uh before exams but then if you like knew how to program it it took like you know 60 seconds to write the quadratic formula program and then like even if it was only one question i would do it just to like spite the teachers. I would write the program and then use it during the exam. So in all the games that I like played or made, there was an emulation mode where you could press a button while you were playing the game on the calculator. And then it would, it would pretend to be like in regular calculator mode. So if your teacher came over and like thought you were like playing a game, you just hit the button
Starting point is 00:06:09 and then it would just be like, they wouldn't be able to tell. And some of the emulation modes were pretty impressive. That is fantastic. I did not know about that. Yeah. But we should get back to Scrabble and your favorite data structure. So yeah, I should have sent you the link to this. In fact, I will quickly do that. How should I send it to you? I'll DM it to you on Twitter. But so at some point, I think in university, at some point in university, I wanted to implement a lightweight Scrabble game. And sure enough, I ended up doing some research, and I stumbled across this paper that was written, I believe, in 1988 by two individuals, Andrew Appel and Guy Jacobson. At the time, I had no idea either of
Starting point is 00:07:07 these names, but in the last year or so, I actually think I became aware of Andrew Appel's series of books on compilers. So this is sort of a tangent, but he's also was a prominent individual on developing the SML New Jersey language. So for those of you not familiar, SML New Jersey is, I think, an implementation of SML, which stands for standard ML, which is a dialect, a modern dialect, or at least was at the time of ML, which is a programming language that stands for meta-language. Anyways, at the time, I had no idea who Andrew Appel is, but cool to sort of learn, however many years later that is, that he's a notable figure in sort of compilers and programming languages.
Starting point is 00:07:49 But this paper introduces a variant of a, I'm not sure how popular it is, a data structure called a trie. I believe that's how I pronounce it, but I've heard some people call it a tree, which is confusing because that's also a sort of other node-based data structure, but it's spelled T-R-I-E to disambiguate. And the variant of it is called a DAWG, which is D-A-W-G. So it's W added to another acronym for a data structure called the DAG, which DAG, I'm sure some of you are familiar with, is a directed acyclic graph. And the W that's added to the DAG stands for word. So the variant of the TRI is what is known as a directed acyclic word graph. So I'm not sure we should pause here to, I guess, first explain what a trie is.
Starting point is 00:08:48 Do you want to explain what a trie is or should I explain what a trie is? I think you should explain what a trie is. All right. So basically, and I should summarize first, the point of a dog is to be able to efficiently look up words that are in a dictionary, which is why it's stated in the Scrabble paper, but also to very efficiently store these words. So naively, if you want a dictionary of words,
Starting point is 00:09:14 you could just create a hash set of strings and store each word that's in your dictionary. I believe in the Scrabble Dictionary 5th Edition, there's like 178,000 words or something that vary from 2 characters to 15 characters. I believe in the Scrabble dictionary fifth edition there's like a hundred seventy eight thousand words or something That vary from two characters to fifteen characters. That's the maximum Length word that you can play on a Scrabble board And so you can imagine on average that's you know seven characters time one hundred seventy eight thousand. That's quite a bit of space
Starting point is 00:09:40 and there's a data structure called a try that basically takes advantage of repeated suffixes in words so that you're not storing sort of duplicate information. So in the paper, they have the following words. Car, cars, cat, cats, do, dog, dogs, done, ear, ears, eat, and eats. Which is, I don't know, roughly like 12 words. This is on page three. For any of you, this will be in the show note links and Bryce can follow along. But so the point is, is that like car and cat and cars and cats, the four of those words, they share the same first two characters, C and A.
Starting point is 00:10:21 So basically you can visualize a tree where at the top of the tree are the first letters of any given word, so A through Z. And then for storing the word car, you basically have like a node-based structure where at every node, a single letter is stored. So you start at C, then the next node points to an A, and then the next node points to an R. And then you sort of indicate that there's a word ending at the sequence of C-A-R. And then when you want to insert cars with an S into your try, you basically just add another node pointing from your R to an S and indicate that there's a word ending there,
Starting point is 00:10:59 and all that cost you was an extra character on top of car. It didn't cost you four new characters. And the same thing with cat, you can just add a T onto the A from the CA of car, and once again an S onto that. So that's the basic idea of a try. You have this sort of node-based tree where every node represents a character, and you start at the top, work your way down, and any time you get to a node that has sort of a flag that indicates a word ends here, you know that from the characters that you traverse node to node, that sets up sort of the word. The difference between a trie and a directed acyclic word graph is that you're not only
Starting point is 00:11:37 sharing the suffixes, the common suffixes, but also the common, or I've been saying this wrong, sorry, prefixes is what we were sharing to begin with. Suffixes are at the end. So a trie combines common prefixes. A directed acyclic word graph combines both common prefixes and common suffixes. And this is a little bit hard to visualize, like at least for me in my head, but if you look at the paper, it makes it super, super clear that for the try, you basically just have this sort of top-down, what looks like any sort of tree that you might imagine. It's quite easy in this case because there's only 12 words, but you can imagine for a full-blown dictionary, each node is going to roughly point to, you know, 15 to 26 other letters. But for a dog, you could imagine words that end in the same suffix. So for instance, like graduation, commotion, election, they all end in T-I-O-N. The same way that you're sharing nodes
Starting point is 00:12:37 for your prefixes in a try, you share nodes for your suffixes. And this leads to a huge savings in space. In terms of performance, it's about the same as the try, but in terms of space, it's a lot more compact, which is why I think this paper is noting it as an interesting data structure to use. Well, and being space efficient can also mean that you are more compute efficient because the smaller data structure is going to be more cache friendly. Right. Yeah. Yeah. So it's interesting that you picked a dog because the esoteric data structure I want to talk about is of that same family. It's called a ternary search tree. This is something I first came across when I was working on Boost Spirit
Starting point is 00:13:36 towards the start of my career. For those who don't know, Boost Spirit is a C++ parsing framework. It's an alternative to something like YAC and Bison, which are two common parser frameworks for C or C++. But the advantage of Boost Spirit is that it does not require some external tool to generate your parser. It's all done in C++ using a domain-specific embedded language and C++ expression templates. And so one of the things that we have as part of Boost Spirit is there's this primitive that's called symbols. And that's for storing a set of keywords or something like that, some set of known strings that you want to match against. And we use a ternary search tree to store these symbols tables. And a ternary search tree,
Starting point is 00:14:55 it's a special type of try. So each node has a value. In our case, the value is just going to be a character. And then three children. And one of those children is the low child. And you go down that branch if the value that you're searching for is less than the node's value. Then you have an equal child. And you go down that branch if the value that you're searching for is equal to the node's value. Then you have an equal child and you go down that branch if the value that you're searching
Starting point is 00:15:27 for is equal to the nodes value. And then you have a greater child and you go down that one if the value you're searching for is greater than the nodes value. And so because it's a try, it's very nice and space efficient. For search, insertion, and deletion, it's, I think, O log N average and O N worst. And you might wonder, well, shouldn't we just use a hash map for something like this? And in a lot of cases, a hash map might be faster, but in this particular case of Boost Spirits use case, a ternary search tree, or a TST as they're abbreviated to, is a little bit of a better fit. So one of the reasons is because a TST search can be done incrementally. We can take some sequence of inputs and sort of incrementally search down the tree for them.
Starting point is 00:16:25 And this is a really good fit when we're working with iterators, which is what Boost Spirit deals in for input. Another nice property of TSTs is that mismatches are discovered earlier. So in a hash table, a mismatch or a failure to find something, you know, hey, I'm looking in this hash table to see if this symbol is in there. Those can be quite expensive in hash tables. Whereas with a TST, you can get a false answer very quickly. Imagine that you had a symbols table where you had no symbols that started with the letter X. So then if you've got some input that starts with the letter X, you immediately know on the first node that you search
Starting point is 00:17:23 that you're done. Because you go to the first node, you see that it's not an X, you go to whichever is the applicable child, and you see that it's a null node. And you immediately know, hey, there is no symbols in this symbol table that start with letter X. So I know that there's no matches matches and I should go on to try to match the next component in the parser. And TSTs are also nice because they can provide partial results and nearest neighbors. And so that's useful if, for example, you want to provide some diagnostic if there's a typo. You know, if somebody misspelled a keyword by one letter, a TST can give you some close matches. And TSTs also, you know, they use less memory, so they're more space efficient. And they can also be more compute efficient because a compact data structure is going to be more cache friendly. And you can use TSTs, you see them in a lot of places.
Starting point is 00:18:36 They're used for things like spell check or auto completion. They're used a lot in IDEs and text editors and in parser frameworks like Boost Spirit. So in the case of Boost Spirit, you said that each of the characters in your TST sort of is one character of a symbol. What's an example of the set of symbols that are being stored in this, if I understood correctly? What what's an example? Is it like keywords and identifiers from like a programming language or some DSL or something like that? Yeah, it would be something like keywords or identifiers or, you know, it might be something that you use for –
Starting point is 00:19:21 so normally the simple table is used to map. It's not necessarily something that you use for keywords, actually. You might, but a better example might be something like built-in functions. Because the symbol table will map some strings to some values. So, like, let's say that you had some little language that had some built-in, like, names for different colors. Like, it was some graphics programming language. You might have a symbol table that maps all those names for colors to an object that represents that color. Or if you have a set of built-in functions, like, you know, a set of built-in math functions, you might have a symbol table that maps all the names of those math functions to, you know, a function object that
Starting point is 00:20:20 implements those semantics. Although in that case you might do it slightly differently in Boost Spirit. But it's generally used for any time where you've got some sort of enumeration or some large set of things that you want to map to some value. Whereas something like a keyword you probably would handle a little bit differently because each keyword may appear in a different context in a programming language, whereas a symbol table in Boost Spirit is something that you're going to use in a place where,
Starting point is 00:20:52 hey, any one of these sets of things could appear. Interesting. Have you ever used Boost Spirit or you've just worked on it from an implementer side of things? Oh, no, yeah, I've used it a lot. You know, we used it in uh well when i first got started i used it to build um you know this little like toy lisp like language um we used it in hpx this parallel runtime that i worked on to implement our um like yaml metadata parser
Starting point is 00:21:20 and to implement our um command line option parsers. You know, any time that I need to do some parsing in C++ where I can't get away with regular expressions, I'd pick up something like Boost Spirit. Interesting. Is it still a recommended library these days? Oh, yeah, definitely. At least in my book. I mean, there are some people who are not a fan of it because it does rely very heavily on C++ templates and expression templates. But I think it's really amazing technology. It gives you relative to sort of pre-C++11 compile times, which could be quite painful for these large template codebases. Oh, there's another place where TSTs are quite useful, which is if you're doing something like case-insensitive parsing, they can be a really good fit for that. Like, hey, I want to match these words
Starting point is 00:22:33 regardless of whether they're capitalized or lowercase, etc. Okay, yeah, that makes sense. The one thing that I'm not entirely clear on is how exactly does a TST differ from just a regular tree? Try, I guess we're, we're pronouncing it that way. Well,
Starting point is 00:22:53 so it sounded like from your explanation that like there's only, there's always a maximum of three children per node. Right. Right. Whereas in, in a, in a try or a directed acyclic word graph, um, basically your boundary is the set of characters.
Starting point is 00:23:19 If it's just alphabetical Arabic characters, lowercase, that's 26. But if it's alphanumeric, that's going to be 36. And then if you include whatever punctuation or capitalize obviously that increases but so like typically for you know if you're implementing it for a scrabble game uh you're going to have a maximum of 26 children node because that's you don't have to deal with cases and there's no such thing as numbers whereas in your description of a tst you have like the lower bound the equal equal node child, and then the upper bound, and that's like the max. So if I understood your explanation correctly. Yeah, no, no, I think you're right. Now the question is, what are the implications of that tradeoff?
Starting point is 00:23:58 And off the top of my head, I don't know. I think we'd have to like work through it. Is it perhaps the case that one of these two is more space efficient than the other um or that searching is quicker in one or the other requires hopping through less more or less nodes unfortunately i have to interrupt episode three because a few minutes later, Bryce would say the following.
Starting point is 00:24:29 I don't know. We're going to probably have to cut this part because this is a little bit rambly. And due to that statement, it seems that the podcast god smiled down upon Bryce while he was exporting his audio and they erased the last 20 to 30 minutes of his recording. Fortunately for us, we were going to heavily edit out most of what we were discussing because
Starting point is 00:24:51 clearly at the time, we didn't understand how ternary search trees were constructed. Unfortunately, what that means is that the edited version of this conversation is just going to be about four to five minutes of my voice. So we apologize for this. Next time, we will make sure that the podcast gods do not interfere, and hopefully this will not happen again. Back to the episode. Let's go to the Wikipedia page. Which reads, in computer science, a ternary search tree is a type of trie sometimes called a prefix tree where nodes are arranged in a manner similar to a binary search tree but
Starting point is 00:25:32 with up to three children rather than the binary tree's limit of two. Like other prefix trees, a ternary search tree can be used as an associative map structure with the ability for incremental string search. However, ternary search trees are more space efficient compared to standard prefix trees at the cost of speed. Common applications for ternary search trees include spell checking and auto-completion. And the description says each node of a ternary search tree stores a single character, an object, or a pointer to an object depending on implementation,
Starting point is 00:26:11 and pointers to its three children conventionally named equal kid, low kid, and high kid, which can also be referred respectively as middle child, lower child, and higher child. If we go on, it says the low kid pointer must point to a node whose character value is less than the current node. The high kid pointer must point to a node whose character is greater than the current node. The high kid pointer must point to a node whose character is greater than the current node. The equal kid points to the next character in the word. The figure below shows a ternary search tree with the strings cute, cup, at, as, he, us, and I. Definitely every leaf node is the end, every leaf node in the TST represents the
Starting point is 00:26:55 end of a word. So the left S is for as. The P is for cup. The E is for cute. The I is for just the single letter I. The right S is for us. The E on the third level, second from the bottom, is for he. And then the only other question is, how does it know that at is a word? Because basically there's 1, 2, 3, 4, 5, 6, 7.
Starting point is 00:27:31 So there's seven words. Every single one of those words ends in like a singular leaf node, which is different from the try implementation in the directed acyclic word graph in that like there's an indicator so that if you have car and cars Car is gonna end at the R in C A R and you're just gonna have like a boolean indicator flag saying a word ends here and it's just like straight from the top Node down to that node where it indicates that there's a word and then when you add an s you're also gonna set the flag In that node to like word ending here whereas here it seems that like a word that starts with a c is going to start from the top but a word that starts
Starting point is 00:28:12 from any other letter is not going to start at that level you every other letter that's less you're going to go down to the low node any other letter that is greater than and starts at it you're going to go to the higher node. And then any other word that's building off of C is going to go to the equal node somehow. So, like, I understand all of that. The only thing I don't understand is, like, if you've got, like, chapel, so a C-H, how do you indicate that? Oh, yeah, yeah. so I think this is it anytime you take
Starting point is 00:28:49 the non-equal node, if you take the low or the high node you are not including that parent node that you came from so in the case of in the case of in the case of
Starting point is 00:29:05 cup you include the C because you're using the equal to, you include the U because you're going straight down as well but then when you get to the T if you take the low node which means you go to P, you leave out the T
Starting point is 00:29:19 the TST is I won't say infinitely more complex, but definitely one of the major differences between a TST and a dog is that a dog, when're basically guaranteed to only search the number of characters in that word to see if it exists in that dictionary at most. That's if you hit a match. If the word you're searching for does not exist in the dictionary, at some point you're going to find out, oh, this word, you know, at the fourth letter, we don't have anywhere to go. This word doesn't exist. Whereas in this TST, depending on how it's constructed, you could search, I'm not sure what the bound is, but definitely more nodes than are in the word,
Starting point is 00:30:10 the characters of the word that you're searching for. But yeah, we'll say thanks for listening. Hopefully this episode post-editing is less of a mess than it was while recording. I personally learned something as a co-host. As a listener, that's up in the air at this moment. But yeah, thanks for listening. And we'll catch everyone in the next episode, hopefully. Unless if you just uninstalled this podcast from your podcast listening app. We'll see you guys next time.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.