Transcript
Hey, JF. It's a warm and fuzzy and parsy time of year.
Hey, Chris. So today I suggest we only talk about parsers, not fuzzers. But, you know, parsley is always good on holiday food.
Sounds good. All right. Welcome, everyone, to the disclaimer.
We only know so much, but we'll try not to say things that are wrong, and we'll follow up with errata, corrections, and clarifications as we find out about them and put them on the website, tlbh.it. We do the lifelong learners thing, and we're always trying to push the boundaries of things
we know about and have conversations about what's interesting to develop and hone our
understanding.
So today, as JF said, we're going to talk about parsers.
And we're mostly going to be talking about some of our practical experience with parsers.
Addressing the parsing field as a whole is very hard because it's both
theoretically nuanced and large. One of the criticisms you often hear about the original
edition of the Dragon Book was that it was super front-loaded with a lot of theoretical parser content, which kind of tripped folks up in compilers class. A lot of that book was on syntax-directed code generation as a backend flow. So it's kind of understandable that it did a lot on parsing.
But that's often also not what a lot of people wanted to understand from compilers class.
So this is always a little bit of an interesting proposition.
So it's a little heretical, but part of what we're going to claim today
is that parsers in a practical sense can just be viewed as kind of a stylized pattern of coding
where you can easily create
your own parsing scenarios and parsing solutions and reuse things if you just accept recursive
descent style parsing as the way to do it. Right. And what's interesting is parsers really,
once you get into them, they show up all over the place, right? So people could say, like,
you can't just point at
anything and call it a parser. But when you start loving parsers, you kind of do. You point at
things, you're like, parser here, parser there, parser there. So an example is, you know, CPU
frontends, the ones that decode the instructions, they're really just very compact on-the-fly
parser generators for microcode, right? So if you look at ARM64, it's four byte words that you parse one at a time.
And the CPU does it as any disassembler would do it.
Sometimes it does it more than one word at a time, but it's basically a fancy parser.
x86 is a bit trickier as a front end because the instruction size is variable in x86.
So it's just a variable-width encoding, but it's still kind of a parser.
This type of parser, it needs to have context to decode the instructions properly.
So starting at the right place to decode the right instructions.
And in fact, if you're misaligned, you could decode totally different instructions,
but it's still effectively a very fancy parser.
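As a rough illustration of the fixed-width versus variable-width difference (a minimal sketch, not from the episode; the decode helpers named in the comments are hypothetical):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Fixed-width ISA (e.g. ARM64): every instruction is exactly four bytes,
    // so the "parser" just steps through the buffer one word at a time.
    void decodeFixedWidth(const std::vector<uint8_t>& code) {
        for (size_t i = 0; i + 4 <= code.size(); i += 4) {
            uint32_t word;
            std::memcpy(&word, &code[i], 4);
            // decodeArm64(word);  // hypothetical: extract opcode, registers, immediates
        }
    }

    // Variable-width ISA (e.g. x86): you only learn where the next instruction
    // starts by decoding the current one, so starting misaligned decodes a
    // completely different instruction stream.
    void decodeVariableWidth(const std::vector<uint8_t>& code) {
        size_t i = 0;
        while (i < code.size()) {
            size_t len = 1;  // placeholder; a real decoder computes the length from prefixes and opcodes
            // decodeX86(&code[i], len);  // hypothetical
            i += len;
        }
    }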
Yeah, that's one of the fun things that in security shellcode will try to do
is put a bunch of constants into code and jump to a weird offset
so that the instructions
get decoded differently with immediates in there. Cool. Sounds good. So just some kind of naive
background on the parsing process. As humans, we often write text down using alphabets that we use
to make composite words. And then we have parts of speech that put words in relation to one another to make sentences that have semantic meaning.
And parsing for computer programs is analogous or similar to that in a lot of ways.
We have scanned characters, like code points in Unicode.
We use those to make up tokens, which are kind of like words.
And we put those into a parse tree that often relates values, which are like nouns, and operators, which are like verbs, together.
And then we can define new composite things by putting all these primitive pieces together.
If you've never watched Guy Steele's Growing a Language talk, you should probably push pause and go watch that right now.
It's really good.
Often we use declarative languages to help us describe how to read in computer languages. So we use regular expressions, for example, to say how to scan tokens out of the character stream. And we use little languages like BNF to describe the grammars of languages: the kind of recursive definition where an arithmetic expression can be made up of strong arithmetic with multiplications or weak arithmetic with additions and subtractions, these kinds of recursive definitions.
And BNF is structured as these recursive rules for how to turn the tokens into a tree, and that's the parse tree that we get out of the parsing process. Yeah, it's interesting that you mentioned Unicode because, you know, I was saying you could just point at anything and call it a parser.
Unicode itself, you kind of need a parser for, even though it's a kind of self-restarting parser.
It's not that context-full, in the sense that the way Unicode is encoded, every byte in UTF-8, for example, tells you whether it's the start of a code point.
And so if you miss one code point, the next one you won't be able to miss, because of the way it's encoded.
It's a very simple thing to parse, but it's still something that you would, I would say, parse out of a byte stream.
So, you know, parsers are really everywhere.
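As a concrete illustration of that self-restarting property, here is a minimal sketch of classifying UTF-8 bytes: every byte says whether it starts a code point, so after an error you can skip continuation bytes until the next lead byte.

    #include <cstddef>
    #include <cstdint>

    // In UTF-8, every byte tells you what it is: ASCII, a lead byte that also
    // encodes the sequence length, or a continuation byte (10xxxxxx).
    inline int utf8SequenceLength(uint8_t b) {
        if (b < 0x80) return 1;              // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;    // 110xxxxx: 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3;    // 1110xxxx: 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4;    // 11110xxx: 4-byte sequence
        return 0;                            // 10xxxxxx: continuation, not a start
    }

    // Resynchronize after an error: skip continuation bytes until the next
    // byte that can start a code point.
    inline size_t resync(const uint8_t* p, size_t i, size_t n) {
        while (i < n && (p[i] & 0xC0) == 0x80) ++i;
        return i;
    }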
And what's interesting is the common wisdom out there is to never write a parser by hand, but to always use a parser generator. And I think you and I think that the common wisdom is just wrong. We would prefer to
always write a parser by hand, start with a recursive descent parser. And if you realize
that at some point that causes issues, go to a state machine to have more recursion and state
and things like that. I think recursive descent parsers are really easy to write, and transforming them to a state machine is not that bad.
For me, I started my first parsers a long time ago with the common wisdom, and I tried a bunch
of stuff. I tried ANTLR, I tried Lex and Bison, and I ended up with Boost Spirit, which was,
in retrospect, a bad idea. It was hard to write, hard to debug, not really flexible.
It gave you terrible error messages.
But as a trade-off, Boost Spirit gave you terrible compile times, too.
So that was great.
But the silver lining in this first experience of working with Boost Spirit all that time ago
was that I get to make fun of Bryce Lelbach, who's a friend of the show and ADSP podcast host,
because he worked on Spirit,
and I always make fun of him for that. So there's a silver lining in learning from
disregarding the common wisdom. Can you describe Boost Spirit a little bit, how you work with Boost Spirit? Yeah, yeah. So Boost Spirit, what's funny about Boost Spirit, one of the ways I make fun of it on Twitter every so often, is that Boost Spirit is kind of an EBNF DSL written in C++ template metaprogramming. So if you look at EBNF, it has all these special operators, like the space and plus and star and things like that. And some of these exist in C++ as operators. So someone went and made a clever C++ parser in template metaprogramming, a DSL using templates and operator overloading to express somewhat of an EBNF parser, replacing things like the space with something like the right shift operator, and with plus and star being prefix instead of postfix. And it looks and reads kind of like EBNF, but in C++ template metaprogramming, with a little optimizer in there and everything else. It's very cute. Kind of a silly idea, honestly.
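For a sense of how that reads, here is roughly what a tiny Spirit (Qi) grammar looks like, reconstructed from memory of the Spirit tutorial examples; exact headers and namespaces may vary between Spirit versions:

    #include <boost/spirit/include/qi.hpp>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    // EBNF-ish "double (',' double)*" written with overloaded C++ operators:
    // >> is sequencing, and the prefix * is the Kleene star.
    bool parseNumberList(const std::string& input, std::vector<double>& out) {
        auto first = input.begin();
        auto last = input.end();
        bool ok = qi::phrase_parse(first, last,
                                   qi::double_ >> *(',' >> qi::double_),
                                   qi::space, out);
        return ok && first == last;  // succeed only if the whole input was consumed
    }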
I would rather just write all of that stuff by hand. And I think we as an industry had a rude awakening to see all the .y files trying to interweave things into the parsing process with unclear lifetimes when you use the .y files. Memory errors, that type of stuff, is part of what made me love recursive descent parsers by contrast.
So while I make fun of Boost Spirit as kind of a weird thing,
I wouldn't say that the .y files are any better than what you end up with there.
So I really think that recursive descent parsers are the right way to go.
It's much more malleable, much nicer to write,
much easier to debug, and things like that.
Right, and for those folks who haven't used yacc before, the .y files are stylized C templates that the yacc parser can call into when events occur in the formation of the parse tree.
So you kind of create this hybridized C-looking file, but where things are happening that are driven by the yacc parser.
Yeah. I'm personally a big fan of code generation, right? Where you have a separate
custom DSL that you yourself parse and generate C or C++ code to do stuff that your then
code uses. I just think yacc files are not the best way to compose code generation out of things.
Right. And increasingly, people are trying to create and use simple versions that operate more like that.
So in yacc, the inner loop is driven by kind of the theoretical LALR grammar composition.
So there's kind of something you don't fully understand, that you'll get shift-reduce conflicts from, in your inner loop, but that kind of triggers events in this other file.
Things like PEG grammars that people are trying to use more of over time can end up kind of spitting out a C file that is a recursive descent parser, but we'll talk about some of the trade-offs there as well. And again, even though things like PEG grammars are spitting out recursive
descent parser implementations, which is a more modern technique, here we're mostly going to talk
about the hand-rolled recursive descent parsers because the trade-off there is that they work
well and they're understandable when you structure them as code yourself. And really, it just comes
down to a stylized way of structuring your code to make a useful and easily recognized data structure, which is the parse tree that comes out, or a nice error message that can use the context from where the error occurred.
So we will link to articles in the notes that give counterpoints on our view that this is probably just the right way to do it.
But parsing historically was one of these great
crossovers between fundamental computer science and practical applications. So you learn about
it in theory of computation class, that a regular expression has a certain level of power,
and then pushdown automata are like the next level of power that enable you to do things like
match parentheses when you have an open and closed parentheses,
right? You can't write an HTML parser, for example, using just regular expressions. So
parsers are strictly more powerful than what you can do with regular expressions.
And people, as a result of this important theoretical basis, can feel strongly that the underlying theory is important, to the point that you find yourself learning about Chomsky's theories and Cantor diagonalization proofs and things like that. And that is all important and great. CS fundamentals are great. But from a practitioner perspective,
there's also part of it where it says, well, if you just write code in a particular stylized way,
you can accomplish your parsing tasks. You can get done the job that you want to get done. And, if you find it daunting, you may be able to skip over some of those fundamentals
and know how to write this stylized form of code.
Yeah. And it's interesting that you talk about the theory of computer science and parsing,
because maybe 10 years ago or so, I used to work with someone who actually did his PhD
on parsing, right. It sounds silly, because nowadays parsing is a fairly well-known field.
But doing a PhD on that, he significantly advanced the state of the art in computer science in the 70s.
And so it really tells you that computer science itself is a fairly novel field.
There's a lot of stuff that was figured out fairly recently. And it's kind of interesting that, you know, we're in a field where things are so novel that it's possible to know people who invented things that you look at now and go, yeah, that's parsing, it's easy, of course. Yeah, it's pretty interesting. So the basic flow of parsing is that first you'll tokenize the characters, right? So it's a linear stream of characters to linear stream of tokens
transform. And then you make a tree out of those tokens. So scanning and tokenization are kind of
synonyms. But even that fact is interesting, because it brings up questions like is tokenization
context sensitive? And how parallelizable is it with respect to the character stream?
So sometimes you want to split scanning and parsing.
Sometimes you want to join them. It really depends on a variety of things. Scanning tends to have
very little context. So it usually just understands basic delimiters in the text itself.
Whereas parsing tends to understand the structure and the context of the grammar that you're dealing
with. And the parser itself then has a state to understand where you are in a state machine, usually to decode the grammar you're parsing.
And then it generates an output.
And again, that's dependent on what you're doing.
It could be an abstract syntax tree that you output from your parser, but it could be something else.
You might generate code directly if you're kind of a small and efficient C compiler, for example.
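A minimal sketch of that character-stream-to-token-stream step, with made-up token kinds, purely for illustration:

    #include <cctype>
    #include <string>
    #include <string_view>
    #include <vector>

    enum class TokenKind { Number, Identifier, Plus, Star, LParen, RParen, End };

    struct Token {
        TokenKind kind;
        std::string text;
    };

    // A scanner has very little context: it just walks the characters and
    // groups them into tokens based on simple rules and delimiters.
    std::vector<Token> tokenize(std::string_view src) {
        std::vector<Token> out;
        size_t i = 0;
        while (i < src.size()) {
            char c = src[i];
            if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
            if (std::isdigit(static_cast<unsigned char>(c))) {
                size_t start = i;
                while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                out.push_back({TokenKind::Number, std::string(src.substr(start, i - start))});
            } else if (std::isalpha(static_cast<unsigned char>(c))) {
                size_t start = i;
                while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
                out.push_back({TokenKind::Identifier, std::string(src.substr(start, i - start))});
            } else if (c == '+') { out.push_back({TokenKind::Plus, "+"}); ++i; }
            else if (c == '*')   { out.push_back({TokenKind::Star, "*"}); ++i; }
            else if (c == '(')   { out.push_back({TokenKind::LParen, "("}); ++i; }
            else if (c == ')')   { out.push_back({TokenKind::RParen, ")"}); ++i; }
            else ++i;  // a real scanner would report an error here
        }
        out.push_back({TokenKind::End, ""});
        return out;
    }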
So in a recursive descent parser, that state is partly
encoded in the data flow itself, right? So the function calls what's on the stack and so on,
encodes some of the state. And so when I was talking about transforming a recursive descent
parser into more of a state machine, what you end up doing is looking at that state that you had
through recursion and just encoding it, having your own stack to encode that state as a side data
structure.
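A toy sketch of that transformation, not from the episode: nested bracket matching is the classic pushdown example, and instead of letting the call stack of a recursive matcher hold the state, you keep an explicit side stack of what you are currently inside of.

    #include <string_view>
    #include <vector>

    // Iterative version of what a recursive matcher would do: the side stack
    // remembers which opening bracket we are currently nested inside.
    bool bracketsBalanced(std::string_view s) {
        std::vector<char> stack;  // explicit state instead of recursion
        for (char c : s) {
            if (c == '(' || c == '[') {
                stack.push_back(c);
            } else if (c == ')' || c == ']') {
                char open = (c == ')') ? '(' : '[';
                if (stack.empty() || stack.back() != open) return false;
                stack.pop_back();
            }
        }
        return stack.empty();
    }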
Right.
And that's a generally interesting thing in the way that we write sequential programs.
The program counter is effectively a state that marches forward, right?
And then takes jumps and things like this.
And then recursive functions are one particular way to structure your state with things on
the stack and the marching forward of the program
counter. Sounds good. So I think, like we were talking about, the tokens get formed from the
character stream. And so looking ahead at what token you see, ultimately is part of what determines
the grammar production, the grammar rule to apply. So if I'm looking ahead at the character stream,
and I'm holding something in my hand that I was just looking at and I see a plus sign, I know that this is probably
a binary plus operator in many common languages that wants to take the thing I just saw on the
left-hand side and add it with something that I then have to determine on the right-hand side.
And so what happens with these rules is these rules nest. I'll take the plus operator.
I'll say, okay, I must currently be doing a weak arithmetic expression, right? Because times binds more tightly than plus.
Plus is weaker.
And so then I would look at the right-hand side and I would ask, can I parse what I see
after the plus as a strong arithmetic operation, like a times, one that binds more tightly
than the plus that I'm
currently looking at as an operator. And this binding or precedence level determines whether
you nest the next expression at the same level or an increased level, like a tighter binding,
like the times would be with respect to the plus operator. So you set up your grammar rules so that things that nest more tightly
become more interior nodes inside the tree. And you do this using self-contained rules. So you'll
make like a weak arithmetic rule and that one will have a strong arithmetic rule inside on the left
hand side and the right hand side. So you kind of stylize your recursive functions or your BNF grammar to make this association clear.
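Here is a minimal hand-rolled sketch of exactly that structure (illustrative only, not from the episode): the weak rule for plus calls the strong rule for times on both sides, so the tighter-binding operations nest deeper; for brevity it evaluates directly instead of building tree nodes.

    #include <cctype>
    #include <string_view>

    // Grammar sketch, in rough BNF:
    //   weak    ::= strong ('+' strong)*
    //   strong  ::= primary ('*' primary)*
    //   primary ::= digit+ | '(' weak ')'
    struct ExprParser {
        std::string_view src;
        size_t pos = 0;

        char peek() const { return pos < src.size() ? src[pos] : '\0'; }
        void skipSpace() { while (std::isspace(static_cast<unsigned char>(peek()))) ++pos; }

        long parsePrimary() {
            skipSpace();
            if (peek() == '(') {
                ++pos;                      // consume '('
                long v = parseWeak();
                skipSpace();
                if (peek() == ')') ++pos;   // a real parser would diagnose a missing ')'
                return v;
            }
            long v = 0;
            while (std::isdigit(static_cast<unsigned char>(peek()))) v = v * 10 + (src[pos++] - '0');
            return v;
        }

        long parseStrong() {                // '*' binds tighter, so it nests deeper
            long lhs = parsePrimary();
            skipSpace();
            while (peek() == '*') { ++pos; lhs *= parsePrimary(); skipSpace(); }
            return lhs;
        }

        long parseWeak() {                  // '+' is the weaker, outermost rule
            long lhs = parseStrong();
            skipSpace();
            while (peek() == '+') { ++pos; lhs += parseStrong(); skipSpace(); }
            return lhs;
        }
    };

    // Usage: ExprParser{"1 + 2 * (3 + 4)"}.parseWeak() evaluates to 15.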
So in JavaScript land, beyond these kind of basic flows of how grammars work, there are also cool
and fancy things like you would first outline all the functions in a file. So you would say,
there's like hundreds of functions in this file, let me see if I can peek at where the beginning
and end of the functions are, that I can lazily load in the tree that describes the syntax for this particular function. Let's say
I never call a function. Do I want to spend cycles parsing it and doing things with it if I never end
up calling it? But there can also be interesting challenges there because tokenization, like you
were talking about earlier, if you're inside of a regular
expression, you do different things in the scanner than if you're in normal JavaScript code. So you
might switch your scanner to RegEx mode or back into normal JavaScript mode. Or if you find yourself
in the middle of a multi-line comment, there could be arbitrary stuff in there that doesn't do
anything. And so these are some of the interesting challenges in doing really fast or advanced parallelized scanning or tokenization.
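A tiny sketch of that scanner-mode idea (hypothetical, nothing like a real JavaScript scanner): the tokenizer tracks which sub-language it is in, and some modes ignore almost everything.

    #include <cstddef>
    #include <string_view>

    // The scanner carries a mode, because the same characters mean different
    // things inside a regex literal or a block comment than in normal code.
    enum class ScanMode { Normal, RegexLiteral, BlockComment };

    // Example of the BlockComment mode: ignore everything until "*/".
    size_t skipBlockComment(std::string_view src, size_t i) {
        while (i + 1 < src.size()) {
            if (src[i] == '*' && src[i + 1] == '/') return i + 2;  // back to Normal mode
            ++i;
        }
        return src.size();  // unterminated; a real scanner would report an error here
    }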
Yeah, and what's interesting with that as well is when you try to parallelize things, you have to have enough work inside of each parallel worker to do something useful, right?
So it's a classic parallelism concurrency trade-off.
So do you go off and optimize all the functions at this point? So you have a quick parallel parser that figures out where the function starts and ends are.
And then do you fork off work to N threads to quickly optimize all of them, or maybe do some extra amount of parsing or something like that?
Or do you just keep that information around and do that lazily with an interpreter when you end up executing it, and then just-in-time compile it. And then do you
just-in-time compile in the foreground thread or in background threads or something like that
is, again, part of the trade-off that you have to do. Parallelization is a good way to use multiple
cores, but for lower latency page loads, maybe that's not what you want to do. Maybe you want
to block and compile right then and there on the foreground thread and do the work as it comes in
or something like that. So it's always kind of an interesting trade-off to see whether you tie in this case parsing with optimization or other
things like that, some amount of context generation. Yeah, that makes a lot of sense. So to kind of
pull back a little bit from the advanced techniques that we're talking about and go back more to the
fundamentals of parsing, one of the very important aspects of parsing is this concept of backtracking.
So just like you speculatively might parallelize for parsing JavaScript on a page or something
like this, in the parsing process itself, when there's something ambiguous to do, you see a
less than operator, and you're not sure if that's starting a template instantiation in C++, or if it's actually doing a less than operation, right? When you have that open angle, it could
mean one of two things. So you have this ambiguous choice. And backtracking basically means capturing
your current point in the process of parsing, and being able to revisit that choice and make a different choice.
So you might speculate, okay, I think that this open angle bracket is going to be an actual less than operator. And you might go try to parse as if that's the case.
And if you find out that that was an error, then you might go back, restore your original context,
and then go back down the other path because you didn't know which one it was going to be. And maybe there was no way for you to look ahead in the token stream to know
which one you should definitely go down. So some languages can be more regular and require
less backtracking or no backtracking. And there's a whole description of what languages can have
this property in the theory. And it's an important concept that gets introduced
when your grammar isn't fully determined by lookahead tokens.
So this is a property that many of the real-world languages we use in computers do have.
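A minimal sketch of that capture-and-restore idea in a hand-rolled parser (illustrative only): the current point is just a cursor, so backtracking is saving and restoring it.

    #include <cstddef>
    #include <functional>

    struct ParseState {
        size_t tokenIndex = 0;  // the "current point" in the token stream
    };

    // Try an ambiguous alternative; if it fails, restore the saved point so the
    // caller can try the other interpretation.
    bool tryParse(ParseState& state, const std::function<bool(ParseState&)>& alternative) {
        ParseState saved = state;              // capture the current point
        if (alternative(state)) return true;   // committed: keep whatever was consumed
        state = saved;                         // backtrack: undo everything the attempt consumed
        return false;
    }

    // Usage sketch (hypothetical functions), for the '<' ambiguity in a C++-like language:
    // if (!tryParse(s, parseTemplateArgumentList)) { parseLessThanComparison(s); }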
So Rust also relatedly has this TurboFish operator
that helps resolve the ambiguity I mentioned in C++.
So instead of seeing the open angle symbol and saying, oh, maybe that's
a template instantiation, there's an explicit operator, which is double colon open angle,
which says to the parser, the thing on the left-hand side of this operator is a type name.
I'm telling you that it is. And so that's kind of an interesting solution for making it less ambiguous, but at the cost of things having more punctuation in the lines.
And Haskell also has the ability for users to create whole new infix operators, which is cool.
You can extend the grammar with the Kirby operator.
Actually, I think you can't create the Kirby operator.
I think I tried to do that one time.
But basically, arbitrary punctuation can become new infix operators.
And things like Camlp4, I know, were in the history of the Rust macro creation, where
things were extensible in user space so that the syntax could be extended and you could
build new parse trees and constructs.
So for backtracking in general, but also in parsing, you have these decision points where
you may have to throw away work that you do in some subtree and then go on to make a different
choice until you figure out what the actual subtree was.
So that's similar to all transactional things.
You make a continuation point that you might choose to revisit until you decide that, okay,
I actually finished and I never need to revisit, in which case you kind of commit that and you pass your point of no return.
Yeah, and as you talk about backtracking, what it reminds me is designing a parser is something
that you want to do, ideally, while you design the grammar that you're going to parse itself,
right? A lot of grammars are designed in a way that ought to be easy to parse. And some of the
thinking is, it's not just easy for the programmer to write a parser, it's also easy for a human to understand what's written,
right? So in C++, you were talking about the less than operator, those familiar with C++ in a
template context, you end up having to specify typename or template. So the typename and template keywords have a few different meanings, and in a template context they disambiguate certain things.
And so when C++ was designed, its grammar was realized to be quite complex. And those keywords
are there to help disambiguate some of the things, not just for parsers, but in some cases for humans
as well. Right. I remember when they were creating Go, that this was a big focus, that they wanted
to have a grammar that was not just for humans, but also easy to parse. And
so they ended up with things like types following identifiers as a result of this.
Yeah, yeah. And the basic thing that we're burying in talking primarily about recursive descent parsers is whether parse trees should grow from the leaves to the root or vice versa. If you look at a graph, like the leaves at the very bottom of the graph versus up towards the root at the top, the joke here is that real trees probably
have a strong preference in which direction things grow. But in computer science, you get to choose
how trees grow. So it's kind of interesting. And in LL and recursive descent, that's usually top
down, right? So it starts at the root, whereas LALR and
bottom-up parsing starts at the leaves, so the terminals of your grammar. And there's a trade-off
in choosing each of these based on what it is that you're parsing, what's the language, what's the
grammar that you're looking at. Sometimes recursing, as it's called, from the left versus from the right,
will have more backtracking and be less efficient in those circumstances. So again, designing your
grammar with the parser makes trade-offs in how things are efficient and how you implement the
parsing. But ultimately, it comes down to how ambiguities are resolved and when they're detected
during the parsing process. So shift-reduce conflicts are notorious here, and they come
from ambiguities in the bottom-up merging process.
And what's neat about recursive descent parsers is that you're really just writing code, so you resolve these types of conflicts easily with the code that you write.
That's one of the things I really like about doing that.
If you're not really careful with a recursive descent parser, you don't even notice you're doing it.
You're just like, oh, there's an ambiguity here. I'll write code to just handle it. And it just kind of happens.
It falls out of the writing process. But what's bad about recursive descent parsers is the same
thing that's good about it. You're just writing code. So you can do whatever. Sometimes that's
nice. Sometimes it's a great way to make a mess, right? So it's a very powerful tool. And with
great power comes great responsibility.
Right. You could have someone who said, the grammar parses this way when it's a Wednesday after 4 p.m. in this time zone. But that would be a pretty terrible parser to write,
even though it would be possible in a recursive descent parser. What the structure of these
declarative things is forcing you to do is say, oh no, it follows this regular structure, and I have ways of resolving ambiguities that I specify in my specification up front.
Okay, so to go back to what we were talking about at the beginning of the trade-off between
parser generators and hand-rolled recursive descent parsers, obviously there's a bunch
of trade-offs involved. There's no way to get away from the language of trade-offs, right? So some of the things that you might consider are performance.
So when your parser generator creates something that has a certain performance in parsing,
what are your tools and techniques for getting more performance out of the system? How do you
parse more megabytes of text per second into your data structure or something like this? And again, when you have the code in your hand with a recursive descent parser,
we can use a lot of our conventional techniques for optimization, whereas restructuring the grammar
or perhaps providing hints to the parser generator are the ways that you might do it in a generated
scenario. When you're writing code, you kind of have more control over memory usage
and how things get structured.
Like if you have peak memory limits
that you want to stay within,
or if you have some kind of divide and conquer
that you might be able to do.
These things, again, when you have the code in your hand,
it kind of gives you the control and the power,
but at the cost of some additional code writing
and extra work for you.
Error messages are a pretty important point that I think we should talk about more.
So things like inline diagnostics.
So let's say you get part of the way through parsing your program,
but then you find that there's some problem.
Inline diagnostics on a partially completed parse process are actually very important, to be able to feed back to the user:
I got to this point, but then I saw that this thing in particular went wrong.
Yeah, and tracking states like this as you parse in kind of a sidecar for diagnostic purposes is
a really powerful tool in recursive descent parsers. So if you look at Clang, the C++ compiler,
it always tracks potential errors or diagnostics in a side state while it's
parsing. It's always keeping track of a bunch of things. And if it encounters a problem, it'll
print it out if the diagnostic is enabled. Otherwise, it'll just kind of keep going.
And this also allows it to print out diagnostic context more easily because it keeps track of
context while it's doing the parsing. And also diagnostic information itself can lead to error correction as well.
So to keep making progress, right?
You see a problem, you want to keep parsing afterwards, to not just give one diagnostic and give up.
Some languages want to do that, right?
Give you potentially useful errors, not just the first one in your compiler invocation.
Particularly if the compile time is slow, you want to have a handful of errors that you can address,
not just the first one to force recompilation.
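A minimal sketch of that pattern, not Clang's actual machinery: diagnostics accumulate in a side structure, and after an error the parser skips ahead to a synchronization point such as a semicolon so it can keep going and report more than one problem.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Diagnostic {
        size_t line = 0;
        size_t column = 0;
        std::string message;
    };

    // Side state carried along while parsing; problems are recorded here and
    // printed later, instead of aborting at the first error.
    struct ParserDiagnostics {
        std::vector<Diagnostic> diags;
        void report(size_t line, size_t column, std::string message) {
            diags.push_back({line, column, std::move(message)});
        }
    };

    // A tiny stand-in token type for the sketch.
    enum class TokKind { Semicolon, Other };
    struct Tok { TokKind kind; size_t line; size_t column; };

    // "Panic mode" recovery: after reporting an error, skip tokens until a
    // likely statement boundary so parsing can continue and surface more errors.
    size_t synchronizeToSemicolon(const std::vector<Tok>& tokens, size_t i) {
        while (i < tokens.size() && tokens[i].kind != TokKind::Semicolon) ++i;
        return i < tokens.size() ? i + 1 : i;  // resume just after the ';'
    }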
And it's kind of similar to the lax mode in HTML quirks parsing.
If you know the history of HTML, there's been these kind of back and forths with XHTML and HTML5 and the previous versions, where people were very religious about, HTML has to parse exactly correctly and on any small error you shouldn't render a web page, versus, you know, HTML should be parsable kind of whatever. And if you look at HTML parsers nowadays, the ones in browsers, they end up doing very interesting things in quirks mode to be able to handle slightly misformed HTML that happens to work.
Right. And I think the ability to handle error messages and error cases very well using partially constructed information does end up hurting the ability to modularize a little bit
because you do want to kind of reach into the surrounding context of what didn't finish yet,
how much of it did get done. And so that can often prevent you from creating little, very easily pluggable modules.
And we'll talk a little bit more about that.
I have seen arguments against recursive descent parsers in the sense that what if your grammar
is changing a lot?
It could make you do significant changes to that recursive descent parser code that you
wrote.
But in my personal experience, at least, things evolve structure over time in a way that can
pretty easily get composed function-wise.
So these functions are defining productions or helpers inside of your grammar.
And usually productions don't disappear entirely or get like 100% reworked in the language
versus getting tweaked and/or refactored. So declarative specifications like BNF, Backus-Naur Form, which we never defined as a term, are good, but there's also
some trade-offs around driving everything off of them, right? So if you want to have multiple
versions of your language, for example, let's say you want to be able to ingest version one,
make some internal AST changes, and then
format out version two using a formatting methodology like Clang Format or something
like this. It's nice to be able to have simultaneous parsers and formatters there
with common subroutines between them, right? Even though they're different subversions of
the language, which may have differing BNF, right? And so on the point of modularity,
there's also parser combinators, which are really cool. You can kind of make these little functions
that have the same API for doing things, right? This is the general idea of combinators. They're
kind of Lego bricks that you can plug together because they have a uniform interface. And the
challenge with those is because they have a uniform interface, they
can't pass sideband information between each other very easily, right? Because they all kind of take
a stream of tokens in, say, and produce some subfragment of the AST out. So handling that
global context of error messages and things can be more difficult. How does that sideband information or how does that global scope come into play? And although combinators can be difficult for getting the best error message
experience in my experience, it's really nice when the user experience is less important because you
can make them very quickly in the same sense as parser generators. And there was one cool case study where,
because formatting is the inverse of parsing,
you can make one combinator that both parses
and then formats out its input.
And we made some of those for TPU assembly, for example.
So you can create like a little environment of data
that came from parsing the assembly,
and you can use that same environment
to format out the assembly as a result of that.
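For readers who have not seen parser combinators, here is a minimal sketch of the Lego-brick idea (illustrative, not the TPU assembly code): every parser has the same shape, so generic functions can glue them together.

    #include <functional>
    #include <optional>
    #include <string_view>

    // A parser "brick": it takes a cursor, consumes input on success, and
    // returns the parsed value, or nullopt on failure.
    template <typename T>
    using Parser = std::function<std::optional<T>(std::string_view&)>;

    // Match a literal string.
    inline Parser<std::string_view> lit(std::string_view s) {
        return [s](std::string_view& in) -> std::optional<std::string_view> {
            if (in.substr(0, s.size()) == s) {
                in.remove_prefix(s.size());
                return s;
            }
            return std::nullopt;
        };
    }

    // Alternation: try p1; if it fails, rewind and try p2. Because every brick
    // has the same interface, this works for any parsers you hand it.
    template <typename T>
    Parser<T> alt(Parser<T> p1, Parser<T> p2) {
        return [p1, p2](std::string_view& in) {
            std::string_view saved = in;
            if (auto r = p1(in)) return r;
            in = saved;  // restore the cursor before trying the other alternative
            return p2(in);
        };
    }

    // Usage sketch: Parser<std::string_view> yesOrNo = alt<std::string_view>(lit("yes"), lit("no"));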
So it's nice that these functions can be easily inverted when you plug them together in this combinator sort of way. Yeah. And so you and I are engineers and we like to get our hands dirty. And the way I like to describe this, as the saying goes, is it's like wrestling in the mud with a pig: at some point you realize that the pig likes it. And we like to get our hands dirty with code. So let's talk about a few times we've got our hands dirty with parsers.
First example that comes to mind is you and I used to work on a database query language
in the style of array programming languages. And in that case, it was parsed right to left.
So you make a right-leaning tree instead of the left-leaning one, meaning that the outermost
expression was on the right-hand side of the screen.
So that was an interesting kind of code base.
Yeah, that was definitely the first time that I had seen this idea
that operators would apply from right to left,
but it made for a pretty interesting parser.
Yeah, there's also features that were somewhat notorious
that we got to get some exposure to in JavaScript,
like its automatic semicolon insertion, which had some funny grammatical properties.
So normal BNF grammars, they don't have this notion of kind of the negative lookahead in a regular expression.
You might have seen something like that: there cannot be a line terminator here, as a terminal. But that's how automatic semicolon
insertion was described in the JavaScript spec.
And so in Firefox SpiderMonkey, we had a recursive descent parser.
So it was very clear how that kind of production rule extension could be handled.
But I think JavaScriptCore actually had a lex- and yacc-based parser. And I'm not exactly sure how it handled things like this negative token lookahead. But that would actually be interesting to go back and find out: how did it deal with this extension to the BNF grammar that was in the JavaScript spec. Yeah, automatic semicolon insertion is definitely one of those weird quirks of JavaScript that nobody ends up relying on.
But it makes it a very interesting language to parse just because if there should have been a semicolon here, then just the parser is supposed to put it there.
So it's kind of interesting, kind of an interesting accident.
I think it's more if it would be an error and putting a semicolon there would make it not an error,
then you should put a semicolon there.
I think that's the rule.
Yeah.
Yeah.
And so when I worked on Chrome's native client, that was a security sandbox.
The parsing had to be correct.
Otherwise, it was a potential security vulnerability, right? So the idea of Native Client back when it started in 2009 (there's a really great paper describing it) was, you know, let's try to run native code on the web. Because at the time, they'd written V8,
and they realized there's kind of a wall you hit with JavaScript, regardless of how much your
compiler can optimize things for JavaScript. And you kind of want to run more powerful things on
the web. Native Client isn't what ended up being standardized, right? WebAssembly was. But the
core idea of running native code on the web was really interesting.
And there were different experiments, Native Client, Portable Native Client.
At Mozilla, there was asm.js, and those together ended up creating WebAssembly.
And in all those contexts, the parser is really, really critical to get right.
If you don't get it right, there's a vulnerability. But when I worked on Native Client, the input was x86-32, x86-64, and ARM.
So there's really tricky things here, and parsing those ISAs correctly is important to security because then the CPU is just going to execute it.
So if you haven't validated the executable correctly, something else is going to get executed, which might be a vulnerability.
So one fun bug that the team had to resolve early, way before launch, is they were dealing with pushf and popf instructions in x86.
And those tend to just push the flags, right?
So overflow and flags like that.
But because of the way x86 is made, that also has interesting flags
that nobody really uses. So for example, there's a flag called the direction flag, which you can
save and restore that sets the direction of rep prefix instructions. So rep is the repeat prefix,
and that ends up being used in things like memcpy. And so you can change the direction flag,
and that makes memcpy go backwards, which is kind of interesting, but a
great way to write vulnerabilities. If you're trying to bounds check things, you see a rep
instruction, you know it's going to go, you know, in increasing addresses. Well, if you set the
flags differently, it goes backwards. So that will allow you to go out of bounds in certain ways and
create an exploit chain. Another interesting part of parsing that we had to get right was in ARMv7.
There's interesting features in ARMv7.
One of them is you can change endianness, right?
So you can change the endianness of the CPU,
or you can change the instructions to a different instruction set.
So ARM has something called ARM32, as well as Thumb and Thumb2.
And you can change from one to the other during execution.
And so obviously, if you see those instructions that change endianness or the change instruction set,
one thing you can do is to just disallow them. Because if you try to handle them,
it becomes very, very tricky to do that very well. So it's a giant context change for a parser. It's
not something we wanted to allow. So we just disallowed it in the parser. There's other
interesting things like Jazelle and other stuff in ARM that we won't talk about. And there's other cool stuff that came out
of the NACL team, the native client team. So at some point, the team wrote an auto-generated x86
parser. They had a description of all of x86 that we wanted to handle, and they auto-generated the
parser from that description. And the code that they generated was so big that it was, at the
time, the biggest set of functions inside of Chrome.
And it blew up Visual Studio really bad.
Visual Studio has some behavior where it just gives up and generates really terrible code.
And so the Chrome parser was extremely slow until they figured out how to tune the code generator to split up what it generated so that Visual Studio wouldn't choke on it and generate terrible code.
So that was kind of interesting as well.
Yeah, that is an important point to bring up as one of the potential pitfalls in recursive
descent parsing.
So because you have these production rules for your functions, your functions represent
production rules, and they're all co-recursive.
They're all calling into each other.
And then some production rules will be hotter than others. Some will be
smaller handling code that gets inlined into kind of outer sites that are trying to implement parts
of the grammar. So sometimes you can get these giant mega frames where, you know, there's one
function that does some grammatical production and it ends up with tons and tons of stack data
whenever you do call this particular function.
And so you can have giant frames
that end up leaping over guard pages, or other terrible things, or don't cause the OS to page in the next stack page appropriately.
So that's something you have to watch out for
is these giant stack frames or stack overflows
that cause segmentation faults.
Yeah. And another cool parser that I worked on: I wrote a good chunk of the parser for WebAssembly inside of WebKit, and the testing and things.
So I wrote kind of the core of it and then everyone else kind of wrote the rest of the
parsing.
And that is also security sensitive, right?
WebAssembly is kind of an interesting kind of stack machine, little language, but it's all byte oriented when you parse it. And it's got to be
secure. If you don't parse it properly, you end up having vulnerabilities. And you also, in those
cases, want to have good error messages, because it's a bunch of bytes that you push in. And so
for a developer to be able to look at the error messages with WebAssembly is really important.
So what I liked doing to solve the parsing of WebAssembly and WebKit was using a recursive
descent parser. But what was neat is I used the C++ std::expected type to return either a result or an error. And what was nice there, in the recursive descent parser with std::expected, is it's basically impossible to ignore the errors when you do this. If you set it in a
mode where like any kind of access of the wrong thing is going to fail, where you can't disregard
the return value of something, it makes it almost correct by construction when you write the parser
or not necessarily correct by construction, because you still have to treat the information
correctly, right? But it makes it much easier to do the right thing and harder to do
the wrong thing. And if you compose it well, it's hard to get things like integer overflows or
out-of-bounds accesses or things like that when you're doing the parsing because it's just kind
of constructed out of basic parts that you put together. So it ends up being interesting to
write a security-sensitive parser with kind of modern tools that way.
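A sketch of that shape, not WebKit's actual code, just the pattern: each small decode routine returns a std::expected (C++23), so the error is part of the type and cannot be silently dropped. The LEB128-style integer decoding here is a common WebAssembly-ish example, not the real implementation.

    #include <cstddef>
    #include <cstdint>
    #include <expected>   // C++23
    #include <span>

    enum class WasmParseError { UnexpectedEnd, ValueTooLarge };

    // Decode a LEB128-style unsigned 32-bit integer from a byte stream.
    // The error is part of the return type, so callers cannot ignore it silently.
    std::expected<uint32_t, WasmParseError> readVarU32(std::span<const uint8_t> bytes, size_t& offset) {
        uint32_t result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (offset >= bytes.size()) return std::unexpected(WasmParseError::UnexpectedEnd);
            uint8_t byte = bytes[offset++];
            result |= static_cast<uint32_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) return result;
        }
        return std::unexpected(WasmParseError::ValueTooLarge);
    }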
Yeah, that's a nice example story.
I like standard expected, and I hope that there's also a question mark operator for
it, like in Rust, that we can use one day.
So if there are C++ committee people listening.
There isn't for now, but we added the monadic operators to it for C++23, the same way we added them for std::optional. So there's these nice kind of monadic operators that are coming in for C++23 that kind of augment both optional and expected,
but doesn't quite have the same power as the question mark operator. There's a separate
proposal about the question mark operator or something like it, but it's somewhat novel to
C++. And so I'm not sure it'll make it to C++ 26. We'll see. Yeah, having to write if for every function call that can fail is just terrible. But that's just
my personal opinion.
What's nice about it is, in C++, you can write if, you know, auto something equals an expected, and then use the expected result in the if and the unexpected one in the else, right? So the scoping of the variable: you create that variable inside of the parentheses of the if, use the value result in the if block, and then use the error result in the else block. So that's kind of nice. It's a neat little syntactic thing in C++.
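Concretely, that if-with-initializer pattern looks something like this minimal sketch, with a hypothetical parse function:

    #include <expected>
    #include <string>

    struct Module {};
    enum class ParseError { BadMagic, UnexpectedEnd };

    // Hypothetical parse function; a trivial stand-in body so the sketch is self-contained.
    std::expected<Module, ParseError> parseModule(const std::string& bytes) {
        if (bytes.empty()) return std::unexpected(ParseError::UnexpectedEnd);
        return Module{};
    }

    void example(const std::string& bytes) {
        // The variable lives in the if/else scope: value in the if block, error in the else block.
        if (auto result = parseModule(bytes)) {
            Module& m = *result;            // success path
            (void)m;
        } else {
            ParseError e = result.error();  // failure path
            (void)e;
        }
    }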
Yeah, one other story that I think is relevant: the team I'm currently on collaborates on a project that parses SystemVerilog using ANTLR v4, I think is the latest, which I think is the largest ANTLR grammar as
well. And we've been talking a little bit about the trade-offs we see there, and folks on my team
have been doing performance optimization work, which, as we were talking about, sometimes means
restructuring the grammar so that the generator can do better with it. And, you know, there's all these same interesting questions that come up around
memory usage, and you get a lot of power from the grammar generating. That would be a lot of work to
write in a recursive descent parser form, especially as the grammar gets large,
but there's clearly a trade-off space there. So it's been fun to also kind of relive this
understanding of the trade-offs. One last thing I think we wanted to touch on was the difference between parse trees and ASTs. So
we kind of threw those terms around a little bit here. But generally, I think that's not
formally defined, but usually a parse tree will be kind of the full tree that comes out of the
grammar. Like the grammar rules literally have a node for every rule inside of the grammar, whereas an AST will be kind of a more reduced form. So if I have a bunch of adds strung together and they're all at the same precedence level, if I make an n-ary node that has all the operands to the string of adds, that's not strictly the original grammar, but it's an equivalent, simplified representation. So I think that's usually how people tend to talk about ASTs versus parse trees. Yeah, well, cool. I think that's a good amount of
knowledge about parsers that we've talked about today. So how about we adjourn and meet soon for
another deep dive into an interesting geeky topic? Sounds great. Great catching up with you again.
Yeah, same here. All right. See you later. See you, JF.