Transcript
Hey, JF. It's a warm and fuzzy and parsy time of year.
Hey, Chris. So today I suggest we only talk about parsers, not fuzzers. But, you know, parsley is always good on holiday food.
Sounds good. All right. Welcome, everyone, to the disclaimer.
We only know so much, but we'll try not to say things that are wrong, and we'll follow up with errata, corrections, and clarifications as we find out about them and put them on the website, tlbh.it. We do the lifelong learners thing, and we're always trying to push the boundaries of things
we know about and have conversations about what's interesting to develop and hone our
understanding.
So today, as JF said, we're going to talk about parsers.
And we're mostly going to be talking about some of our practical experience with parsers.
Addressing the parsing field as a whole is very hard because it's both
theoretically nuanced and large. One of the criticisms you often hear about the original
edition of the Dragon Book was that it was super front-loaded with a lot of theoretical parser content, which kind of tripped folks up in compilers class. A lot of that book was on syntax-directed code generation as a backend flow. So it's kind of understandable that it did a lot on parsing.
But that's often also not what a lot of people wanted to understand from compilers class.
So this is always a little bit of an interesting proposition.
So it's a little heretical, but part of what we're going to claim today
is that parsers in a practical sense can just be viewed as kind of a stylized pattern of coding
where you can easily create
your own parsing scenarios and parsing solutions and reuse things if you just accept recursive
descent style parsing as the way to do it. Right. And what's interesting is parsers really,
once you get into them, they show up all over the place, right? So people could say, like,
you can't just point at
anything and call it a parser. But when you start loving parsers, you kind of do. You point at
things, you're like, parser here, parser there, parser there. So an example is, you know, CPU
frontends, the ones that decode the instructions, they're really just very compact on-the-fly
parser generators for microcode, right? So if you look at ARM64, it's four byte words that you parse one at a time.
And the CPU does it as any disassembler would do it.
Sometimes it does it more than one word at a time, but it's basically a fancy parser.
x86 is a bit trickier as a front end because the instruction size is variable in x86.
So it's just a variable-width encoding, but it's still kind of a parser.
This type of parser, it needs to have context to decode the instructions properly.
So starting at the right place to decode the right instructions.
And in fact, if you're misaligned, you could decode totally different instructions,
but it's still effectively a very fancy parser.
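As a rough illustration of the fixed-width versus variable-width difference (a minimal sketch, not from the episode; the decode helpers named in the comments are hypothetical):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Fixed-width ISA (e.g. ARM64): every instruction is exactly four bytes,
    // so the "parser" just steps through the buffer one word at a time.
    void decodeFixedWidth(const std::vector<uint8_t>& code) {
        for (size_t i = 0; i + 4 <= code.size(); i += 4) {
            uint32_t word;
            std::memcpy(&word, &code[i], 4);
            // decodeArm64(word);  // hypothetical: extract opcode, registers, immediates
        }
    }

    // Variable-width ISA (e.g. x86): you only learn where the next instruction
    // starts by decoding the current one, so starting misaligned decodes a
    // completely different instruction stream.
    void decodeVariableWidth(const std::vector<uint8_t>& code) {
        size_t i = 0;
        while (i < code.size()) {
            size_t len = 1;  // placeholder; a real decoder computes the length from prefixes and opcodes
            // decodeX86(&code[i], len);  // hypothetical
            i += len;
        }
    }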
Yeah, that's one of the fun things that in security shellcode will try to do
is put a bunch of constants into code and jump to a weird offset
so that the instructions
get decoded differently with immediates in there. Cool. Sounds good. So just some kind of naive
background on the parsing process. As humans, we often write text down using alphabets that we use
to make composite words. And then we have parts of speech that put words in relation to one another to make sentences that have semantic meaning.
And parsing for computer programs is analogous or similar to that in a lot of ways.
We have scanned characters, like code points in Unicode.
We use those to make up tokens, which are kind of like words.
And we put those into a parse tree that often relates values, which are like nouns, and operators, which are like verbs, together.
And then we can define new composite things by putting all these primitive pieces together.
If you've never watched Guy Steele's Growing a Language talk, you should probably push pause and go watch that right now.
It's really good.
Often we use declarative languages to help us describe how to read in computer languages. So we use regular expressions, for example, to say how to scan tokens out of the character stream. And we use little languages like BNF to describe the grammars of languages: the kind of recursive definition where an arithmetic expression can be made up of strong arithmetic with multiplications or weak arithmetic with additions and subtractions, these kinds of recursive definitions.
And BNF is structured as these recursive rules for how to turn the tokens into a tree, and that's the parse tree that we get out of the parsing process. Yeah, it's interesting that you mentioned Unicode because, you know, I was saying you could just point at anything and call it a parser.
Unicode itself, you kind of need a parser for, even though it's a kind of self-restarting parser.
It's not that context-full, in the sense that the way Unicode is encoded, every byte in UTF-8, for example, tells you whether it's the start of a code point.
And so if you miss one code point, the next one you won't be able to miss, because of the way it's encoded.
It's a very simple thing to parse, but it's still something that you would, I would say, parse out of a byte stream.
So, you know, parsers are really everywhere.
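As a concrete illustration of that self-restarting property, here is a minimal sketch of classifying UTF-8 bytes: every byte says whether it starts a code point, so after an error you can skip continuation bytes until the next lead byte.

    #include <cstddef>
    #include <cstdint>

    // In UTF-8, every byte tells you what it is: ASCII, a lead byte that also
    // encodes the sequence length, or a continuation byte (10xxxxxx).
    inline int utf8SequenceLength(uint8_t b) {
        if (b < 0x80) return 1;              // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;    // 110xxxxx: 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3;    // 1110xxxx: 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4;    // 11110xxx: 4-byte sequence
        return 0;                            // 10xxxxxx: continuation, not a start
    }

    // Resynchronize after an error: skip continuation bytes until the next
    // byte that can start a code point.
    inline size_t resync(const uint8_t* p, size_t i, size_t n) {
        while (i < n && (p[i] & 0xC0) == 0x80) ++i;
        return i;
    }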
And what's interesting is the common wisdom out there is to never write a parser by hand, but to always use a parser generator. And I think you and I think that the common wisdom is just wrong. We would prefer to
always write a parser by hand, start with a recursive descent parser. And if you realize
that at some point that causes issues, go to a state machine to have more recursion and state
and things like that. I think recursive descent parsers are really easy to write, and transforming them to a state machine is not that bad.
For me, I started my first parsers a long time ago with the common wisdom, and I tried a bunch
of stuff. I tried ANTLR, I tried Lex and Bison, and I ended up with Boost Spirit, which was,
in retrospect, a bad idea. It was hard to write, hard to debug, not really flexible.
It gave you terrible error messages.
But as a trade-off, Boost Spirit gave you terrible compile times, too.
So that was great.
But the silver lining in this first experience of working with Boost Spirit all that time ago
was that I get to make fun of Bryce Lelbach, who's a friend of the show and ADSP podcast host,
because he worked on Spirit,
and I always make fun of him for that. So there's a silver lining in learning from
disregarding the common wisdom. Can you describe Boost Spirit a little bit, how you work with Boost Spirit? Yeah, yeah. So Boost Spirit, what's funny about Boost Spirit, one of the ways I make fun of it on Twitter every so often, is that Boost Spirit is kind of an EBNF DSL written in C++ template metaprogramming. So if you look at EBNF, it has all these special operators, like the space and plus and star and things like that. And some of these exist in C++ as operators. So someone went and made a clever C++ parser in template metaprogramming, a DSL using templates and operator overloading to express somewhat of an EBNF parser, replacing things like the space with something like the right shift operator, and with plus and star being prefix instead of postfix. And it looks and reads kind of like EBNF, but in C++ template metaprogramming, with a little optimizer in there and everything else. It's very cute. Kind of a silly idea, honestly.
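For a sense of how that reads, here is roughly what a tiny Spirit (Qi) grammar looks like, reconstructed from memory of the Spirit tutorial examples; exact headers and namespaces may vary between Spirit versions:

    #include <boost/spirit/include/qi.hpp>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    // EBNF-ish "double (',' double)*" written with overloaded C++ operators:
    // >> is sequencing, and the prefix * is the Kleene star.
    bool parseNumberList(const std::string& input, std::vector<double>& out) {
        auto first = input.begin();
        auto last = input.end();
        bool ok = qi::phrase_parse(first, last,
                                   qi::double_ >> *(',' >> qi::double_),
                                   qi::space, out);
        return ok && first == last;  // succeed only if the whole input was consumed
    }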
I would rather just write all of that stuff by hand. And I think we as an industry had a rude awakening to see all the .y files trying to interweave things into the parsing process with unclear lifetimes when you use the .y files. Memory errors, that type of stuff, is part of what made me love recursive descent parsers by contrast.
So while I make fun of Boost Spirit as kind of a weird thing,
I wouldn't say that the .y files are any better than what you end up with there.
So I really think that recursive descent parsers are the right way to go.
It's much more malleable, much nicer to write,
much easier to debug, and things like that.
Right, and for those folks who haven't used yacc before, the .y files are stylized C templates that the yacc parser can call into when events occur in the formation of the parse tree.
So you kind of create this hybridized C-looking file, but where things are happening that are driven by the yacc parser.
Yeah. I'm personally a big fan of code generation, right? Where you have a separate
custom DSL that you yourself parse and generate C or C++ code to do stuff that your then
code uses. I just think yacc files are not the best way to compose code generation out of things.
Right. And increasingly, people are trying to create and use simple versions that operate more like that.
So in yacc, the inner loop is driven by kind of the theoretical LALR grammar composition.
So there's kind of something you don't fully understand, that you'll get shift-reduce conflicts from, in your inner loop, but that kind of triggers events in this other file.
Things like PEG grammars that people are trying to use more of over time can end up kind of spitting out a C file that is a recursive descent parser, but we'll talk about some of the trade-offs there as well. And again, even though things like PEG grammars are spitting out recursive
descent parser implementations, which is a more modern technique, here we're mostly going to talk
about the hand-rolled recursive descent parsers because the trade-off there is that they work
well and they're understandable when you structure them as code yourself. And really, it just comes
down to a stylized way of structuring your code to make a useful and easily recognized data structure, which is the parse tree that comes out, or a nice error message that can use the context from where the error occurred.
So we will link to articles in the notes that give counterpoints on our view that this is probably just the right way to do it.
But parsing historically was one of these great
crossovers between fundamental computer science and practical applications. So you learn about
it in theory of computation class, that a regular expression has a certain level of power,
and then pushdown automata are like the next level of power that enable you to do things like
match parentheses when you have an open and closed parentheses,
right? You can't write an HTML parser, for example, using just regular expressions. So
parsers are strictly more powerful than what you can do with regular expressions.
And people, as a result of this important theoretical basis, can feel strongly that the underlying theory is important, to the point that you find yourself learning about Chomsky's theories and Cantor diagonalization proofs and things like that. And that is all important and great. CS fundamentals are great. But from a practitioner perspective,
there's also part of it where it says, well, if you just write code in a particular stylized way,
you can accomplish your parsing tasks. You can get done the job that you want to get done. And, if you find it daunting, you may be able to skip over some of those fundamentals
and know how to write this stylized form of code.
Yeah. And it's interesting that you talk about the theory of computer science and parsing,
because maybe 10 years ago or so, I used to work with someone who actually did his PhD
on parsing, right. It sounds silly, because nowadays parsing is a fairly well-known field.
But doing a PhD on that, he significantly advanced the state of the art in computer science in the 70s.
And so it really tells you that computer science itself is a fairly novel field.
There's a lot of stuff that was figured out fairly recently. And it's kind of interesting that, you know, we're in a field where things are so novel that it's possible to know people who invented things that you look at now and go, yeah, that's parsing, it's easy, of course. Yeah, it's pretty interesting. So the basic flow of parsing is that first you'll tokenize the characters, right? So it's a linear stream of characters to linear stream of tokens
transform. And then you make a tree out of those tokens. So scanning and tokenization are kind of
synonyms. But even that fact is interesting, because it brings up questions like is tokenization
context sensitive? And how parallelizable is it with respect to the character stream?
So sometimes you want to split scanning and parsing.
Sometimes you want to join them. It really depends on a variety of things. Scanning tends to have
very little context. So it usually just understands basic delimiters in the text itself.
Whereas parsing tends to understand the structure and the context of the grammar that you're dealing
with. And the parser itself then has a state to understand where you are in a state machine, usually to decode the grammar you're parsing.
And then it generates an output.
And again, that's dependent on what you're doing.
It could be an abstract syntax tree that you output from your parser, but it could be something else.
You might generate code directly if you're kind of a small and efficient C compiler, for example.
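A minimal sketch of that character-stream-to-token-stream step, with made-up token kinds, purely for illustration:

    #include <cctype>
    #include <string>
    #include <string_view>
    #include <vector>

    enum class TokenKind { Number, Identifier, Plus, Star, LParen, RParen, End };

    struct Token {
        TokenKind kind;
        std::string text;
    };

    // A scanner has very little context: it just walks the characters and
    // groups them into tokens based on simple rules and delimiters.
    std::vector<Token> tokenize(std::string_view src) {
        std::vector<Token> out;
        size_t i = 0;
        while (i < src.size()) {
            char c = src[i];
            if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
            if (std::isdigit(static_cast<unsigned char>(c))) {
                size_t start = i;
                while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                out.push_back({TokenKind::Number, std::string(src.substr(start, i - start))});
            } else if (std::isalpha(static_cast<unsigned char>(c))) {
                size_t start = i;
                while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
                out.push_back({TokenKind::Identifier, std::string(src.substr(start, i - start))});
            } else if (c == '+') { out.push_back({TokenKind::Plus, "+"}); ++i; }
            else if (c == '*')   { out.push_back({TokenKind::Star, "*"}); ++i; }
            else if (c == '(')   { out.push_back({TokenKind::LParen, "("}); ++i; }
            else if (c == ')')   { out.push_back({TokenKind::RParen, ")"}); ++i; }
            else ++i;  // a real scanner would report an error here
        }
        out.push_back({TokenKind::End, ""});
        return out;
    }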
So in a recursive descent parser, that state is partly
encoded in the data flow itself, right? So the function calls what's on the stack and so on,
encodes some of the state. And so when I was talking about transforming a recursive descent
parser into more of a state machine, what you end up doing is looking at that state that you had
through recursion and just encoding it, having your own stack to encode that state as a side data
structure.
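A toy sketch of that transformation, not from the episode: nested bracket matching is the classic pushdown example, and instead of letting the call stack of a recursive matcher hold the state, you keep an explicit side stack of what you are currently inside of.

    #include <string_view>
    #include <vector>

    // Iterative version of what a recursive matcher would do: the side stack
    // remembers which opening bracket we are currently nested inside.
    bool bracketsBalanced(std::string_view s) {
        std::vector<char> stack;  // explicit state instead of recursion
        for (char c : s) {
            if (c == '(' || c == '[') {
                stack.push_back(c);
            } else if (c == ')' || c == ']') {
                char open = (c == ')') ? '(' : '[';
                if (stack.empty() || stack.back() != open) return false;
                stack.pop_back();
            }
        }
        return stack.empty();
    }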
Right.
And that's a generally interesting thing in the way that we write sequential programs.
The program counter is effectively a state that marches forward, right?
And then takes jumps and things like this.
And then recursive functions are one particular way to structure your state with things on
the stack and the marching forward of the program
counter. Sounds good. So I think, like we were talking about, the tokens get formed from the
character stream. And so looking ahead at what token you see, ultimately is part of what determines
the grammar production, the grammar rule to apply. So if I'm looking ahead at the character stream,
and I'm holding something in my hand that I was just looking at and I see a plus sign, I know that this is probably
a binary plus operator in many common languages that wants to take the thing I just saw on the
left-hand side and add it with something that I then have to determine on the right-hand side.
And so what happens with these rules is these rules nest. I'll take the plus operator.
I'll say, okay, I must currently be doing a weak arithmetic expression, right? Because times binds more tightly than plus.
Plus is weaker.
And so then I would look at the right-hand side and I would ask, can I parse what I see
after the plus as a strong arithmetic operation, like a times, one that binds more tightly
than the plus that I'm
currently looking at as an operator. And this binding or precedence level determines whether
you nest the next expression at the same level or an increased level, like a tighter binding,
like the times would be with respect to the plus operator. So you set up your grammar rules so that things that nest more tightly
become more interior nodes inside the tree. And you do this using self-contained rules. So you'll
make like a weak arithmetic rule and that one will have a strong arithmetic rule inside on the left
hand side and the right hand side. So you kind of stylize your recursive functions or your BNF grammar to make this association clear.
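Here is a minimal hand-rolled sketch of exactly that structure (illustrative only, not from the episode): the weak rule for plus calls the strong rule for times on both sides, so the tighter-binding operations nest deeper; for brevity it evaluates directly instead of building tree nodes.

    #include <cctype>
    #include <string_view>

    // Grammar sketch, in rough BNF:
    //   weak    ::= strong ('+' strong)*
    //   strong  ::= primary ('*' primary)*
    //   primary ::= digit+ | '(' weak ')'
    struct ExprParser {
        std::string_view src;
        size_t pos = 0;

        char peek() const { return pos < src.size() ? src[pos] : '\0'; }
        void skipSpace() { while (std::isspace(static_cast<unsigned char>(peek()))) ++pos; }

        long parsePrimary() {
            skipSpace();
            if (peek() == '(') {
                ++pos;                      // consume '('
                long v = parseWeak();
                skipSpace();
                if (peek() == ')') ++pos;   // a real parser would diagnose a missing ')'
                return v;
            }
            long v = 0;
            while (std::isdigit(static_cast<unsigned char>(peek()))) v = v * 10 + (src[pos++] - '0');
            return v;
        }

        long parseStrong() {                // '*' binds tighter, so it nests deeper
            long lhs = parsePrimary();
            skipSpace();
            while (peek() == '*') { ++pos; lhs *= parsePrimary(); skipSpace(); }
            return lhs;
        }

        long parseWeak() {                  // '+' is the weaker, outermost rule
            long lhs = parseStrong();
            skipSpace();
            while (peek() == '+') { ++pos; lhs += parseStrong(); skipSpace(); }
            return lhs;
        }
    };

    // Usage: ExprParser{"1 + 2 * (3 + 4)"}.parseWeak() evaluates to 15.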
So in JavaScript land, beyond these kind of basic flows of how grammars work, there are also cool
and fancy things like you would first outline all the functions in a file. So you would say,
there's like hundreds of functions in this file, let me see if I can peek at where the beginning
and end of the functions are, that I can lazily load in the tree that describes the syntax for this particular function. Let's say
I never call a function. Do I want to spend cycles parsing it and doing things with it if I never end
up calling it? But there can also be interesting challenges there because tokenization, like you
were talking about earlier, if you're inside of a regular
expression, you do different things in the scanner than if you're in normal JavaScript code. So you
might switch your scanner to RegEx mode or back into normal JavaScript mode. Or if you find yourself
in the middle of a multi-line comment, there could be arbitrary stuff in there that doesn't do
anything. And so these are some of the interesting challenges in doing really fast or advanced parallelized scanning or tokenization.
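A tiny sketch of that scanner-mode idea (hypothetical, nothing like a real JavaScript scanner): the tokenizer tracks which sub-language it is in, and some modes ignore almost everything.

    #include <cstddef>
    #include <string_view>

    // The scanner carries a mode, because the same characters mean different
    // things inside a regex literal or a block comment than in normal code.
    enum class ScanMode { Normal, RegexLiteral, BlockComment };

    // Example of the BlockComment mode: ignore everything until "*/".
    size_t skipBlockComment(std::string_view src, size_t i) {
        while (i + 1 < src.size()) {
            if (src[i] == '*' && src[i + 1] == '/') return i + 2;  // back to Normal mode
            ++i;
        }
        return src.size();  // unterminated; a real scanner would report an error here
    }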
Yeah, and what's interesting with that as well is when you try to parallelize things, you have to have enough work inside of each parallel worker to do something useful, right?
So it's a classic parallelism concurrency trade-off.
So do you go off and optimize all the functions at this point? So you have a quick parallel parser that figures out where the function starts and ends are.
And then do you fork off work to N threads to quickly optimize all of them, or maybe do some extra amount of parsing or something like that?
Or do you just keep that information around and do that lazily with an interpreter when you end up executing it, and then just-in-time compile it. And then do you
just-in-time compile in the foreground thread or in background threads or something like that
is, again, part of the trade-off that you have to do. Parallelization is a good way to use multiple
cores, but for lower latency page loads, maybe that's not what you want to do. Maybe you want
to block and compile right then and there on the foreground thread and do the work as it comes in
or something like that. So it's always kind of an interesting trade-off to see whether you tie in this case parsing with optimization or other
things like that, some amount of context generation. Yeah, that makes a lot of sense. So to kind of
pull back a little bit from the advanced techniques that we're talking about and go back more to the
fundamentals of parsing, one of the very important aspects of parsing is this concept of backtracking.
So just like you speculatively might parallelize for parsing JavaScript on a page or something
like this, in the parsing process itself, when there's something ambiguous to do, you see a
less than operator, and you're not sure if that's starting a template instantiation in C++, or if it's actually doing a less than operation, right? When you have that open angle, it could
mean one of two things. So you have this ambiguous choice. And backtracking basically means capturing
your current point in the process of parsing, and being able to revisit that choice and make a different choice.
So you might speculate, okay, I think that this open angle bracket is going to be an actual less than operator. And you might go try to parse as if that's the case.
And if you find out that that was an error, then you might go back, restore your original context,
and then go back down the other path because you didn't know which one it was going to be. And maybe there was no way for you to look ahead in the token stream to know
which one you should definitely go down. So some languages can be more regular and require
less backtracking or no backtracking. And there's a whole description of what languages can have
this property in the theory. And it's an important concept that gets introduced
when your grammar isn't fully determined by lookahead tokens.
So this is a property that many of the real-world languages we use in computers do have.
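A minimal sketch of that capture-and-restore idea in a hand-rolled parser (illustrative only): the current point is just a cursor, so backtracking is saving and restoring it.

    #include <cstddef>
    #include <functional>

    struct ParseState {
        size_t tokenIndex = 0;  // the "current point" in the token stream
    };

    // Try an ambiguous alternative; if it fails, restore the saved point so the
    // caller can try the other interpretation.
    bool tryParse(ParseState& state, const std::function<bool(ParseState&)>& alternative) {
        ParseState saved = state;              // capture the current point
        if (alternative(state)) return true;   // committed: keep whatever was consumed
        state = saved;                         // backtrack: undo everything the attempt consumed
        return false;
    }

    // Usage sketch (hypothetical functions), for the '<' ambiguity in a C++-like language:
    // if (!tryParse(s, parseTemplateArgumentList)) { parseLessThanComparison(s); }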
So Rust also relatedly has this TurboFish operator
that helps resolve the ambiguity I mentioned in C++.
So instead of seeing the open angle symbol and saying, oh, maybe that's
a template instantiation, there's an explicit operator, which is double colon open angle,
which says to the parser, the thing on the left-hand side of this operator is a type name.
I'm telling you that it is. And so that's kind of an interesting solution for making it less ambiguous, but at the cost of things having more punctuation in the lines.
And Haskell also has the ability for users to create whole new infix operators, which is cool.
You can extend the grammar with the Kirby operator.
Actually, I think you can't create the Kirby operator.
I think I tried to do that one time.
But basically, arbitrary punctuation can become new infix operators.
And things like Camlp4, I know, were in the history of the Rust macro creation, where
things were extensible in user space so that the syntax could be extended and you could
build new parse trees and constructs.
So for backtracking in general, but also in parsing, you have these decision points where
you may have to throw away work that you do in some subtree and then go on to make a different
choice until you figure out what the actual subtree was.
So that's similar to all transactional things.
You make a continuation point that you might choose to revisit until you decide that, okay,
I actually finished and I never need to revisit, in which case you kind of commit that and you pass your point of no return.
Yeah, and as you talk about backtracking, what it reminds me is designing a parser is something
that you want to do, ideally, while you design the grammar that you're going to parse itself,
right? A lot of grammars are designed in a way that ought to be easy to parse. And some of the
thinking is, it's not just easy for the programmer to write a parser, it's also easy for a human to understand what's written,
right? So in C++, you were talking about the less than operator, those familiar with C++ in a
template context, you end up having to specify typename or template. So the typename and template keywords have a few different meanings, and in a template context they disambiguate certain things.
And so when C++ was designed, its grammar was realized to be quite complex. And those keywords
are there to help disambiguate some of the things, not just for parsers, but in some cases for humans
as well. Right. I remember when they were creating Go, that this was a big focus, that they wanted
to have a grammar that was not just for humans, but also easy to parse. And
so they ended up with things like types following identifiers as a result of this.
Yeah, yeah. And the basic thing that we're burying in talking primarily about recursive descent parsers is whether parse trees should grow from the leaves to the root or vice versa. If you look at a graph, like the leaves at the very bottom of the graph versus up towards the root at the top, the joke here is that real trees probably
have a strong preference in which direction things grow. But in computer science, you get to choose
how trees grow. So it's kind of interesting. And in LL and recursive descent, that's usually top
down, right? So it starts at the root, whereas LALR and
bottom-up parsing starts at the leaves, so the terminals of your grammar. And there's a trade-off
in choosing each of these based on what it is that you're parsing, what's the language, what's the
grammar that you're looking at. Sometimes recursing, as it's called, from the left versus from the right,
will have more backtracking and be less efficient in those circumstances. So again, designing your
grammar with the parser makes trade-offs in how things are efficient and how you implement the
parsing. But ultimately, it comes down to how ambiguities are resolved and when they're detected
during the parsing process. So shift-reduce conflicts are notorious here, and they come
from ambiguities in the bottom-up merging process.
And what's neat about recursive descent parsers is that you're really just writing code, so you resolve these types of conflicts easily with the code that you write.
That's one of the things I really like about doing that.
If you're not really careful with a recursive descent parser, you don't even notice you're doing it.
You're just like, oh, there's an ambiguity here. I'll write code to just handle it. And it just kind of happens.
It falls out of the writing process. But what's bad about recursive descent parsers is the same
thing that's good about it. You're just writing code. So you can do whatever. Sometimes that's
nice. Sometimes it's a great way to make a mess, right? So it's a very powerful tool. And with
great power comes great responsibility.
Right. You could have someone who said, the grammar parses this way when it's a Wednesday after 4 p.m. in this time zone. But that would be a pretty terrible parser to write,
even though it would be possible in a recursive descent parser. What the structure of these
declarative things is forcing you to do is say, oh no, it follows this regular structure, and I have ways of resolving ambiguities that I specify in my specification up front.
Okay, so to go back to what we were talking about at the beginning of the trade-off between
parser generators and hand-rolled recursive descent parsers, obviously there's a bunch
of trade-offs involved. There's no way to get away from the language of trade-offs, right? So some of the things that you might consider are performance.
So when your parser generator creates something that has a certain performance in parsing,
what are your tools and techniques for getting more performance out of the system? How do you
parse more megabytes of text per second into your data structure or something like this? And again, when you have the code in your hand with a recursive descent parser,
we can use a lot of our conventional techniques for optimization, whereas restructuring the grammar
or perhaps providing hints to the parser generator are the ways that you might do it in a generated
scenario. When you're writing code, you kind of have more control over memory usage
and how things get structured.
Like if you have peak memory limits
that you want to stay within,
or if you have some kind of divide and conquer
that you might be able to do.
These things, again, when you have the code in your hand,
it kind of gives you the control and the power,
but at the cost of some additional code writing
and extra work for you.
Error messages are a pretty important point that I think we should talk about more.
So things like inline diagnostics.
So let's say you get part of the way through parsing your program,
but then you find that there's some problem.
Inline diagnostics on a partially completed parse process are actually very important, to be able to feed back to the user:
I got to this point, but then I saw that this thing in particular went wrong.
Yeah, and tracking states like this as you parse in kind of a sidecar for diagnostic purposes is
a really powerful tool in recursive descent parsers. So if you look at Clang, the C++ compiler,
it always tracks potential errors or diagnostics in a side state while it's
parsing. It's always keeping track of a bunch of things. And if it encounters a problem, it'll
print it out if the diagnostic is enabled. Otherwise, it'll just kind of keep going.
And this also allows it to print out diagnostic context more easily because it keeps track of
context while it's doing the parsing. And also diagnostic information itself can lead to error correction as well.
So to keep making progress, right?
You see a problem, you want to keep parsing afterwards, to not just give one diagnostic and give up.
Some languages want to do that, right?
Give you potentially useful errors, not just the first one in your compiler invocation.
Particularly if the compile time is slow, you want to have a handful of errors that you can address,
not just the first one to force recompilation.
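A minimal sketch of that pattern, not Clang's actual machinery: diagnostics accumulate in a side structure, and after an error the parser skips ahead to a synchronization point such as a semicolon so it can keep going and report more than one problem.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Diagnostic {
        size_t line = 0;
        size_t column = 0;
        std::string message;
    };

    // Side state carried along while parsing; problems are recorded here and
    // printed later, instead of aborting at the first error.
    struct ParserDiagnostics {
        std::vector<Diagnostic> diags;
        void report(size_t line, size_t column, std::string message) {
            diags.push_back({line, column, std::move(message)});
        }
    };

    // A tiny stand-in token type for the sketch.
    enum class TokKind { Semicolon, Other };
    struct Tok { TokKind kind; size_t line; size_t column; };

    // "Panic mode" recovery: after reporting an error, skip tokens until a
    // likely statement boundary so parsing can continue and surface more errors.
    size_t synchronizeToSemicolon(const std::vector<Tok>& tokens, size_t i) {
        while (i < tokens.size() && tokens[i].kind != TokKind::Semicolon) ++i;
        return i < tokens.size() ? i + 1 : i;  // resume just after the ';'
    }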
And it's kind of similar to the lax mode in HTML quirks parsing.
If you know the history of HTML, there's been these kind of back and forths with XHTML and HTML5 and the previous versions, where people were very religious about, HTML has to parse exactly correctly and on any small error you shouldn't render a web page, versus, you know, HTML should be parsable kind of whatever. And if you look at HTML parsers nowadays, the ones in browsers, they end up doing very interesting things in quirks mode to be able to handle slightly misformed HTML that happens to work.
Right. And I think the ability to handle error messages and error cases very well using partially constructed information does end up hurting the ability to modularize a little bit
because you do want to kind of reach into the surrounding context of what didn't finish yet,
how much of it did get done. And so that can often prevent you from creating little, very easily pluggable modules.
And we'll talk a little bit more about that.
I have seen arguments against recursive descent parsers in the sense that what if your grammar
is changing a lot?
It could make you do significant changes to that recursive descent parser code that you
wrote.
But in my personal experience, at least, things evolve structure over time in a way that can
pretty easily get composed function-wise.
So these functions are defining productions or helpers inside of your grammar.
And usually productions don't disappear entirely or get like 100% reworked in the language
versus getting tweaked and/or refactored. So declarative specifications like BNF, Backus-Naur Form, which we never defined as a term, are good, but there's also
some trade-offs around driving everything off of them, right? So if you want to have multiple
versions of your language, for example, let's say you want to be able to ingest version one,
make some internal AST changes, and then
format out version two using a formatting methodology like Clang Format or something
like this. It's nice to be able to have simultaneous parsers and formatters there
with common subroutines between them, right? Even though they're different subversions of
the language, which may have differing BNF, right? And so on the point of modularity,
there's also parser combinators, which are really cool. You can kind of make these little functions
that have the same API for doing things, right? This is the general idea of combinators. They're
kind of Lego bricks that you can plug together because they have a uniform interface. And the
challenge with those is because they have a uniform interface, they
can't pass sideband information between each other very easily, right? Because they all kind of take
a stream of tokens in, say, and produce some subfragment of the AST out. So handling that
global context of error messages and things can be more difficult. How does that sideband information or how does that global scope come into play? And although combinators can be difficult for getting the best error message
experience in my experience, it's really nice when the user experience is less important because you
can make them very quickly in the same sense as parser generators. And there was one cool case study where,
because formatting is the inverse of parsing,
you can make one combinator that both parses
and then formats out its input.
And we made some of those for TPU assembly, for example.
So you can create like a little environment of data
that came from parsing the assembly,
and you can use that same environment
to format out the assembly as a result of that.
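For readers who have not seen parser combinators, here is a minimal sketch of the Lego-brick idea (illustrative, not the TPU assembly code): every parser has the same shape, so generic functions can glue them together.

    #include <functional>
    #include <optional>
    #include <string_view>

    // A parser "brick": it takes a cursor, consumes input on success, and
    // returns the parsed value, or nullopt on failure.
    template <typename T>
    using Parser = std::function<std::optional<T>(std::string_view&)>;

    // Match a literal string.
    inline Parser<std::string_view> lit(std::string_view s) {
        return [s](std::string_view& in) -> std::optional<std::string_view> {
            if (in.substr(0, s.size()) == s) {
                in.remove_prefix(s.size());
                return s;
            }
            return std::nullopt;
        };
    }

    // Alternation: try p1; if it fails, rewind and try p2. Because every brick
    // has the same interface, this works for any parsers you hand it.
    template <typename T>
    Parser<T> alt(Parser<T> p1, Parser<T> p2) {
        return [p1, p2](std::string_view& in) {
            std::string_view saved = in;
            if (auto r = p1(in)) return r;
            in = saved;  // restore the cursor before trying the other alternative
            return p2(in);
        };
    }

    // Usage sketch: Parser<std::string_view> yesOrNo = alt<std::string_view>(lit("yes"), lit("no"));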
So it's nice that these functions can be easily inverted when you plug them together in this combinator sort of way. Yeah. And so you and I are engineers and we like to get our hands dirty. And the way I like to describe this, as the saying goes, is it's like wrestling in the mud with a pig: at some point you realize that the pig likes it. And we like to get our hands dirty with code. So let's talk about a few times we've got our hands dirty with parsers.
First example that comes to mind is you and I used to work on a database query language
in the style of array programming languages. And in that case, it was parsed right to left.
So you make a right-leaning tree instead of the left-leaning one, meaning that the outermost
expression was on the right-hand side of the screen.
So that was an interesting kind of code base.
Yeah, that was definitely the first time that I had seen this idea
that operators would apply from right to left,
but it made for a pretty interesting parser.
Yeah, there's also features that were somewhat notorious
that we got to get some exposure to in JavaScript,
like its automatic semicolon insertion, which had some funny grammatical properties.
So normal BNF grammars, they don't have this notion of kind of the negative lookahead in a regular expression.
You might have seen something like that: there cannot be a line terminator here, as a terminal. But that's how automatic semicolon
insertion was described in the JavaScript spec.
And so in Firefox SpiderMonkey, we had a recursive descent parser.
So it was very clear how that kind of production rule extension could be handled.
But I think JavaScriptCore actually had a lex- and yacc-based parser. And I'm not exactly sure how it handled things like this negative token lookahead. But that would actually be interesting to go back and find out: how did it deal with this extension to the BNF grammar that was in the JavaScript spec. Yeah, automatic semicolon insertion is definitely one of those weird quirks of JavaScript that nobody ends up relying on.
But it makes it a very interesting language to parse just because if there should have been a semicolon here, then just the parser is supposed to put it there.
So it's kind of interesting, kind of an interesting accident.
I think it's more if it would be an error and putting a semicolon there would make it not an error,
then you should put a semicolon there.
I think that's the rule.
Yeah.
Yeah.
And so when I worked on Chrome's native client, that was a security sandbox.
The parsing had to be correct.
Otherwise, it was a potential security vulnerability, right? So the idea of Native Client back when it started in 2009 (there's a really great paper describing it) was, you know, let's try to run native code on the web. Because at the time, they'd written V8,
and they realized there's kind of a wall you hit with JavaScript, regardless of how much your
compiler can optimize things for JavaScript. And you kind of want to run more powerful things on
the web. Native Client isn't what ended up being standardized, right? WebAssembly was. But the
core idea of running native code on the web was really interesting.
And there were different experiments, Native Client, Portable Native Client.
At Mozilla, there was asm.js, and those together ended up creating WebAssembly.
And in all those contexts, the parser is really, really critical to get right.
If you don't get it right, there's a vulnerability. But when I worked on Native Client, the input was x86-32, x86-64, and ARM.
So there's really tricky things here, and parsing those ISAs correctly is important to security because then the CPU is just going to execute it.
So if you haven't validated the executable correctly, something else is going to get executed, which might be a vulnerability.
So one fun bug that the team had to resolve early, way before launch, is they were dealing with pushf and popf instructions in x86.
And those tend to just push the flags, right?
So overflow and flags like that.
But because of the way x86 is made, that also has interesting flags
that nobody really uses. So for example, there's a flag called the direction flag, which you can
save and restore that sets the direction of rep prefix instructions. So rep is the repeat prefix,
and that ends up being used in things like memcpy. And so you can change the direction flag,
and that makes memcpy go backwards, which is kind of interesting, but a
great way to write vulnerabilities. If you're trying to bounds check things, you see a rep
instruction, you know it's going to go, you know, in increasing addresses. Well, if you set the
flags differently, it goes backwards. So that will allow you to go out of bounds in certain ways and
create an exploit chain. Another interesting part of parsing that we had to get right was in ARMv7.
There's interesting features in ARMv7.
One of them is you can change endianness, right?
So you can change the endianness of the CPU,
or you can change the instructions to a different instruction set.
So ARM has something called ARM32, as well as Thumb and Thumb2.
And you can change from one to the other during execution.
And so obviously, if you see those instructions that change endianness or the change instruction set,
one thing you can do is to just disallow them. Because if you try to handle them,
it becomes very, very tricky to do that very well. So it's a giant context change for a parser. It's
not something we wanted to allow. So we just disallowed it in the parser. There's other
interesting things like Jazelle and other stuff in ARM that we won't talk about. And there's other cool stuff that came out
of the NACL team, the native client team. So at some point, the team wrote an auto-generated x86
parser. They had a description of all of x86 that we wanted to handle, and they auto-generated the
parser from that description. And the code that they generated was so big that it was, at the
time, the biggest set of functions inside of Chrome.
And it blew up Visual Studio really bad.
Visual Studio has some behavior where it just gives up and generates really terrible code.
And so the Chrome parser was extremely slow until they figured out how to tune the code generator to split up what it generated so that Visual Studio wouldn't choke on it and generate terrible code.
So that was kind of interesting as well.
Yeah, that is an important point to bring up as one of the potential pitfalls in recursive
descent parsing.
So because you have these production rules for your functions, your functions represent
production rules, and they're all co-recursive.
They're all calling into each other.
And then some production rules will be hotter than others. Some will be
smaller handling code that gets inlined into kind of outer sites that are trying to implement parts
of the grammar. So sometimes you can get these giant mega frames where, you know, there's one
function that does some grammatical production and it ends up with tons and tons of stack data
whenever you do call this particular function.
And so you can have giant frames
that end up leaping over guard pages, or other terrible things, or don't cause the OS to page in the next stack page appropriately.
So that's something you have to watch out for
is these giant stack frames or stack overflows
that cause segmentation faults.
Yeah. And another cool parser that I worked on: I wrote a good chunk of the parser for WebAssembly inside of WebKit, and the testing and things.
So I wrote kind of the core of it and then everyone else kind of wrote the rest of the
parsing.
And that is also security sensitive, right?
WebAssembly is kind of an interesting kind of stack machine, little language, but it's all byte oriented when you parse it. And it's got to be
secure. If you don't parse it properly, you end up having vulnerabilities. And you also, in those
cases, want to have good error messages, because it's a bunch of bytes that you push in. And so
for a developer to be able to look at the error messages with WebAssembly is really important.
So what I liked doing to solve the parsing of WebAssembly and WebKit was using a recursive
descent parser. But what was neat is I used the C++ std::expected type to return either a result or an error. And what was nice there, in the recursive descent parser with std::expected, is it's basically impossible to ignore the errors when you do this. If you set it in a
mode where like any kind of access of the wrong thing is going to fail, where you can't disregard
the return value of something, it makes it almost correct by construction when you write the parser
or not necessarily correct by construction, because you still have to treat the information
correctly, right? But it makes it much easier to do the right thing and harder to do
the wrong thing. And if you compose it well, it's hard to get things like integer overflows or
out-of-bounds accesses or things like that when you're doing the parsing because it's just kind
of constructed out of basic parts that you put together. So it ends up being interesting to
write a security-sensitive parser with kind of modern tools that way.
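A sketch of that shape, not WebKit's actual code, just the pattern: each small decode routine returns a std::expected (C++23), so the error is part of the type and cannot be silently dropped. The LEB128-style integer decoding here is a common WebAssembly-ish example, not the real implementation.

    #include <cstddef>
    #include <cstdint>
    #include <expected>   // C++23
    #include <span>

    enum class WasmParseError { UnexpectedEnd, ValueTooLarge };

    // Decode a LEB128-style unsigned 32-bit integer from a byte stream.
    // The error is part of the return type, so callers cannot ignore it silently.
    std::expected<uint32_t, WasmParseError> readVarU32(std::span<const uint8_t> bytes, size_t& offset) {
        uint32_t result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            if (offset >= bytes.size()) return std::unexpected(WasmParseError::UnexpectedEnd);
            uint8_t byte = bytes[offset++];
            result |= static_cast<uint32_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) return result;
        }
        return std::unexpected(WasmParseError::ValueTooLarge);
    }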
Yeah, that's a nice example story.
I like standard expected, and I hope that there's also a question mark operator for
it, like in Rust, that we can use one day.
So if there are C++ committee people listening.
There isn't for now, but we added the monadic operators to it for C++23, the same way we added them for std::optional. So there's these nice kind of monadic operators that are coming in for C++23 that kind of augment both optional and expected,
but doesn't quite have the same power as the question mark operator. There's a separate
proposal about the question mark operator or something like it, but it's somewhat novel to
C++. And so I'm not sure it'll make it to C++ 26. We'll see. Yeah, having to write if for every function call that can fail is just terrible. But that's just
my personal opinion.
What's nice about it is, in C++, you can write if, you know, auto something equals an expected, and then use the expected result in the if and the unexpected one in the else, right? So the scoping of the variable: you create that variable inside of the parentheses of the if, use the value result in the if block, and then use the error result in the else block. So that's kind of nice. It's a neat little syntactic thing in C++.
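Concretely, that if-with-initializer pattern looks something like this minimal sketch, with a hypothetical parse function:

    #include <expected>
    #include <string>

    struct Module {};
    enum class ParseError { BadMagic, UnexpectedEnd };

    // Hypothetical parse function; a trivial stand-in body so the sketch is self-contained.
    std::expected<Module, ParseError> parseModule(const std::string& bytes) {
        if (bytes.empty()) return std::unexpected(ParseError::UnexpectedEnd);
        return Module{};
    }

    void example(const std::string& bytes) {
        // The variable lives in the if/else scope: value in the if block, error in the else block.
        if (auto result = parseModule(bytes)) {
            Module& m = *result;            // success path
            (void)m;
        } else {
            ParseError e = result.error();  // failure path
            (void)e;
        }
    }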
Yeah, one other story that I think is relevant: the team I'm currently on collaborates on a project that parses SystemVerilog using ANTLR v4, I think is the latest, which I think is the largest ANTLR grammar as
well. And we've been talking a little bit about the trade-offs we see there, and folks on my team
have been doing performance optimization work, which, as we were talking about, sometimes means
restructuring the grammar so that the generator can do better with it. And, you know, there's all these same interesting questions that come up around
memory usage, and you get a lot of power from the grammar generating. That would be a lot of work to
write in a recursive descent parser form, especially as the grammar gets large,
but there's clearly a trade-off space there. So it's been fun to also kind of relive this
understanding of the trade-offs. One last thing I think we wanted to touch on was the difference between parse trees and ASTs. So
we kind of threw those terms around a little bit here. But generally, I think that's not
formally defined, but usually a parse tree will be kind of the full tree that comes out of the
grammar. Like the grammar rules literally have a node for every rule inside of the grammar, whereas an AST will be kind of a more reduced form. So if I have a bunch of adds strung together and they're all at the same precedence level, if I make an n-ary node that has all the operands to the string of adds, that's not strictly the original grammar, but it's an equivalent, simplified representation. So I think that's usually how people tend to talk about ASTs versus parse trees. Yeah, well, cool. I think that's a good amount of
knowledge about parsers that we've talked about today. So how about we adjourn and meet soon for
another deep dive into an interesting geeky topic? Sounds great. Great catching up with you again.
Yeah, same here. All right. See you later. See you, JF.