TLB Hit 💥 - Episode 6: Fuzzing

Episode Date: December 15, 2024

In this episode, Chris and JF dive into the world of code coverage and fuzzing. They explore why coverage matters, the types of coverage metrics, and how fuzzing helps uncover unexpected software behavior. From the practical trade-offs of MC/DC testing to the randomness of AFL’s mutators, the hosts discuss techniques, tools, and the balance between robustness and resiliency. Tune in for insights, laughs, and maybe even a recommendation to "go fuzz yourself!"

Transcript
Starting point is 00:00:00 All right, welcome to our weekly episode of TLB Hit. So let's start with some levity. The software tester walks into a bar and orders a beer, then orders zero beers. Then the engineer orders 2,147,483,647 beers. Then they order a lizard, then minus one beers, then qwertyuiop beers. Finally, the real customer walks in and asks where the bathroom is.
Starting point is 00:00:31 The bar bursts into flames. Now, what's this got to do with anything? Well, we want to talk about code coverage and fuzzing. And this joke is great because it illustrates what happens when you, presumably a human, test what you think needs to be tested. The problem we thinking-meat creatures have is that we don't always know where failures might be. That is where metrics like coverage are useful. As a human, we might think that testing beer orders is where a bar might fall over. But as the saying goes, it's hard
Starting point is 00:01:02 to make things foolproof because fools are so ingenious. Or in this case, the loo is where the failure was. Anyhow, I heard that needing to explain jokes is what makes them good. So there you go. So what the engineer is poking at in the bar is in some ways the coverage of what the bar can handle. So let's talk about coverage. What kind of coverage is there?
Starting point is 00:01:21 We think about a program that we're trying to test and we ask, did this thing run for a bunch of different things? There's lines, whether a particular line of code executed, there's branches, which can be taken or not taken. We can ask whether loops were executed at least once. These are all similar to kind of basic block execution counts. For functions, we can ask whether functions were called.
Starting point is 00:01:44 Did we get to the function entry point? This can also be more interesting for inline functions. Ultimately, the basic blocks executing in a sequence form a path. What was the path taken through the basic blocks? The sequence of program counters that we were jumping to. Ultimately, there's a big state space in memory. For some piece of state, what values did we see it take on? And this is similar for value coverage: for any given value in the program, what are all of the concrete values we observed for it? There's also a categorization called MCDC, modified condition slash decision coverage.
Starting point is 00:02:19 Not to be confused with the iconic rock group ACDC. And then there's human-injected conditions of interest, which are like cover points. Those can explicitly help guide what we're trying to hit or look for when we're running the program. All right, so why do we even care about coverage? Well, ultimately, it's all about quality assurance. Is the thing we're making, the software, any good?
Starting point is 00:02:41 Does it do what we think it does? I don't know about you, but I don't like it when software just doesn't work. And so although that sounds easy and makes sense, the details are really tricky. What does it actually mean to cover code? So for example, with template instantiations, is one instance enough? What does it mean to cover a template? Do I have to cover all the types, or is covering just one type fine? For type traits, does it even make sense to have coverage, since it's kind of meta code or something like that? And with constexpr, constant expressions,
Starting point is 00:03:11 if it all executes at compile time, can you say the code was really covered? And this is getting to the theoretical what are we computing and how aspect. So there's a way to slice this here where we have pure functions, which just take values in and produce values out. And then there's functions that execute with some state
Starting point is 00:03:31 or some context in the background. And things like constexpr are really pushing on partial evaluation. If this program was partially evaluated at compile time versus if it ran at runtime, what does it mean in terms of what we ultimately covered? The constexpr case is more similar to pure functions that take values in and spit values out. There are limited side effects possible, and all the state is available at constexpr evaluation time. In C++, constexpr used to be totally pure, but now we have these things
Starting point is 00:04:00 like transient allocations in C++, so it got more interesting. The full set of all possibilities is combinatorial for general programs. So just for perspective, even exhaustively verifying a function with a 64-bit input space by executing it is pretty intractable. 32 bits you can do in a few minutes, but 64 bits, now you're really in trouble. Some programs like safety related ones might try to have less branchiness by construction, but sometimes it's hard to have less branchiness. Sometimes the business logic just is what it is. Yeah. And so we talked earlier about MCDC, modified condition slash decision coverage.
Starting point is 00:04:39 What is that? Well, it's a type of coverage that's often used in functional safety software, as you were mentioning. So for example, in aviation, they have DO-178, or in automotive, there's ISO 26262. And the quick and dirty explanation is that you make sure every Boolean condition in each decision has been shown to independently affect that decision's outcome. And you can have tools that look for MCDC coverage and give you metrics on that. And we talked about combinatorial conditions earlier. Doesn't MCDC just run into this combinatorial problem? Well, basically, yeah. And the argument is that for functional safety programs, you shouldn't have
Starting point is 00:05:15 combinatorial space at all, because that's difficult to ensure the safety of. And this also gets to, we've talked about what coverage is and what we'd like to know about it, but how is it implemented? How do we figure out what was covered? So you effectively do a static instrumentation of the program that you're interested in. And then when that dynamically executes, you gather up the statistics that your instrumentation creates. So we're going to insert things like probe points into the program, and we're going to observe what gets triggered and what happens. Or you can also do dynamic sampling, where you run the program in its normal flow, but you just interrupt it every so often to see where it was and what it must have run in order
Starting point is 00:05:54 to get there. And that's recording your data as you interrupt it. Yeah. And so you set coverage goals of how much coverage you want to meet. Can you even do 100% coverage? And how do you do it? Do you want to do it with, say, unit tests? Or should you just do coverage in system tests? That's kind of difficult to choose. And there's really good writing online by the SQLite authors. We'll put a link in the show notes.
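Before getting to that, here is roughly what the static-instrumentation route looks like in practice with clang's source-based coverage. This is a sketch for illustration; the flags assume a reasonably recent LLVM toolchain, and the file and function names are made up.

// cov_demo.cc: a tiny function to measure line and branch coverage on.
//
// Build with instrumentation, run it, then render a report:
//   clang++ -O0 -g -fprofile-instr-generate -fcoverage-mapping cov_demo.cc -o cov_demo
//   LLVM_PROFILE_FILE=cov_demo.profraw ./cov_demo 7
//   llvm-profdata merge -sparse cov_demo.profraw -o cov_demo.profdata
//   llvm-cov report ./cov_demo -instr-profile=cov_demo.profdata
//   llvm-cov show ./cov_demo -instr-profile=cov_demo.profdata
//
// The report shows which of the lines and branches below were actually hit.
#include <cstdlib>
#include <iostream>

int classify(int x) {
  if (x < 0) return -1;  // only covered if we ever pass a negative input
  if (x == 0) return 0;  // only covered if we ever pass zero
  return 1;              // covered by the run above
}

int main(int argc, char** argv) {
  int x = argc > 1 ? std::atoi(argv[1]) : 1;
  std::cout << classify(x) << "\n";
  return 0;
}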
Starting point is 00:06:19 And the SQLite authors have this to say on coverage. They have an article, and in it you can look for the tension between fuzz testing and 100% MCDC testing. And the short of it is they basically say that they started with 100% MCDC coverage to ensure robustness of SQLite. But they found that doing that,
Starting point is 00:06:36 having 100% MCDC coverage, discouraged defensive programming and discouraged having the unreachable code that you usually write in defensive programming, the kind of programming that lets you defend against changes to the code over time. And for a while, they had been doing 100% MCDC coverage, but then they started fuzzing when fuzzing became a thing. And lo and behold, they found vulnerabilities. How come? Well, they discovered that MCDC and fuzzing are kind of at odds with each
Starting point is 00:07:06 other. There's a balance between ensuring robustness in normal code versus resiliency against attackers. But they're kind of also complementary, because having high MCDC helps ensure that when a bug fix comes in due to a fuzzing report, that bug fix doesn't itself add another bug. And so this comparison is a good segue to fuzzing. So fuzzing is basically a million mecha monkeys all trying to put inputs into your program so that they can observe the coverage that comes out and see what happens when they put those inputs in. So there's two kinds of fuzzing that are most commonly used. There's generative fuzzing and there's mutation-based, coverage-guided fuzzing.
Starting point is 00:07:48 So generative fuzzing is generating things that you know fit a particular pattern to pass them into this program. So it's very similar to defining a parser grammar like we talked about in a previous episode, but we run it in reverse. Instead of building a tree out of tokens, we take a tree definition and expand it until at the leaves we have all the tokens that we want to feed into the program. So these nodes expand in their tree form until you hit the leaves, which are also known as terminals, and then you spit out what was created and feed it into the program. So generative systems
Starting point is 00:08:21 are similar. You may have heard of Haskell's QuickCheck or the Hypothesis library in Python, and these are some of the key examples of generative fuzzing tools. Mutation-based coverage-guided fuzzing is done through libraries and tools that you may have heard of, like AFL and libFuzzer. So these tools actually keep a corpus, a whole set of examples of stimulus
Starting point is 00:08:42 that they feed to a program, and they mutate those to randomly try to increase the coverage metric. So they prune out redundant entries, things that are covering the same thing, take the interesting ones and perform a genetic-style algorithm to cross them over and mutate them. And this is tapping into that coverage instrumentation we mentioned. They want to know, did I do a good job when I changed this? So as it goes down a new path, it finds problems and adds that input to the set of things
Starting point is 00:09:10 that created interesting input space. And this works very scalably because it's random and embarrassingly parallel. So we just need to exchange coverage findings periodically between different things that are exploring parallel paths, to avoid too much redundant work from happening. Yeah, and you mentioned libFuzzer. It has an interesting mode tuned for people using protobufs in particular. It takes an arbitrary proto with fields and can do some
Starting point is 00:09:35 generative expansion, so defining an action space for the mutation. And the protobuf contents are the thing that gets fuzzed, based on the proto schema. And it's nice because then you don't have to teach the fuzzer about your APIs, because sometimes folks have protobufs as their API for everything. And that's convenient. And related to that, Google has cluster-level fuzzing for things like open source projects. And there are industry talks on YouTube about OSS-Fuzz, which we'll link to in the show notes. And it's really fascinating that you can take this interface, like just give me stuff, and coverage-guided fuzzers can find the interesting things from nothing,
Starting point is 00:10:11 from just the interface, and find traps and signals and AddressSanitizer types of violations. Yeah, and again, back to how they do that, they're getting the information from instrumentation that they put into the program. So they want to know what branches get taken. They want to know what values might have caused you to go down a particular direction. And they use that to change the inputs to the program, trying to figure out what changes
Starting point is 00:10:35 that they make on the inputs will cause them to get further into the program execution through the branches to the really interesting paths where those violations and traps and signals might occur. So what is fuzzing even looking for here? When we turn on sanitizers, they can look for things like undefined behavior, use-after-frees, crashes, terminations, and all these kinds of things that indicate
Starting point is 00:10:57 something didn't go quite right in your program and probably you didn't want that to happen. So fuzzing goes great with sanitizers because it increases the set of things you can find instead of just kind of silently trucking on in your program execution. Sometimes they also find things you don't care about as much,
Starting point is 00:11:13 like your stack might grow overly large in recursive programs, or integers might overflow innocuously, and you have to kind of develop techniques to prune out those things that you don't care about as much. Helpfully, these toolkits also provide tools like minimizers, where if the fuzzer finds a stimulus that triggers a failure, it can try to prune down that input and still take a similar path to the same failure. This is similar to the delta debugging techniques we've talked about before, where it can minimize test cases with something being held constant, like the failure that occurred.
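Pulling the last few points together, here is a toy, in-process sketch of the mutate-run-keep loop and a greedy minimizer. Everything in it (run_target, the coverage hash, the crash flag) is made up for illustration; real engines like AFL and libFuzzer are far more sophisticated, but the shape is the same: mutate something from the corpus, keep it if it reaches new coverage, and shrink a failing input while the failure still reproduces.

#include <cstdint>
#include <random>
#include <set>
#include <utility>
#include <vector>

using Input = std::vector<uint8_t>;

struct Outcome {
  uint64_t coverage_hash;  // stand-in for "which edges did this input hit"
  bool crashed;            // e.g. a sanitizer report or an abort()
};

// Toy stand-in for the program under test: "crashes" when the input starts
// with 'F','U','Z', and reports a hash of the 3-byte prefix as "coverage".
Outcome run_target(const Input& in) {
  uint64_t h = 1469598103934665603ull;  // FNV-1a over the prefix
  for (size_t i = 0; i < in.size() && i < 3; ++i) h = (h ^ in[i]) * 1099511628211ull;
  bool crash = in.size() >= 3 && in[0] == 'F' && in[1] == 'U' && in[2] == 'Z';
  return {h, crash};
}

// The "dumb" mutators: overwrite a random byte, sometimes append one.
Input mutate(Input in, std::mt19937& rng) {
  std::uniform_int_distribution<int> byte(0, 255);
  if (!in.empty()) {
    std::uniform_int_distribution<size_t> pos(0, in.size() - 1);
    in[pos(rng)] = static_cast<uint8_t>(byte(rng));
  }
  if (byte(rng) < 32) in.push_back(static_cast<uint8_t>(byte(rng)));
  return in;
}

// Keep a mutated input only if it exercised coverage we have not seen before.
void fuzz(std::vector<Input> corpus, int iterations) {
  std::mt19937 rng(0);
  std::set<uint64_t> seen;
  for (const Input& in : corpus) seen.insert(run_target(in).coverage_hash);
  for (int i = 0; i < iterations; ++i) {
    std::uniform_int_distribution<size_t> pick(0, corpus.size() - 1);
    Input candidate = mutate(corpus[pick(rng)], rng);
    Outcome out = run_target(candidate);
    if (out.crashed) { /* save candidate as a reproducer, then minimize it */ }
    if (seen.insert(out.coverage_hash).second) corpus.push_back(candidate);
  }
}

// Greedy minimizer: keep chopping chunks out while the failure still happens.
Input minimize(Input in) {
  for (size_t chunk = in.size() / 2; chunk >= 1; chunk /= 2) {
    for (size_t i = 0; i + chunk <= in.size();) {
      Input smaller(in.begin(), in.begin() + i);
      smaller.insert(smaller.end(), in.begin() + i + chunk, in.end());
      if (run_target(smaller).crashed) in = std::move(smaller);
      else i += chunk;
    }
  }
  return in;
}

int main() {
  fuzz({Input{'s', 'e', 'e', 'd'}}, 100000);  // assumes a non-empty seed corpus
  return 0;
}

Real corpus management also crosses entries over with each other and dedupes by actual edge coverage rather than a single hash, but the keep-it-if-it-found-new-coverage decision is the heart of it.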
Starting point is 00:11:47 Yeah, and so you described fuzzers kind of as mecha monkeys earlier, and some fuzzers can be pretty dumb overall in how they mutate inputs versus, you know, some of them can end up being pretty smart. You can go all kind of generational genetic algorithm on them with a bunch of inputs saved and mixed over time. But really, fuzzers mostly mutate previous inputs based on branch coverage information to try to explore new branches that they haven't covered before. And they're surprisingly good at this, exploring branches and switches, even for things like image formats, which have structured formats with headers and then more data. So in a way, they're smart, and in a way, they're not. And in fact, if you look at AFL's logo, we'll put a link in the show notes, it's pretty cute. It's originally a JPEG from Alice in Wonderland, so kind of a bunny rabbit or something very fuzzy. And the author of AFL turned that JPEG into a GIF, or is it a "jif"? I
Starting point is 00:12:40 don't know, showing each step of the fuzzer changing the image, one change per frame. So each frame of the GIF is one mutation, basically. And the fuzzer can, through its branch coverage, understand the structure of the image format and perform mutation based on that structure. And if you look at that picture, it's oddly relaxing to watch that logo being fuzzed over time. I just have to point out that choosy programmers choose GIF. So fuzzers do particularly well, as we're saying, when you feed them a corpus of information that
Starting point is 00:13:10 they can then mutate. So say you scraped all the unit tests from your software program that caused errors in the past, like say every GitHub issue turned into some unit test that checked for regression and came from some resolved issue. That could be used as your seed corpus to help them figure out how to reach interesting places in the program easily. It's a nice map of where to start. And then they do a genetic algorithm style thing where they can cross over the interesting parts of examples with each other or mutate them similar to how genetic algorithms take bit patterns and do crossovers and mutations
Starting point is 00:13:44 in addition to trying to sample new things from the underlying space. Separately, you can also provide suggestions on which mutators might be most useful for your program. Things like strings of repeated text maybe would be very useful for a generalized parser, but maybe not so much for the internals of some other system. That's just an example. The problem with corpus is ultimately that you have to remember how to pluralize it. Corpi, corpora, corpoxen? I never remember what the right way to do that is. I think it's from Greek, and so it's corpodes, technically. So anyways, we talked about using sanitizers to find undefined behavior.
Starting point is 00:14:21 So why are we looking for undefined behavior? Well, oftentimes, UB is where unexpected things live. Now, I want to say not all UB is really actually bad, but if you're trying to fix potential issues, then removing all UB tends to be valuable. But that alone doesn't help you find problems in your program's higher-level behavior. Beyond the language-level UB,
Starting point is 00:14:41 your program has behavior as well. The programmer knows what their requirements are and what is invalid state for their program. And that's why defensive programming is useful, when the programmer adds checks for states that are invalid in their own program and calls abort when such a state is reached. This type of check is useful because the programmer tells the fuzzer what to look for. Basically, the fuzzer is told, hey, try to hit this abort, and the fuzzer is going to try to find a path to that abort being called. Right.
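As a sketch of what that looks like in practice: LLVMFuzzerTestOneInput below is libFuzzer's real entry point, but the little parser and its invariant are made up for illustration. The abort documents a state the programmer believes is impossible, and the fuzzer's whole job becomes trying to prove them wrong.

// fuzz_target.cc: build roughly like
//   clang++ -g -O1 -fsanitize=fuzzer,address,undefined fuzz_target.cc -o fuzz_target
// then run ./fuzz_target corpus_dir/ and let it hunt for the abort below.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical code under test: parses records of the form "<len>:<payload>".
static std::string parse_record(const uint8_t* data, size_t size) {
  std::string s(reinterpret_cast<const char*>(data), size);
  size_t colon = s.find(':');
  if (colon == std::string::npos || colon == 0) return {};
  size_t len = std::strtoul(s.substr(0, colon).c_str(), nullptr, 10);
  if (len != s.size() - colon - 1) return {};  // reject malformed records
  std::string payload = s.substr(colon + 1);
  // Defensive check: by construction this "cannot" fire. If the fuzzer ever
  // reaches this abort, the reasoning above was wrong, and we get a crisp,
  // reproducible report instead of silently trucking on in a bad state.
  if (payload.size() != len) std::abort();
  return payload;
}

// libFuzzer calls this over and over with mutated inputs.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  parse_record(data, size);
  return 0;
}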
Starting point is 00:15:09 And you can also add some rejection sampling before you even feed your program. You could say, no, this didn't look like an interesting sample to feed, which can also help with that. Really, there's two styles of fuzzing that we are touching on here. There's a binary mode, where imagine you've got a binary and you don't have the sources, or you've lost the sources or something, and you want to observe what its behaviors are. You can actually rewrite or observe the binary with binary instrumentation. Versus in full source-available instrumentation mode, you can actually recompile and emit extra
Starting point is 00:15:41 code and extra pieces of assembly into the output artifact. So in a way, binary mode ties back into the history of dynamic binary instrumentation tools like Valgrind and even static instrumentation tools like Atom. Whereas static, source-available instrumentation is how the modern sanitizers work, like AddressSanitizer, UBSan, and MSan. So fuzzing uses randomness to explore a format and branches, but there's diminishing returns at some point. How do we maximize the exploration of the program state while minimizing the types of inputs that have the same outcome?
Starting point is 00:16:16 That's one of the key things that this fuzzing infrastructure tries to answer as it handles its corpus of examples. On the topic of fuzzing binaries, it's kind of cool that you can fuzz an implementation you can't see at all, similar to when you have a binary, but you don't know what the source is. There's actually fuzzing of ISAs as well.
Starting point is 00:16:34 And ISA fuzzing can find things like unpublished instructions. There's a cool Black Hat talk called Breaking the x86 Instruction Set that found undocumented instructions, which we'll link to in the show notes. Yeah, and we're really just scratching the surface of the history of fuzzing in these last few points that we made. And on that topic, Thomas Dullien had a keynote on fuzzing success recently, and we'll put a link in the show notes. It's really interesting to watch.
Starting point is 00:16:58 So let's change the topic a bit and segue to hardware, because you talked about hardware just now with ISAs. There are strong ties between fuzzing software and hardware design verification. So you really don't want to have bugs in hardware, even more than you don't want them in software, because in hardware, that'll cost you millions of dollars to fix. You'd better make sure it's correct before going to a fab and making a chip. And the problem with hardware is that you have a deep and giant state space with all the flip-flops and stateful elements on the chip. And what's been done for a long time in hardware is to use formal methods and tools like SAT solvers to mathematically prove properties of the hardware. And in those circumstances, when you can prove correctness of some properties, you don't actually need fuzzing because you have a proof of the entire state space. So what that means is that fuzzing is basically trying
Starting point is 00:17:46 to get a dynamic execution to something unexpected, whereas formal methods are trying to mathematically prove a property from all possible inputs. So what's been done with hardware for a long time can also apply to software. If you can use formal methods, even on a subset of your program, then fuzzing isn't needed for that subset.
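To give a flavor of that on the software side, here is a rough sketch using Z3's C++ bindings. It assumes Z3 is installed, and the branch condition is invented purely for illustration: instead of throwing random inputs at it, we ask the solver whether any 32-bit input can satisfy it at all.

// reachable.cc: is there ANY 32-bit x that takes this branch?
// Build against Z3, e.g.: clang++ reachable.cc -lz3
#include <iostream>
#include <z3++.h>

int main() {
  z3::context ctx;
  z3::solver s(ctx);

  // Symbolic stand-in for the function's 32-bit input.
  z3::expr x = ctx.bv_const("x", 32);

  // A made-up branch condition from some inner function:
  //   if (((x * 7) & 0xff) == 0x2a && x > 1000u) { ... }
  z3::expr cond = ((x * 7) & ctx.bv_val(0xff, 32)) == ctx.bv_val(0x2a, 32) &&
                  z3::ugt(x, ctx.bv_val(1000, 32));
  s.add(cond);

  if (s.check() == z3::sat) {
    // Reachable, and the model is a concrete witness that we could even feed
    // back into a fuzzing corpus as a seed.
    std::cout << "reachable, e.g. x = " << s.get_model().eval(x) << "\n";
  } else {
    std::cout << "dead branch: unreachable for every 32-bit input\n";
  }
  return 0;
}

If the solver says unsatisfiable, that branch can be flagged as dead for every possible input, rather than burning fuzzing time trying to reach it.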
Starting point is 00:18:06 You can use a SAT or SMT solver and prove whether conditions are reachable at all in any possible execution. Now, fully inductive proofs are quite difficult to close, and I don't know enough about how inductive proofs work to really understand when they don't, so it would be fun to learn more. Now, one thing that you can do is turn subsets of a compiler's IR into SMT inputs. You can't do that easily for all programs and all programming languages, because there are often things left as explicitly undefined in there. But it's also a problem when you try to create hardware from a language that has undefined behavior, which many or most of them do. So imagine a shift operation that's undefined when overshifting.
Starting point is 00:18:47 What kind of hardware gates do you create? What comes out and what gate patterns are acceptable if you overshift? If you end up statically or dynamically overshifting? Well, clearly, you know, you should care about this. Yeah. And so I think on the topic of formal methods, they make proving properties of very large spaces more tractable. So think about the difference between a 32-bit and a 64-bit input space: for 64 bits, it's something like 90 years of brute-force execution to test all those values, even for a very, very small, simple function. But we can prove properties of even functions with large input spaces, because we find symbolically what the computation is trying to do
Starting point is 00:19:37 and use the properties of what we're doing inside of the computation. That's how these formal methods work at proving things that we couldn't do exhaustively by testing all the possible inputs. This also relates to a system called TLA+, which was originally by Leslie Lamport, and which is a formal modeling language and runtime that people use for proving properties of important algorithms, including in hardware construction and in distributed and consensus algorithms like Paxos, which you may have heard of. Oftentimes people will translate finite state machines, which are little automata, into a TLA+ model and prove properties on that in abstract form. So they won't necessarily do it directly on Verilog, say, but they'll create a model and then prove properties of the model. We're going to link to Hillel Wayne's talk on getting started with TLA+,
Starting point is 00:20:24 which helped me get started writing some models in that way. All right. So now that you've mentioned Leslie Lamport, friend of the show, let's try to jump into quick-fire topics on fuzzing, because I think we're not information dense enough yet. Yeah, we need more. So one thing that really blew my mind was this paper called the SiliFuzz paper, and it introduced this notion of fuzzing by proxy. So the idea here was that we could fuzz something like QEMU using coverage-guided fuzzing and build up a corpus. And a lot of the complexity and coverage that we observe in the simulator
Starting point is 00:20:58 inside the QEMU implementation, which is effectively an emulator, right, for the hardware architecture, that corpus is still very useful as a proxy for the complexity and the coverage that would happen inside of a silicon implementation, even if it's implemented totally differently. One is software, one is hardware, but the corpus is very interesting and reusable even across those domains. All right, so different topic. Let's talk about compiler instrumentation.
Starting point is 00:21:25 So part of how coverage-guided fuzzing works as well as it does is that the compiler and native code generation are in on the game. The instrumentation makes callbacks to the fuzzing runtime, and that runtime tracks information in a data structure. And doing this in-process allows the fuzzing to run a lot faster
Starting point is 00:21:44 than if it was going on outside the process. And in that, maximizing the frequency with which you run the fuzzer program is also a big part of the game, being able to iterate quickly. Basically, coverage recording flags can control where and how to add pseudo-branches. So consider if you're branching on an arithmetic expression that says something like, you know, x is greater than 32. Well, maybe you still want to know what bits are set in x beyond the fifth one, and include those in the things that you try to cover. Yeah, another interesting thing in this space is fault injection. So sometimes to hit rare cases, you have to inject faults that get you down some path that you would never expect to happen in practice. It might be very difficult to craft some input stimulus to get you exactly in that mode.
Starting point is 00:22:31 So in a way, this goes back to our not taking branches podcast from a few weeks ago. How do we force our way towards those exceptional circumstances that are being handled in the cold paths of the code? So you need enough seams. Seams is a really nice word. It's some way to reach in and grab a piece of your program and control what it does. You need enough seams in your program to enable fault-injection-based testing. Sometimes people will do this with mocks or fakes. You have to design the software ultimately so you can easily do that reaching in and causing of something atypical to happen.
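A minimal sketch of what such a seam can look like, with every name made up for illustration: the production code funnels writes through an injectable function, and a test or fuzz harness reaches in and swaps in one that fails on demand, which forces execution down the cold error-handling path.

#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// The seam: writes go through an injectable function instead of being
// hard-wired to the real syscall. Production installs the real thing; a test
// can install something that fails whenever we want it to.
using WriteFn = std::function<bool(const void* buf, size_t len)>;

class Journal {
 public:
  explicit Journal(WriteFn write) : write_(std::move(write)) {}

  // The interesting, rarely exercised code is the error handling here.
  bool Append(const std::vector<char>& record) {
    if (!write_(record.data(), record.size())) {
      ++rollbacks_;  // the cold path we want coverage of
      return false;
    }
    ++committed_;
    return true;
  }

  int committed() const { return committed_; }
  int rollbacks() const { return rollbacks_; }

 private:
  WriteFn write_;
  int committed_ = 0;
  int rollbacks_ = 0;
};

int main() {
  // Fault-injecting fake: fail every third write. No disk, no mocking framework.
  int calls = 0;
  Journal j([&](const void*, size_t) { return ++calls % 3 != 0; });

  const std::vector<char> rec{'h', 'i'};
  for (int i = 0; i < 9; ++i) j.Append(rec);

  assert(j.committed() == 6);
  assert(j.rollbacks() == 3);
  return 0;
}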
Starting point is 00:23:09 Michael Feathers has a great book on testing patterns that really hammered home this word seams for me. I did really enjoy his book Working Effectively with Legacy Code. Yeah. So I have to chime in and mention my coworker and friend of the show, Robert Seacord. He has a book called Modernizing Legacy Systems. It's from 2003 and I haven't read it, but Robert will be super happy that I mentioned his book. Oh, thanks. I won't read it either. Nice. So another quick-fire topic here,
Starting point is 00:23:31 classic software question around assertions and check failure. So the question here is, is it better to crash or to keep on trucking when something bad happens? So it typically depends on how unrecoverable and impactful your invariant failure is and what the potential impact of crashing versus trucking on in an inconsistent state might be. So what we've seen is that different teams and different industries define different value functions around this question.
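One common way those differing value functions show up in code is a project-wide check macro whose failure behavior is a build-time policy choice. A minimal sketch, with the macro and the flag name made up:

#include <cstdio>
#include <cstdlib>

// Build with -DMY_CHECKS_FAIL_STOP for the crash-on-violation policy (what a
// safety-critical build, a debug build, or a fuzzing build might want), or
// without it for a log-and-keep-trucking policy that a big serving fleet
// might prefer, leaning on monitoring and a supervisor instead.
#ifdef MY_CHECKS_FAIL_STOP
#define MY_CHECK(cond)                                                      \
  do {                                                                      \
    if (!(cond)) {                                                          \
      std::fprintf(stderr, "check failed: %s (%s:%d)\n", #cond, __FILE__,   \
                   __LINE__);                                               \
      std::abort();                                                         \
    }                                                                       \
  } while (0)
#else
#define MY_CHECK(cond)                                                      \
  do {                                                                      \
    if (!(cond)) {                                                          \
      std::fprintf(stderr, "check failed, continuing: %s (%s:%d)\n", #cond, \
                   __FILE__, __LINE__);                                     \
    }                                                                       \
  } while (0)
#endif

int main(int argc, char**) {
  MY_CHECK(argc > 0);  // holds in practice; the policy only matters when a check doesn't
  return 0;
}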
Starting point is 00:23:59 So if you have a big website, you don't want someone to be able to send you a packet of doom or something, where they just send a packet and your entire website goes down. And even with a supervisor process that crashes and recovers, it's possible to cause a crash loop when there's something like a packet of doom. And in industries like functional safety, there's this thing called a hazard analysis and risk assessment, a HARA. And that's something that's written to see what the program failures could be and how to react to them.
Starting point is 00:24:29 And often what you do with that is you end up designing a supervisor or something that can handle the failures and it forces the design of your software to handle the problem, including if something causes a crash and if you have a crash loop and so on. So reaching up towards the top shelf of some of
Starting point is 00:24:45 the interesting techniques and research in this space, there's something called concolic execution, concolic testing. And the word there is a mix of the word concrete and the word symbolic. So the idea is that you throw concrete input stimulus at your device under test and you observe what happens. But then you say, hey, I saw that this condition never triggered. Can it ever trigger? And I'm just not finding the right stimulus. So when you ask yourself this question, you actually go and try to back solve symbolically. This is where the symbolic comes in. You try to find if that condition is even possible to reach, or if it is possible to reach, what is the cone of inputs in the input
Starting point is 00:25:26 space that will cause that condition to be true? So you can see you're flipping between concrete stimulus and then back-solving symbolically for interesting things in the input space. So if you do find something, now you've found a great thing to add to your corpus. And if you find that it's impossible to actually make this condition true, you know that that condition is effectively dead and you can flag that and not care about reaching it anymore. Yeah. And so, like you described, this kind of can-I-even-ever-see-this-condition type of testing? It's also related to bounded model checking.
Starting point is 00:25:57 So imagine your program has like a top-level loop somewhere, like a while(1). And you can ask successively, can the bad thing happen if I run the loop just one time, right? You can try to prove, like, if I run the loop one time, can I make the program fall flat on its face? No. Okay, well, what if I run this loop two times? And you say, okay, maybe that can't happen in two times either. And so you brute-force unroll and try to prove, unrolled once, twice, three times. You gain more confidence in whether the condition seems reachable with a reasonable amount of stimulus and time steps. So you just usually hard-code a limit on unrolling and give up once you reach the limit. Or in some cases,
Starting point is 00:26:35 you might try to prove the equivalence of the loop unrolled a few times or something like that. And so if you have a symbolic representation of the valid state of your program, you can kind of try to fast-forward to some reachable state by changing the values and then walk forward a few time steps from there. Yeah, it's kind of like a fast-forwarding technique. Yeah. So I'll point out that Joe Armstrong, the inventor of Erlang, may he rest in peace, has a great talk called The Mess We're In. It talks about how QA being so difficult really relates to the huge state space that we have in modern systems, and that they're often not super well-defined, at least in a formal sense, and that we're trying to revise designs and
Starting point is 00:27:17 rev them and increase and iterate on them with lots of features over time. And so it indeed does look like a mess in many ways, but it is our mess. This is the stuff that we work with. This is the medium we work in. And so with fuzzing, we're actually solving our carefully constructed mess of the software we wrote using these focused beams of random mess. Beautiful in its own way to use mess to try to solve mess as we try to find ways of doing things better. All right. Well, on that note, I'm glad that we got to record this episode. I really hope that folks enjoyed it. Please join us next week for our next podcast episode.
Starting point is 00:27:51 We'd like to thank our sponsors, but we don't have any, and don't forget to like and subscribe, tell your friends and go fuzz something. Are you telling people to go fuzz themselves? Basically, I guess I did. All right. All right. See you next week.
