
[DRAFT] RFC: generator grammars, state space estimates, and topological fuzzing #203

camshaft opened this issue Jan 13, 2024 · 2 comments

@camshaft
Owner

camshaft commented Jan 13, 2024

I've been investigating improving generators by adding grammar introspection to the generator traits. This allows an abstract tree to be constructed and the generator to be reasoned about at runtime.

This enables some interesting applications:

State space estimates

Having the whole grammar means we can produce an estimate of the size of the entire state space. This could then be used in a few ways.

The first would be to simply report the size. You could take this a step further: measure roughly how long each iteration takes and extrapolate how long it would take to brute-force the entire space.

Another interesting application would be to update the strategy used for smaller state spaces. For example: if the space is small enough and the test is fast enough, you could just exhaustively iterate through all of the inputs and declare it completely tested.
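
As a rough sketch of how the estimate could be computed (the Grammar tree, its node names, and state_space below are hypothetical illustrations, not the actual traits in the prototype): product nodes multiply the sizes of their children and sum nodes add them.

enum Grammar {
    // A terminal with a known number of possible values, e.g. bool has 2.
    Terminal { cardinality: u128 },
    // A struct or tuple: every field is generated, so sizes multiply.
    Product(Vec<Grammar>),
    // An enum: exactly one variant is generated, so sizes add.
    Sum(Vec<Grammar>),
}

impl Grammar {
    // Estimated number of distinct values the generator can produce,
    // saturating instead of overflowing for very large spaces.
    fn state_space(&self) -> u128 {
        match self {
            Grammar::Terminal { cardinality } => *cardinality,
            Grammar::Product(fields) => fields
                .iter()
                .fold(1u128, |acc, field| acc.saturating_mul(field.state_space())),
            Grammar::Sum(variants) => variants
                .iter()
                .fold(0u128, |acc, variant| acc.saturating_add(variant.state_space())),
        }
    }
}

fn main() {
    // (bool, u8) has 2 * 256 = 512 possible values.
    let grammar = Grammar::Product(vec![
        Grammar::Terminal { cardinality: 2 },
        Grammar::Terminal { cardinality: 256 },
    ]);
    let size = grammar.state_space();

    // Extrapolate a brute-force run from a measured per-iteration cost.
    let nanos_per_iteration: u128 = 250; // hypothetical measurement
    println!(
        "state space: {size}, exhaustive run: ~{}µs",
        size * nanos_per_iteration / 1_000
    );
}

Recursive grammars like the Expr example below would need either a depth bound or an explicit "unbounded" marker for this walk to terminate.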

Topological fuzzing

I've also been thinking about using the grammar to improve how we pull bytes from the fuzzing input. Some of this is inspired by https://blog.regehr.org/archives/1700.

At a high level, say you have an input like the following:

enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}

Currently, the decision as to which variant to use at each level is made in isolation. It looks something like this:

match driver.gen(0..2) {
    0 => Expr::Bool(driver.gen()),
    1 => Expr::Not(driver.gen()),
    _ => unreachable!(),
}

This causes the topology selection to be biased towards smaller inputs, rather than being uniform across all possible topologies.
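
For the Expr example above the bias is easy to quantify: each additional Not level requires another independent 50/50 choice to land on variant 1, so the depth distribution falls off geometrically (a minimal sketch, assuming the driver draws each choice uniformly):

fn main() {
    for depth in 0..6u32 {
        // depth 0 is Bool, each extra level is one more Not wrapper
        let probability = 0.5_f64.powi(depth as i32 + 1) * 100.0;
        println!("depth {depth}: {probability:.3}%");
    }
    // prints: depth 0: 50.000%, depth 1: 25.000%, depth 2: 12.500%, ...
}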

This is stated really well in that blog post:

Let’s consider this very simple decision tree:

(decision tree figure from the linked post)

If we use it to generate output a bunch of times, we’ll end up getting outputs with percentages something like this:

A 49.979
B 25.0018
C 12.509
D 6.26272
E 3.1265
F 3.12106

What this means in practice is we spend a large amount of time generating shallow inputs we've already seen, rather than exploring deeper trees. This becomes even more of an issue once you start using Vecs in the input.

This can be solved by inspecting the grammar at the beginning of a fuzzing run and producing weights for each variant based on their relative sub-tree topologies, as sketched below. This will produce uniformly distributed shapes.
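
One way the weights could be derived for the Expr example (a sketch with hypothetical helpers, not the prototype's actual algorithm): count how many topologies are reachable through each variant within a depth budget, then pick variants proportionally to those counts.

// Number of distinct Expr topologies that fit within `depth` levels of Not
// nesting: Bool, Not(Bool), ..., Not^depth(Bool).
fn count_topologies(depth: u32) -> u64 {
    if depth == 0 {
        1 // only Bool fits
    } else {
        1 + count_topologies(depth - 1)
    }
}

// Relative weights for choosing (Bool, Not) at a node with `depth` levels of
// budget remaining, proportional to the topologies reachable through each.
fn variant_weights(depth: u32) -> (u64, u64) {
    let bool_weight = 1;
    let not_weight = if depth == 0 { 0 } else { count_topologies(depth - 1) };
    (bool_weight, not_weight)
}

fn main() {
    // With a depth budget of 4 there are 5 topologies, so the root should pick
    // Not four times as often as Bool; re-deriving the weights one level down
    // (budget 3, then 2, ...) makes every topology equally likely overall.
    assert_eq!(count_topologies(4), 5);
    assert_eq!(variant_weights(4), (1, 4));
}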

A really nice property of this is that these topologies can be mapped to binary representations, so each topology is assigned a unique identifier. This should work quite well with fuzzers: we read the identifier at the beginning of the input, followed by a fixed separator, followed by a fixed pool of values for the nodes of the topology.

let topology = slice.split(4); // in this example the topology space fits in 4 bytes
if slice.split(4) != "Blro" { return } // fixed separator between the id and the value pool
match topology {
    0 => { /* generate the values that are needed for this topology */ }
    // other topologies go here
}

Note: the above is pseudocode that highlights the concept, not necessarily how it will be implemented.

Something that I want to highlight here is that this means the generator branches once at the very beginning of the input, and every other value extracted after that has a fixed interpretation. I think this should be quite powerful: without it, random changes at random offsets completely change how the following bytes are interpreted, which can be quite difficult for a fuzzing engine to track.
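
To make that concrete for the Expr example (again a hypothetical sketch, not the prototype's API): once the topology id is known, the number and order of leaf values pulled from the rest of the input is fully determined.

enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}

// Here the topology id directly encodes how many Not wrappers surround the
// single Bool leaf, so exactly one byte of the value pool is consumed no
// matter which topology was selected.
fn from_topology(topology: u32, value_pool: &mut impl Iterator<Item = u8>) -> Expr {
    let leaf = Expr::Bool((value_pool.next().unwrap_or(0) & 1) == 1);
    (0..topology).fold(leaf, |inner, _| Expr::Not(Box::new(inner)))
}

fn main() {
    // Topology 2 is Not(Not(Bool(_))); flipping bits in the value pool only
    // changes the leaf value, never the shape.
    let mut pool = [0xff_u8].into_iter();
    assert!(matches!(from_topology(2, &mut pool), Expr::Not(_)));
}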

I've implemented a WIP prototype of these concepts in https://github.com/camshaft/bolero/tree/grammars.

@Ekleog
Contributor

Ekleog commented Jan 13, 2024

This looks interesting! I haven't had the time to look through all the code, but it seems to have the potential to be the future of fuzzer input generation :)

Hope it works out well in practice, compared to other ways of generating the input!

@mwhicks1

@camshaft Is it possible to share more details about your proposed algorithm? Looking at the code is not very illuminating to me. :(

Some prior work on "k-paths" proposes an on-line (feedback-directed) algorithm for generating tests that cover portions of a grammar (in the case of Bolero, an enum type definition). Another option would be to analyze the grammar/enum and choose probabilities for different branches based on the hoped-for input depth and diversity. It seems like you are angling for this option, but I can't tell how it would work in general based on your example, since you don't show the connection to the type definition (at least, it doesn't seem like the topology part and the Expr part are related; perhaps I'm missing something).
