
[DRAFT] RFC: generator grammars, state space estimates, and topological fuzzing #203

camshaft opened this issue Jan 13, 2024 · 2 comments

@camshaft
Owner

camshaft commented Jan 13, 2024

I've been investigating improving generators by adding grammar introspection to the generator traits. This allows an abstract tree to be constructed and the generator to be reasoned about at runtime.

This enables some interesting applications:

State space estimates

Having the whole grammar means we can produce an estimate of the size of the entire state space. This could then be used in a few ways.

The first would be to simply report the size. You could take this a step further: measure roughly how long each iteration takes and extrapolate how long it would take to brute-force the entire space.

Another interesting application would be to update the strategy used for smaller state spaces. For example: if the space is small enough and the test is fast enough, you could just exhaustively iterate through all of the inputs and declare it completely tested.
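
As a rough sketch of how the estimate could be computed (the Grammar tree, its node names, and state_space below are hypothetical illustrations, not the actual traits in the prototype): product nodes multiply the sizes of their children and sum nodes add them.

enum Grammar {
    // A terminal with a known number of possible values, e.g. bool has 2.
    Terminal { cardinality: u128 },
    // A struct or tuple: every field is generated, so sizes multiply.
    Product(Vec<Grammar>),
    // An enum: exactly one variant is generated, so sizes add.
    Sum(Vec<Grammar>),
}

impl Grammar {
    // Estimated number of distinct values the generator can produce,
    // saturating instead of overflowing for very large spaces.
    fn state_space(&self) -> u128 {
        match self {
            Grammar::Terminal { cardinality } => *cardinality,
            Grammar::Product(fields) => fields
                .iter()
                .fold(1u128, |acc, field| acc.saturating_mul(field.state_space())),
            Grammar::Sum(variants) => variants
                .iter()
                .fold(0u128, |acc, variant| acc.saturating_add(variant.state_space())),
        }
    }
}

fn main() {
    // (bool, u8) has 2 * 256 = 512 possible values.
    let grammar = Grammar::Product(vec![
        Grammar::Terminal { cardinality: 2 },
        Grammar::Terminal { cardinality: 256 },
    ]);
    let size = grammar.state_space();

    // Extrapolate a brute-force run from a measured per-iteration cost.
    let nanos_per_iteration: u128 = 250; // hypothetical measurement
    println!(
        "state space: {size}, exhaustive run: ~{}µs",
        size * nanos_per_iteration / 1_000
    );
}

Recursive grammars like the Expr example below would need either a depth bound or an explicit "unbounded" marker for this walk to terminate.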

Topological fuzzing

I've also been thinking about using the grammar to improve how we pull bytes from the fuzzing input. Some of this is inspired by https://blog.regehr.org/archives/1700.

At a high level, say you have an input like the following:

enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}

Currently, the decision as to which variant to use at each level is made in isolation. It looks something like this:

match driver.gen(0..2) {
    0 => Expr::Bool(driver.gen()),
    1 => Expr::Not(driver.gen()),
    _ => unreachable!(),
}

This causes the topology selection to be biased towards smaller inputs, rather than being uniform across all possible topologies.
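
For the Expr example above the bias is easy to quantify: each additional Not level requires another independent 50/50 choice to land on variant 1, so the depth distribution falls off geometrically (a minimal sketch, assuming the driver draws each choice uniformly):

fn main() {
    for depth in 0..6u32 {
        // depth 0 is Bool, each extra level is one more Not wrapper
        let probability = 0.5_f64.powi(depth as i32 + 1) * 100.0;
        println!("depth {depth}: {probability:.3}%");
    }
    // prints: depth 0: 50.000%, depth 1: 25.000%, depth 2: 12.500%, ...
}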

This is stated really well in that blog post:

Let’s consider this very simple decision tree:

(decision tree figure from the linked post)

If we use it to generate output a bunch of times, we’ll end up getting outputs with percentages something like this:

A 49.979
B 25.0018
C 12.509
D 6.26272
E 3.1265
F 3.12106

What this means in practice is we spend a large amount of time generating shallow inputs we've already seen, rather than exploring deeper trees. This becomes even more of an issue once you start using Vecs in the input.

This can be solved by inspecting the grammar at the beginning of a fuzzing run and producing weights for each variant based on their relative sub-tree topologies, as sketched below. This will produce uniformly distributed shapes.
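
One way the weights could be derived for the Expr example (a sketch with hypothetical helpers, not the prototype's actual algorithm): count how many topologies are reachable through each variant within a depth budget, then pick variants proportionally to those counts.

// Number of distinct Expr topologies that fit within `depth` levels of Not
// nesting: Bool, Not(Bool), ..., Not^depth(Bool).
fn count_topologies(depth: u32) -> u64 {
    if depth == 0 {
        1 // only Bool fits
    } else {
        1 + count_topologies(depth - 1)
    }
}

// Relative weights for choosing (Bool, Not) at a node with `depth` levels of
// budget remaining, proportional to the topologies reachable through each.
fn variant_weights(depth: u32) -> (u64, u64) {
    let bool_weight = 1;
    let not_weight = if depth == 0 { 0 } else { count_topologies(depth - 1) };
    (bool_weight, not_weight)
}

fn main() {
    // With a depth budget of 4 there are 5 topologies, so the root should pick
    // Not four times as often as Bool; re-deriving the weights one level down
    // (budget 3, then 2, ...) makes every topology equally likely overall.
    assert_eq!(count_topologies(4), 5);
    assert_eq!(variant_weights(4), (1, 4));
}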

A really nice property of this is that these topologies can be mapped to binary representations, so each topology is assigned a unique identifier. This should work quite well with fuzzers: we read the identifier at the beginning of the input, followed by a fixed separator, followed by a fixed pool of values for the nodes of the topology.

let topology = slice.split(4); // in this example the topology space fits in 4 bytes
if slice.split(4) != "Blro" { return } // fixed separator between the id and the value pool
match topology {
    0 => { /* generate the values that are needed for this topology */ }
    // other topologies go here
}

Note: the above is pseudocode that highlights the concept, not necessarily how it will be implemented.

Something that I want to highlight here is that this means the generator branches once at the very beginning of the input, and every other value extracted after that has a fixed interpretation. I think this should be quite powerful: without it, random changes at random offsets completely change how the following bytes are interpreted, which can be quite difficult for a fuzzing engine to track.
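
To make that concrete for the Expr example (again a hypothetical sketch, not the prototype's API): once the topology id is known, the number and order of leaf values pulled from the rest of the input is fully determined.

enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}

// Here the topology id directly encodes how many Not wrappers surround the
// single Bool leaf, so exactly one byte of the value pool is consumed no
// matter which topology was selected.
fn from_topology(topology: u32, value_pool: &mut impl Iterator<Item = u8>) -> Expr {
    let leaf = Expr::Bool((value_pool.next().unwrap_or(0) & 1) == 1);
    (0..topology).fold(leaf, |inner, _| Expr::Not(Box::new(inner)))
}

fn main() {
    // Topology 2 is Not(Not(Bool(_))); flipping bits in the value pool only
    // changes the leaf value, never the shape.
    let mut pool = [0xff_u8].into_iter();
    assert!(matches!(from_topology(2, &mut pool), Expr::Not(_)));
}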

I've implemented a WIP prototype of these concepts in https://github.com/camshaft/bolero/tree/grammars.

@Ekleog
Contributor

Ekleog commented Jan 13, 2024

This looks interesting! I haven't had the time to look through all the code, but it seems to have the potential to be the future of fuzzer input generation :)

Hope it works out well in practice, compared to other ways of generating the input!

@mwhicks1

@camshaft Is it possible to share more details about your proposed algorithm? Looking at the code is not very illuminating to me. :(

Some prior work on "k-paths" proposes an on-line (feedback-directed) algorithm for generating tests that cover portions of a grammar (in the case of Bolero, an enum type definition). Another option would be to analyze the grammar/enum and choose probabilities for different branches based on the hoped-for input depth and diversity. It seems like you are angling for this option, but I can't tell how it would work in general based on your example, since you don't show the connection to the type definition (at least, it doesn't seem like the topology part and the Expr part are related; perhaps I'm missing something).
