Allow Unstructured to be backed by an iterator of bytes instead of a byte slice? #103

Open
maackle opened this issue Feb 18, 2022 · 4 comments

Comments

@maackle
Contributor

maackle commented Feb 18, 2022

In our use case, we are not using Arbitrary for fuzzing, but simply for creating arbitrary fixture values in tests. Currently we create a 10MB static Vec<u8> of random noise and use that as our Unstructured data. However this is annoying since it requires 10MB of memory overhead, and sometimes even then we run out of bytes.

I am wondering if it would be valid to have two flavors of Unstructured, one backed by a byte slice, presumably for fuzzing, and one backed by an infinite iterator of bytes which can never be exhausted. I experimented with a PR for doing this and got some basic tests passing, but don't know how valid it is in the grand scheme. I did find that some functionality is indeed dependent on the fixed byte slice, so at the very least some extra UX effort would have to be made to provide slightly different interfaces for different Unstructured flavors.

I understand if this wouldn't be worth the effort but I guess I am primarily wondering from a motivational standpoint why Unstructured is backed by a byte array instead of an iterator, and only secondarily asking for some feedback on the feasibility of using infinite iterators. Note: I know next to nothing about fuzzing.

@nagisa
Member

nagisa commented Feb 18, 2022

It is more straightforward to derive the relationship between data passed in as a buffer and its use in code, than when it needs to go through the iterator machinery.

The predecessor to Unstructured was also intended to be inexhaustible, so it was important for it to be able to read the data from the buffer multiple times in a ring buffer like fashion.
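
As a rough illustration of the "ring buffer like" reading described above (not the predecessor's actual code), here is a minimal std-only sketch. The `RingBytes` type and its `fill` method are hypothetical names made up for this example; they are not part of the `arbitrary` crate.

```rust
/// Hypothetical reader that serves bytes from a fixed buffer and wraps
/// around to the start when it reaches the end, so it never runs dry.
struct RingBytes<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> RingBytes<'a> {
    /// Returns `None` for an empty buffer, which has nothing to cycle over.
    fn new(data: &'a [u8]) -> Option<Self> {
        if data.is_empty() {
            None
        } else {
            Some(RingBytes { data, pos: 0 })
        }
    }

    /// Fill `out` completely, re-reading the buffer as often as needed.
    fn fill(&mut self, out: &mut [u8]) {
        for slot in out.iter_mut() {
            *slot = self.data[self.pos];
            self.pos = (self.pos + 1) % self.data.len();
        }
    }
}

fn main() {
    let mut ring = RingBytes::new(&[1, 2, 3]).expect("non-empty buffer");
    let mut out = [0u8; 7];
    ring.fill(&mut out);
    println!("{:?}", out); // prints [1, 2, 3, 1, 2, 3, 1]
}
```

The trade-off is that bytes repeat once the cursor wraps, so values generated after the wrap are derived from data that was already consumed.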

It would be an interesting challenge/exercise to implement a virtually infinite buffer of virtual memory that is filled with data as accesses from Unstructured fault the pages in. Basically something along the lines of:

  1. mmap(a long inaccessible chunk of virtual memory)
  2. set up a signal handler for SIGSEGV on the mmapped pages
  3. make Unstructured with the mmapped buffer
  4. when the signal handler fires, fill the faulting page with data and make the page RW before returning control to the faulting instruction.

@maackle
Contributor Author

maackle commented Feb 18, 2022

To be clear, I was thinking of something as simple as std::iter::repeat_with(rand::thread_rng().gen) as the iterator.
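
The expression above reads as shorthand rather than compilable Rust, since `repeat_with` needs a closure. A minimal std-only sketch of the same idea follows, assuming a small xorshift64 generator as a stand-in for `rand`; the function name `noise_bytes` is hypothetical.

```rust
use std::iter;

/// Hypothetical stand-in for `rand`: an endless stream of pseudo-random
/// bytes from a xorshift64 generator. Not suitable for security purposes.
fn noise_bytes(seed: u64) -> impl Iterator<Item = u8> {
    // xorshift must not start from zero, or it would stay at zero forever.
    let mut state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
    iter::repeat_with(move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Take one byte from the upper half of the state.
        (state >> 32) as u8
    })
}

fn main() {
    // The iterator never ends, so the caller decides how much to take.
    let buf: Vec<u8> = noise_bytes(42).take(16).collect();
    println!("{:?}", buf);
}
```

Because `Unstructured` is constructed from a byte slice, a stream like this still has to be collected into a buffer before use, which is the gap the question is about.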

I guess what I don't understand is what relationship needs to be understood between the data and its use in code, if the unstructured data is truly arbitrary. This may be where my lack of knowledge of fuzzing methodology is holding me back.

@fitzgen
Member

fitzgen commented Feb 22, 2022

The Arbitrary+Unstructured APIs are already fairly complex and I wouldn't want to make them any more complex by adding type parameters or anything like that.

I think you should be able to get your ultimately desired functionality in a performant manner by using the size_hint method and doing something like:

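use arbitrary::{Arbitrary, Unstructured};
use rand::Rng;

// Assumes `rng` is some `rand::Rng` and `MyType` implements `Arbitrary`.
let (min, max) = MyType::size_hint(0);
// Never start from an empty buffer: doubling zero bytes stays at zero.
let capacity = max.unwrap_or(min * 2).max(1);
let mut data = vec![0u8; capacity];
loop {
    // Refill the whole buffer with fresh random bytes.
    rng.fill(&mut data[..]);
    let u = Unstructured::new(&data);
    let x = match MyType::arbitrary_take_rest(u) {
        Ok(x) => x,
        Err(arbitrary::Error::NotEnoughData) => {
            // Double the buffer's size. Optionally have a max
            // buffer size.
            let new_len = data.len() * 2;
            data.resize(new_len, 0);
            continue;
        }
        Err(_) => {
            // Just try again with new data. Optionally have a
            // max number of retries.
            continue;
        }
    };

    // Do stuff with `x`...
}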

@maackle
Contributor Author

maackle commented Feb 22, 2022

Interesting, this could work. Thinking it could be made ergonomic by creating a trait parallel to Arbitrary, with a similar API but which instead of taking Unstructured, takes a new struct which wraps a SeededRng and a Vec<u8> buffer, and produces Unstructured on the fly when asked to generate an arbitrary type. If I wind up needing this, I'll give that a try.
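
A minimal std-only sketch of that wrapper idea, under stated assumptions: `Fixtures`, `generate`, and `NotEnoughData` are hypothetical names, the xorshift64 generator stands in for a seeded RNG, and the `build` closure stands in for creating an `Unstructured` from the buffer and calling `arbitrary_take_rest` on it.

```rust
/// Hypothetical error standing in for `arbitrary::Error::NotEnoughData`.
#[derive(Debug)]
struct NotEnoughData;

/// Hypothetical wrapper: a seeded generator plus a reusable byte buffer.
struct Fixtures {
    state: u64,
    buf: Vec<u8>,
}

impl Fixtures {
    fn new(seed: u64) -> Self {
        // Keep the xorshift state non-zero; start with a small buffer.
        Fixtures { state: seed | 1, buf: vec![0; 8] }
    }

    /// Refill the buffer with fresh pseudo-random bytes (xorshift64).
    fn refill(&mut self) {
        for byte in self.buf.iter_mut() {
            self.state ^= self.state << 13;
            self.state ^= self.state >> 7;
            self.state ^= self.state << 17;
            *byte = (self.state >> 32) as u8;
        }
    }

    /// Call `build` on fresh bytes, doubling the buffer until it succeeds
    /// or the buffer would exceed `max_len`.
    fn generate<T>(
        &mut self,
        max_len: usize,
        build: impl Fn(&[u8]) -> Result<T, NotEnoughData>,
    ) -> Option<T> {
        loop {
            self.refill();
            match build(&self.buf) {
                Ok(value) => return Some(value),
                Err(NotEnoughData) if self.buf.len() * 2 <= max_len => {
                    let new_len = self.buf.len() * 2;
                    self.buf.resize(new_len, 0);
                }
                Err(NotEnoughData) => return None,
            }
        }
    }
}

fn main() {
    let mut fixtures = Fixtures::new(1234);
    // A toy "type" that needs 20 bytes, more than the initial buffer holds.
    let value = fixtures.generate(1024, |bytes| {
        bytes.get(..20).map(|b| b.to_vec()).ok_or(NotEnoughData)
    });
    println!("{:?}", value.map(|v| v.len())); // prints Some(20)
}
```

The buffer is kept between calls, so after the first few fixtures it has usually grown enough that later calls succeed on the first attempt.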


3 participants