Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Open
Ekleog opened this issue Jan 12, 2024 · 2 comments
Open

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Ekleog opened this issue Jan 12, 2024 · 2 comments

Comments

@Ekleog
Copy link

Ekleog commented Jan 12, 2024

Hey!

I've been thinking a lot about arbitrary recently, had a chat with @nagisa, and they told me to report all my thoughts here for your consumption. So here are a few thoughts, hoping to give ideas, gather feedback, and maybe eventually I'll write a PR if you agree with the concept :)

There are three main points that have been bothering me with arbitrary, and my alternating usage of both it and bolero:

  1. The fact it only supports byte slices as inputs, and not RNGs to double as proptests. The chat just now with @nagisa made me understand why that's not actually a problem but a good thing, for minimization reasons.
  2. The fact that when writing manual implementations of Arbitrary, it's very easy to get stack overflows. Just yesterday, I had an enum with enum Foo { Variant(Vec<Foo>), ... }, and got a stack overflow by doing things the naive way. This is basically this issue, and it's already been closed so I won't relitigate, but I'll just say that having a freestanding arbitrary::recursion_guard() that updates a thread-local counter on each nested call to Unstructured:: functions would probably solve the issue without complicating the Arbitrary trait itself.
  3. The one I'm actually writing this issue for. Currently, the way arbitrary generates collections is one I had not heard of yet: it tests after each element whether to take a new element. This might look good, but a very simple change (eg. changing the length of one element) can lead to a complete structural change, especially with nested collections.

In order to avoid the issue in point 3, a magic separator-based solution should help the fuzzer identify the input: if the length of one element changes, then either the input will stop parsing (if the length would go beyond the magic separator), or the input will parse with the exact same structure. In turn, this means that small changes in input lead to small changes in structure, which is a great property for fuzzers.

On the other hand, this only works for fuzzers. Such a scheme would definitely not work for "lightweight" tests like proptests, because they'd have a way too high fail-to-parse rate for that.

Hence, I'm thinking there'd be the need for at least two versions of Unstructured: one that handles collections as it does right now (stop-or-continue check), and one that handles collections with a magic delimiter (plus escaping and nested delimiters, non-trivial to implement but should be doable).

Incidentally, the version that uses a magic delimiter might hit some issues with the current Unstructured API; due to the lifetimes there that require borrowing from the Unstructured, yet unescaping the magic word would require mutating the Unstructured data (or copying it, which would be sad)

So I wanted to first ask: what do you think about the idea? :)

@Ekleog Ekleog changed the title Unstructured v2: TLV-based vs. Delimitor-based Unstructured v2: Stop-or-Go-based vs. Delimitor-based Jan 12, 2024
@Ekleog Ekleog changed the title Unstructured v2: Stop-or-Go-based vs. Delimitor-based Unstructured v2: Stop-or-Continue-based vs. Delimitor-based Jan 12, 2024
@camshaft
Copy link

I've been thinking about a third option for bolero generators for a few months now. I finally had a bit of time to write my thoughts about it: camshaft/bolero#203.

@silvergasp
Copy link

Not a maintainer, but I thought I'd chime in with some thoughts. Namely it'd be nice to constrain Unstructured such that it becomes easier to implement de-arbitrary at the moment this is quite a difficult problem as some of the operations in Unstructured are not easily reversible. This would make building seed-corpora easier and also enable structured custom mutators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants