Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Ekleog · 2024-01-12T00:30:50Z

Hey!

I've been thinking a lot about arbitrary recently, had a chat with @nagisa, and they told me to report all my thoughts here for your consumption. So here are a few thoughts, hoping to give ideas, gather feedback, and maybe eventually I'll write a PR if you agree with the concept :)

There are three main points that have been bothering me with arbitrary, and my alternating usage of both it and bolero:

The fact it only supports byte slices as inputs, and not RNGs to double as proptests. The chat just now with @nagisa made me understand why that's not actually a problem but a good thing, for minimization reasons.
The fact that when writing manual implementations of Arbitrary, it's very easy to get stack overflows. Just yesterday, I had an enum with enum Foo { Variant(Vec<Foo>), ... }, and got a stack overflow by doing things the naive way. This is basically this issue, and it's already been closed so I won't relitigate, but I'll just say that having a freestanding arbitrary::recursion_guard() that updates a thread-local counter on each nested call to Unstructured:: functions would probably solve the issue without complicating the Arbitrary trait itself.
The one I'm actually writing this issue for. Currently, the way arbitrary generates collections is one I had not heard of yet: it tests after each element whether to take a new element. This might look good, but a very simple change (eg. changing the length of one element) can lead to a complete structural change, especially with nested collections.

In order to avoid the issue in point 3, a magic separator-based solution should help the fuzzer identify the input: if the length of one element changes, then either the input will stop parsing (if the length would go beyond the magic separator), or the input will parse with the exact same structure. In turn, this means that small changes in input lead to small changes in structure, which is a great property for fuzzers.

On the other hand, this only works for fuzzers. Such a scheme would definitely not work for "lightweight" tests like proptests, because they'd have a way too high fail-to-parse rate for that.

Hence, I'm thinking there'd be the need for at least two versions of Unstructured: one that handles collections as it does right now (stop-or-continue check), and one that handles collections with a magic delimiter (plus escaping and nested delimiters, non-trivial to implement but should be doable).

Incidentally, the version that uses a magic delimiter might hit some issues with the current Unstructured API; due to the lifetimes there that require borrowing from the Unstructured, yet unescaping the magic word would require mutating the Unstructured data (or copying it, which would be sad)

So I wanted to first ask: what do you think about the idea? :)

The text was updated successfully, but these errors were encountered:

camshaft · 2024-01-13T11:56:56Z

I've been thinking about a third option for bolero generators for a few months now. I finally had a bit of time to write my thoughts about it: camshaft/bolero#203.

silvergasp · 2024-01-16T21:43:17Z

Not a maintainer, but I thought I'd chime in with some thoughts. Namely it'd be nice to constrain Unstructured such that it becomes easier to implement de-arbitrary at the moment this is quite a difficult problem as some of the operations in Unstructured are not easily reversible. This would make building seed-corpora easier and also enable structured custom mutators.

Ekleog changed the title ~~Unstructured v2: TLV-based vs. Delimitor-based~~ Unstructured v2: Stop-or-Go-based vs. Delimitor-based Jan 12, 2024

Ekleog changed the title ~~Unstructured v2: Stop-or-Go-based vs. Delimitor-based~~ Unstructured v2: Stop-or-Continue-based vs. Delimitor-based Jan 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Ekleog commented Jan 12, 2024

camshaft commented Jan 13, 2024

silvergasp commented Jan 16, 2024

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Unstructured v2: Stop-or-Continue-based vs. Delimitor-based #165

Comments

Ekleog commented Jan 12, 2024

camshaft commented Jan 13, 2024

silvergasp commented Jan 16, 2024