
pkg/fuzzer: corpus progs with non-reproducible coverage #4639

Open
a-nogikh opened this issue Apr 4, 2024 · 3 comments

a-nogikh commented Apr 4, 2024

We regularly notice that syzkaller is not able to reproduce 100% of the accumulated corpus coverage after every restart. The same effect is visible on syzbot: https://syzkaller.appspot.com/upstream/graph/fuzzing?Instances=ci-upstream-kasan-gce-root&Metrics=TriagedCoverage&Months=1

The problem

The actual problem begins well before a syzkaller instance is restarted. At fuzzing time, not all new inputs that syzkaller adds to the corpus actually reliably reproduce the signal (coverage) they were credited with.

Here's a small experiment that tries to shed more light on the problem: a-nogikh@a4859a7

  • Clone a program after it was deflaked.
  • After minimization is done, run the minimized and the original program once more and see whether they actually reproduce info.newStableSignal.

I ran it on a local syzkaller instance that had quite a good accumulated corpus (~22K programs), so it only captured the newly found programs -- it's similar to what our syzbot instances do.

Whether the reproduced signal was exactly the same (new = after minimize, old = before minimize).

| Name | Value |
|------|-------|
| A: signal == target : new false, old false | 81 |
| B: signal == target : new false, old true | 222 |
| C: signal == target : new true, old false | 191 |
| D: signal == target : new true, old true | 1240 |

The non-minimized program gave the same signal in (B+D)/(A+B+C+D) = 84.3% of cases.
If the original program was stable, the minimized program succeeded in D/(D+B) = 84.8% of cases.
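These ratios follow directly from the counts in the table above; here is a quick sanity check in Python:

```python
# Counts A..D from the table above.
A, B, C, D = 81, 222, 191, 1240
total = A + B + C + D

# Share of cases where the non-minimized (old) program gave the same signal.
old_same = (B + D) / total
print(f"old reproduced: {old_same:.1%}")  # -> 84.3%

# Given that the original was stable, how often the minimized one also was.
min_given_old = D / (B + D)
print(f"minimized | old stable: {min_given_old:.1%}")  # -> 84.8%
```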

So is 84% just the average probability of programs reproducing any coverage after 3 runs?

Looking at the syzbot stats, I also see that the triaged coverage is usually 85-90% of the previous maximum of the corpus coverage.

Whether we have reproduced at least any of the new signal (new = after minimize, old = before minimize).

| Name | Value |
|------|-------|
| E: signal > 0 : new false, old false | 75 |
| F: signal > 0 : new false, old true | 215 |
| G: signal > 0 : new true, old false | 189 |
| H: signal > 0 : new true, old true | 1255 |

The values look very similar to the previous table.
So, in almost all cases, we either reproduce all of the new coverage, or none of it?

What do we do?

If the reproduction probability is as high as 80+%, it does not feel like we should discard such programs (or try to prevent them from reaching the corpus in the first place).

At the same time, we don't want to retry every program too many times -- corpus triage already takes 1-3 hours on our syzbot instances, and adding more iterations would only increase that.

@a-nogikh a-nogikh added the bug label Apr 4, 2024

a-nogikh commented Apr 4, 2024

If we do 4 runs in triageJob.deflake():

| Name | Value |
|------|-------|
| A: signal == target : new false, old false | 67 |
| B: signal == target : new false, old true | 139 |
| C: signal == target : new true, old false | 115 |
| D: signal == target : new true, old true | 844 |

The non-minimized programs reproduced their coverage (B+D)/(A+B+C+D) = 84.4% of the time.

If we do 5 runs in triageJob.deflake():

| Name | Value |
|------|-------|
| A: signal == target : new false, old false | 150 |
| B: signal == target : new false, old true | 513 |
| C: signal == target : new true, old false | 423 |
| D: signal == target : new true, old true | 3203 |

That's 86%.

So adding more runs doesn't change the ratio much. I assume, then, that the majority of inputs just behave this way, and by running them more times during triage we just pick the luckiest, not the most stable ones?
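The shares can be recomputed from the two tables above to confirm how little extra runs move the needle:

```python
def same_signal_share(a, b, c, d):
    """Share of cases where the non-minimized program reproduced its signal:
    (B + D) / (A + B + C + D), with counts taken from the tables above."""
    return (b + d) / (a + b + c + d)

print(f"4 runs: {same_signal_share(67, 139, 115, 844):.1%}")    # -> 84.4%
print(f"5 runs: {same_signal_share(150, 513, 423, 3203):.1%}")  # -> 86.6%
```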


Some more calculations. Since I already had a corpus, my data is somewhat skewed towards the less stable inputs -- the stable ones are already in the corpus.

So let's assume that the actual figure is higher and inputs reproduce with a 90% probability. On the way from being a corpus.db seed to being a triaged corpus item, each input must successfully run 4 times: once as a candidate and then 3 times in deflake(). That gives a total probability of success of 0.9^4 ≈ 66%.

We feed all of corpus.db twice, so the final probability of getting each input is 1-(1-0.6561)^2 = 88%. That actually looks quite similar to what we observe on syzbot.

Even if the initial probability were 95%, we would lose ~4% of the corpus each restart.
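The back-of-envelope model above can be written out explicitly. This is only a sketch of that calculation; the parameters (4 required successful runs, 2 feeds of corpus.db) are the ones stated in the comment:

```python
def survival_probability(p, runs_per_pass=4, passes=2):
    """Probability that a seed with per-run reproduction probability p
    survives at least one of `passes` triage attempts, where each attempt
    needs `runs_per_pass` consecutive successful executions
    (1 candidate run + 3 deflake() runs)."""
    per_pass = p ** runs_per_pass
    return 1 - (1 - per_pass) ** passes

print(f"p=0.90: {survival_probability(0.90):.1%}")  # -> 88.2%
print(f"p=0.95: {survival_probability(0.95):.1%}")  # -> 96.6%, i.e. ~3-4% loss
```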


a-nogikh commented Apr 4, 2024

Another experiment:

Take corpus.db from syzbot, execute these programs as candidates, deflake() new signal with 3 runs (as we do now) and then repeatedly run them to estimate the probability of reproducing info.newStableSignal for each particular input.

I need to accumulate more data (I'll attach a distribution to this comment later), but from what I see now, the median probability is around 90-95%, just like in the calculations above.

UPD:

~400 runs per triaged corpus prog.

(Attached: histogram of per-prog coverage reproduction probabilities.)

  • 52% of corpus progs reproduce coverage with a 95-100% probability.
  • 11% of corpus progs reproduce coverage with 90-95%.
  • 8% of corpus progs reproduce coverage with 85-90%.
  • 9% of corpus progs reproduce coverage with 80-85%.
  • 4.5% of corpus progs reproduce coverage with 70-75%.
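As a rough cross-check, the mean reproduction probability over just the buckets listed above (taking each bucket at its midpoint; the remaining ~15% of progs are not listed here) lands near the median claimed earlier:

```python
# (share of progs, bucket midpoint probability) from the distribution above.
buckets = [(0.52, 0.975), (0.11, 0.925), (0.08, 0.875),
           (0.09, 0.825), (0.045, 0.725)]

weighted = sum(share * p for share, p in buckets)
listed = sum(share for share, _ in buckets)
print(f"mean repro probability over listed buckets: {weighted / listed:.1%}")
# -> ~93%, consistent with the 90-95% median estimate
```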


a-nogikh commented Apr 8, 2024

Some data from a local instance that was running for several days.

Main sources of flaky coverage (i.e. signal that failed deflake()):

sendmsg$nl_route: 25.35% (total triaged=4930, flaky signal=7090)
openat: 9.06% (total triaged=3589, flaky signal=4515)
pwritev2: 8.85% (total triaged=1638, flaky signal=3362)
.extra: 10.06% (total triaged=1391, flaky signal=3102)
mkdirat: 6.91% (total triaged=1606, flaky signal=2796)
fallocate: 11.98% (total triaged=935, flaky signal=2477)
syz_mount_image$ext4: 5.20% (total triaged=2056, flaky signal=2430)
sendmmsg$inet: 10.57% (total triaged=1987, flaky signal=2422)
syz_emit_ethernet: 34.10% (total triaged=1557, flaky signal=2181)
close_range: 15.45% (total triaged=1230, flaky signal=1897)
sendfile: 9.13% (total triaged=865, flaky signal=1841)
bpf$PROG_LOAD: 21.54% (total triaged=2052, flaky signal=1819)
madvise: 9.50% (total triaged=1474, flaky signal=1806)
pread64: 22.15% (total triaged=1133, flaky signal=1554)
connect$inet: 7.55% (total triaged=1046, flaky signal=1381)
mmap: 7.12% (total triaged=1082, flaky signal=1322)
syz_mount_image$udf: 5.41% (total triaged=1183, flaky signal=1306)
syz_mount_image$hfsplus: 3.78% (total triaged=1086, flaky signal=1244)
syz_mount_image$vfat: 4.03% (total triaged=1068, flaky signal=1231)
connect$inet6: 8.40% (total triaged=1059, flaky signal=1230)
getsockopt$inet_sctp6_SCTP_SOCKOPT_CONNECTX3: 4.11% (total triaged=1046, flaky signal=1229)
syz_mount_image$ntfs3: 2.60% (total triaged=923, flaky signal=1139)
ioctl$FITRIM: 26.40% (total triaged=250, flaky signal=1125)
ioctl$sock_inet_SIOCSIFFLAGS: 13.16% (total triaged=585, flaky signal=1074)
unshare: 15.45% (total triaged=220, flaky signal=1073)
mount$9p_fd: 22.24% (total triaged=652, flaky signal=1067)
read$FUSE: 19.41% (total triaged=876, flaky signal=1062)
mknodat: 12.01% (total triaged=791, flaky signal=1042)
ioctl$SIOCSIFMTU: 15.24% (total triaged=164, flaky signal=994)
sendmmsg$inet6: 18.10% (total triaged=884, flaky signal=985)
sendmsg$IPSET_CMD_SAVE: 1.65% (total triaged=182, flaky signal=964)
syz_mount_image$btrfs: 2.35% (total triaged=638, flaky signal=950)
openat$cdrom: 0.46% (total triaged=217, flaky signal=939)
syz_mount_image$xfs: 4.03% (total triaged=447, flaky signal=935)
syz_mount_image$hfs: 4.13% (total triaged=678, flaky signal=882)

The percentage is the share of successful triageJob() runs for new signal for the particular call.
