New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sliding window by backing up the loop min is busted #8140
Comments
A very simple failure case that is resolved by disabling the loop rewinding mode:
Note that this schedule means all output loops are size 1. All three of the loops yo, yi, and x get backed up, and you get:
I have temporarily disabled storage_folding to keep things simpler. It fails with or without it. It looks like both Funcs slid in both dimensions. f0 slid in y over the output yi loop, and slid in x over the output x loop. f1 slid in y over the output yo loop, and slid in x over the output x loop. According to debug(3), it's all using loop rewinding, despite the select, so this isn't a case of old and new style sliding interacting poorly. The output is indeed only stored once, in the last loop iteration of all loops. If the input is 1, then f[0] should be 1, f[1] should be 4, and f[2] should be 10. The trace looks like this:
f1(-1, -1) is loaded without ever being stored to. Even given that the output should be 6, not 0, but maybe llvm has decided this is UB so it can store whatever it feels like. Indeed if you look at the .ll for the version without tracing, llvm has simplified the entire pipeline to Reading the IR more closely, f0 is computed over a 5x5 box (good), but f1 is only computed over a 1x3 box when x = 1. It needs to be computed over a 3x3 box. The if condition that surrounds it is wrong. That condition is: It looks like two passes of sliding over x have been applied to that condition, when only one should have? At this point I'm lost. I'll call it here, and hopefully this log will be useful to me tomorrow. |
Slightly simpler: It still fails if you only slide in x and don't have a boundary condition:
|
Incidentally I didn't just write this fuzzer for fun. It was an attempt to come up with a simpler repro for a real incorrect-output bug found when writing production code. In that code I had to revert to a slower non-sliding schedule to work around it. So even though the pipelines produced by my fuzzer are weird, this bug absolutely affects real code, which is why I'm so anxious to fix it. |
Does that reproducer need the second dimension? It looks like y is almost unrelated (except for the store_at). The one thing that is interesting/unique about this pipeline compared to many prior uses is that f[2] consumes f[0] and f[1]. The bug must be related to that somehow. |
Yeah for that reproducer I think y is unrelated, unless the store_at nesting is important. I think the production pipeline that failed also had that property. Three funcs f, g, h. g calls f, h calls f and g, f and g are both sliding. |
I wrote a fuzzer that generates random 5-stage stencil pipelines with a boundary condition on the input and then a series of upsamples, downsamples, and small blurs. It then applies a simple sliding window schedule:
store_root().compute_at(output, y)
. It fails with incorrect output about 1% of the time.If I disable the version of sliding window that backs up the start of the loop, there are no failures, but IIUC this version of sliding window is important for some google pipelines so I'm hesitant to just ditch it. I'll spend some time trying to fix it before I open a PR that just disables it. Here's an example pipeline that produces a different output if you compute_root everything:
The text was updated successfully, but these errors were encountered: