Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sliding window by backing up the loop min is busted #8140

Open
abadams opened this issue Mar 6, 2024 · 5 comments
Open

Sliding window by backing up the loop min is busted #8140

abadams opened this issue Mar 6, 2024 · 5 comments
Labels
bug dev_meeting Topic to be discussed at the next dev meeting

Comments

@abadams
Copy link
Member

abadams commented Mar 6, 2024

I wrote a fuzzer that generates random 5-stage stencil pipelines with a boundary condition on the input and then a series of upsamples, downsamples, and small blurs. It then applies a simple sliding window schedule: store_root().compute_at(output, y). It fails with incorrect output about 1% of the time.

If I disable the version of sliding window that backs up the start of the loop, there are no failures, but IIUC this version of sliding window is important for some google pipelines so I'm hesitant to just ditch it. I'll spend some time trying to fix it before I open a PR that just disables it. Here's an example pipeline that produces a different output if you compute_root everything:

ImageParam input(UInt(8), 2);
Var x{"x"}, y{"y"};
Func f[5];
f[0] = BoundaryConditions::repeat_edge(input)
f[1](x, y) = ((f[0]((x/2) + 2, (y/2) + 1) + f[0](x/2, (y/2) + 2)) + (f[0](x*2, y*2) + f[0]((x*2) + -1, y*2)));
f[2](x, y) = ((f[0](x + 1, y + 2) + f[0](x + 1, y + 1)) + (f[1]((x/2) + -2, (y/2) + -1) + f[1](x/2, (y/2) + -2)));
f[3](x, y) = ((f[1](x + -1, y + 2) + f[1](x + 1, y + -1)) + (f[1](x/2, y/2) + f[1]((x/2) + -1, (y/2) + -2)));
f[4](x, y) = ((f[3]((x*2) + 1, y*2) + f[3]((x*2) + -2, (y*2) + 2)) + (f[2](x/2, y/2) + f[2]((x/2) + 1, (y/2) + 1)));
f[0].store_root().compute_at(f[4], y);
f[1].store_root().compute_at(f[4], y);
f[2].store_root().compute_at(f[4], y);
f[3].store_root().compute_at(f[4], y);
f[4].store_root().compute_at(f[4], y);
@abadams abadams added bug dev_meeting Topic to be discussed at the next dev meeting labels Mar 6, 2024
@abadams
Copy link
Member Author

abadams commented Mar 7, 2024

A very simple failure case that is resolved by disabling the loop rewinding mode:

ImageParam input(UInt(8), 2);
Var x{"x"}, y{"y"}, yo{"yo"}, yi{"yi"};
Func f[3];
f[0] = BoundaryConditions::repeat_edge(input);
f[1](x, y) = ((f[0](x + 1, y + 1) + f[0](x + -1, y + -1))*2);
f[2](x, y) = ((f[0](x + 1, y + 1) + f[0](x + -1, y + -1)) + (f[1](x + 1, y + 1) + f[1](x + -1, y + -1)));
f[2].never_partition_all().bound(x, 0, 1).bound(y, 0, 1).split(y, yo, yi, 1, TailStrategy::RoundUp);
f[0].never_partition_all().store_at(f[2], yo).compute_at(f[2], x);
f[1].never_partition_all().store_root().compute_at(f[2], x);

Note that this schedule means all output loops are size 1.

All three of the loops yo, yi, and x get backed up, and you get:

 allocate f1$1[uint8 * 9]
 produce f2$1 {
  let t408 = (p1.min.1*p1.stride.1) + p1.min.0
  let t407 = p1.extent.1 + p1.min.1
  let t409 = p1.extent.0 + p1.min.0
  for (f2$1.s0.y.yo.$n.rebased, 0, 3) {
   let f0$1.y.min_realized.s = min(select(f2$1.s0.y.yo.$n.rebased < 2, f2$1.s0.y.yo.$n.rebased, min(f2$1.s0.y.yo.$n.rebased, 1)) + 1, f2$1.s0.y.yo.$n.rebased)
   allocate f0$1[uint8 * 30]
   let t412 = 2 <= f2$1.s0.y.yo.$n.rebased
   let t413 = f0$1.y.min_realized.s*-5
   let t410 = f2$1.s0.y.yo.$n.rebased - f0$1.y.min_realized.s
   for (f2$1.s0.y.yi.$n.rebased, 0, 4) {
    let t416 = 3 <= f2$1.s0.y.yi.$n.rebased
    let t415 = (f2$1.s0.y.yi.$n.rebased + t410)*5
    let t417 = t410*5
    let t418 = f2$1.s0.y.yo.$n.rebased*3
    let t414 = (max(min((f2$1.s0.y.yi.$n.rebased + f2$1.s0.y.yo.$n.rebased) + -2, t407) + -1, p1.min.1)*p1.stride.1) - t408
    for (f2$1.s0.x.$n.$n.rebased, 0, 5) {
     produce f0$1 {
      f0$1[f2$1.s0.x.$n.$n.rebased + t415] = p1[max(min(f2$1.s0.x.$n.$n.rebased + -1, t409) + -1, p1.min.0) + t414]
     }
     if ((4 <= f2$1.s0.x.$n.$n.rebased) && t416) {
      consume f0$1 {
       produce f1$1 {
        f1$1[t418 + 2] = (f0$1[t417 + 19] + f0$1[t417 + 7])*(uint8)2
       }
       if (t412) {
        consume f1$1 {
         f2$1[0] = f0$1[t413 + 11] + (f0$1[t413 + 23] + (f1$1[8] + f1$1[0]))
        }
       }
      }
     }
    }
   }
   free f0$1
  }
 }
 free f1$1

I have temporarily disabled storage_folding to keep things simpler. It fails with or without it.

It looks like both Funcs slid in both dimensions. f0 slid in y over the output yi loop, and slid in x over the output x loop. f1 slid in y over the output yo loop, and slid in x over the output x loop. According to debug(3), it's all using loop rewinding, despite the select, so this isn't a case of old and new style sliding interacting poorly.

The output is indeed only stored once, in the last loop iteration of all loops.

If the input is 1, then f[0] should be 1, f[1] should be 4, and f[2] should be 10. The trace looks like this:

Store f0(-2, -3) = 1
Store f0(-1, -3) = 1
Store f0(0, -3) = 1
Store f0(1, -3) = 1
Store f0(2, -3) = 1
Store f0(-2, -2) = 1
Store f0(-1, -2) = 1
Store f0(0, -2) = 1
Store f0(1, -2) = 1
Store f0(2, -2) = 1
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Load f0(2, 0) = 1
Load f0(0, -2) = 1
Store f1(1, -1) = 4
Store f0(-2, -2) = 1
Store f0(-1, -2) = 1
Store f0(0, -2) = 1
Store f0(1, -2) = 1
Store f0(2, -2) = 1
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Store f0(-2, 1) = 1
Store f0(-1, 1) = 1
Store f0(0, 1) = 1
Store f0(1, 1) = 1
Store f0(2, 1) = 1
Load f0(2, 1) = 1
Load f0(0, -1) = 1
Store f1(1, 0) = 4
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Store f0(-2, 1) = 1
Store f0(-1, 1) = 1
Store f0(0, 1) = 1
Store f0(1, 1) = 1
Store f0(2, 1) = 1
Store f0(-2, 2) = 1
Store f0(-1, 2) = 1
Store f0(0, 2) = 1
Store f0(1, 2) = 1
Store f0(2, 2) = 1
Load f0(2, 2) = 1
Load f0(0, 0) = 1
Store f1(1, 1) = 4
Load f0(-1, -1) = 1
Load f0(1, 1) = 1
Load f1(1, 1) = 4
Load f1(-1, -1) = 0
Store f2(0, 0) = 0

f1(-1, -1) is loaded without ever being stored to. Even given that the output should be 6, not 0, but maybe llvm has decided this is UB so it can store whatever it feels like. Indeed if you look at the .ll for the version without tracing, llvm has simplified the entire pipeline to ret i32 0.

Reading the IR more closely, f0 is computed over a 5x5 box (good), but f1 is only computed over a 1x3 box when x = 1. It needs to be computed over a 3x3 box. The if condition that surrounds it is wrong. That condition is: f2$1.s0.x.loop_min.orig <= f2$1.s0.x.$n.$n

It looks like two passes of sliding over x have been applied to that condition, when only one should have? At this point I'm lost. I'll call it here, and hopefully this log will be useful to me tomorrow.

@abadams
Copy link
Member Author

abadams commented Mar 7, 2024

Slightly simpler: It still fails if you only slide in x and don't have a boundary condition:

Var x{"x"}, y{"y"}, yo{"yo"}, yi{"yi"};
Func f[3];
f[0](x, y) = input(x, y);
f[1](x, y) = ((f[0](x + 1, y) + f[0](x + -1, y))*2);
f[2](x, y) = ((f[0](x + 1, y) + f[0](x + -1, y)) + (f[1](x + 1, y) + f[1](x + -1, y)));
f[2].never_partition_all().split(y, yo, yi, 1, TailStrategy::RoundUp);
output_func.output_buffer().dim(0).set_bounds(0, 1);
output_func.output_buffer().dim(1).set_bounds(0, 1);
f[0].never_partition_all().store_at(f[2], yo).compute_at(f[2], x);
f[1].never_partition_all().store_root().compute_at(f[2], x);
 produce f2$1 {
  let t107 = (p1.min.1*p1.stride.1) + p1.min.0
  for (f2$1.s0.x.$n.$n.rebased, 0, 5) {
   produce f0$1 {
    f0$1[f2$1.s0.x.$n.$n.rebased] = p1[(f2$1.s0.x.$n.$n.rebased - t107) + -2]
   }
   if (4 <= f2$1.s0.x.$n.$n.rebased) {
    consume f0$1 {
     produce f1$1 {
      f1$1[2] = (f0$1[4] + f0$1[2])*(uint8)2
     }
     consume f1$1 {
      f2$1[0] = f1$1[0] + (f1$1[2] + (f0$1[3] + f0$1[1]))
     }
    }
   }
  }
 }

abadams added a commit that referenced this issue Mar 7, 2024
Works around issue #8140

It would test hoist storage too, but that's disabled due to issue #8141
@abadams
Copy link
Member Author

abadams commented Mar 8, 2024

Incidentally I didn't just write this fuzzer for fun. It was an attempt to come up with a simpler repro for a real incorrect-output bug found when writing production code. In that code I had to revert to a slower non-sliding schedule to work around it. So even though the pipelines produced by my fuzzer are weird, this bug absolutely affects real code, which is why I'm so anxious to fix it.

@dsharlet
Copy link
Contributor

dsharlet commented Mar 8, 2024

Does that reproducer need the second dimension? It looks like y is almost unrelated (except for the store_at).

The one thing that is interesting/unique about this pipeline compared to many prior uses is that f[2] consumes f[0] and f[1]. The bug must be related to that somehow.

@abadams
Copy link
Member Author

abadams commented Mar 8, 2024

Yeah for that reproducer I think y is unrelated, unless the store_at nesting is important.

I think the production pipeline that failed also had that property. Three funcs f, g, h. g calls f, h calls f and g, f and g are both sliding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug dev_meeting Topic to be discussed at the next dev meeting
Projects
None yet
Development

No branches or pull requests

2 participants