Sliding window by backing up the loop min is busted #8140

abadams · 2024-03-06T21:34:43Z

I wrote a fuzzer that generates random 5-stage stencil pipelines with a boundary condition on the input and then a series of upsamples, downsamples, and small blurs. It then applies a simple sliding window schedule: store_root().compute_at(output, y). It fails with incorrect output about 1% of the time.

If I disable the version of sliding window that backs up the start of the loop, there are no failures, but IIUC this version of sliding window is important for some google pipelines so I'm hesitant to just ditch it. I'll spend some time trying to fix it before I open a PR that just disables it. Here's an example pipeline that produces a different output if you compute_root everything:

ImageParam input(UInt(8), 2);
Var x{"x"}, y{"y"};
Func f[5];
f[0] = BoundaryConditions::repeat_edge(input)
f[1](x, y) = ((f[0]((x/2) + 2, (y/2) + 1) + f[0](x/2, (y/2) + 2)) + (f[0](x*2, y*2) + f[0]((x*2) + -1, y*2)));
f[2](x, y) = ((f[0](x + 1, y + 2) + f[0](x + 1, y + 1)) + (f[1]((x/2) + -2, (y/2) + -1) + f[1](x/2, (y/2) + -2)));
f[3](x, y) = ((f[1](x + -1, y + 2) + f[1](x + 1, y + -1)) + (f[1](x/2, y/2) + f[1]((x/2) + -1, (y/2) + -2)));
f[4](x, y) = ((f[3]((x*2) + 1, y*2) + f[3]((x*2) + -2, (y*2) + 2)) + (f[2](x/2, y/2) + f[2]((x/2) + 1, (y/2) + 1)));
f[0].store_root().compute_at(f[4], y);
f[1].store_root().compute_at(f[4], y);
f[2].store_root().compute_at(f[4], y);
f[3].store_root().compute_at(f[4], y);
f[4].store_root().compute_at(f[4], y);

The text was updated successfully, but these errors were encountered:

abadams · 2024-03-07T01:05:18Z

A very simple failure case that is resolved by disabling the loop rewinding mode:

ImageParam input(UInt(8), 2);
Var x{"x"}, y{"y"}, yo{"yo"}, yi{"yi"};
Func f[3];
f[0] = BoundaryConditions::repeat_edge(input);
f[1](x, y) = ((f[0](x + 1, y + 1) + f[0](x + -1, y + -1))*2);
f[2](x, y) = ((f[0](x + 1, y + 1) + f[0](x + -1, y + -1)) + (f[1](x + 1, y + 1) + f[1](x + -1, y + -1)));
f[2].never_partition_all().bound(x, 0, 1).bound(y, 0, 1).split(y, yo, yi, 1, TailStrategy::RoundUp);
f[0].never_partition_all().store_at(f[2], yo).compute_at(f[2], x);
f[1].never_partition_all().store_root().compute_at(f[2], x);

Note that this schedule means all output loops are size 1.

All three of the loops yo, yi, and x get backed up, and you get:

 allocate f1$1[uint8 * 9]
 produce f2$1 {
  let t408 = (p1.min.1*p1.stride.1) + p1.min.0
  let t407 = p1.extent.1 + p1.min.1
  let t409 = p1.extent.0 + p1.min.0
  for (f2$1.s0.y.yo.$n.rebased, 0, 3) {
   let f0$1.y.min_realized.s = min(select(f2$1.s0.y.yo.$n.rebased < 2, f2$1.s0.y.yo.$n.rebased, min(f2$1.s0.y.yo.$n.rebased, 1)) + 1, f2$1.s0.y.yo.$n.rebased)
   allocate f0$1[uint8 * 30]
   let t412 = 2 <= f2$1.s0.y.yo.$n.rebased
   let t413 = f0$1.y.min_realized.s*-5
   let t410 = f2$1.s0.y.yo.$n.rebased - f0$1.y.min_realized.s
   for (f2$1.s0.y.yi.$n.rebased, 0, 4) {
    let t416 = 3 <= f2$1.s0.y.yi.$n.rebased
    let t415 = (f2$1.s0.y.yi.$n.rebased + t410)*5
    let t417 = t410*5
    let t418 = f2$1.s0.y.yo.$n.rebased*3
    let t414 = (max(min((f2$1.s0.y.yi.$n.rebased + f2$1.s0.y.yo.$n.rebased) + -2, t407) + -1, p1.min.1)*p1.stride.1) - t408
    for (f2$1.s0.x.$n.$n.rebased, 0, 5) {
     produce f0$1 {
      f0$1[f2$1.s0.x.$n.$n.rebased + t415] = p1[max(min(f2$1.s0.x.$n.$n.rebased + -1, t409) + -1, p1.min.0) + t414]
     }
     if ((4 <= f2$1.s0.x.$n.$n.rebased) && t416) {
      consume f0$1 {
       produce f1$1 {
        f1$1[t418 + 2] = (f0$1[t417 + 19] + f0$1[t417 + 7])*(uint8)2
       }
       if (t412) {
        consume f1$1 {
         f2$1[0] = f0$1[t413 + 11] + (f0$1[t413 + 23] + (f1$1[8] + f1$1[0]))
        }
       }
      }
     }
    }
   }
   free f0$1
  }
 }
 free f1$1

I have temporarily disabled storage_folding to keep things simpler. It fails with or without it.

It looks like both Funcs slid in both dimensions. f0 slid in y over the output yi loop, and slid in x over the output x loop. f1 slid in y over the output yo loop, and slid in x over the output x loop. According to debug(3), it's all using loop rewinding, despite the select, so this isn't a case of old and new style sliding interacting poorly.

The output is indeed only stored once, in the last loop iteration of all loops.

If the input is 1, then f[0] should be 1, f[1] should be 4, and f[2] should be 10. The trace looks like this:

Store f0(-2, -3) = 1
Store f0(-1, -3) = 1
Store f0(0, -3) = 1
Store f0(1, -3) = 1
Store f0(2, -3) = 1
Store f0(-2, -2) = 1
Store f0(-1, -2) = 1
Store f0(0, -2) = 1
Store f0(1, -2) = 1
Store f0(2, -2) = 1
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Load f0(2, 0) = 1
Load f0(0, -2) = 1
Store f1(1, -1) = 4
Store f0(-2, -2) = 1
Store f0(-1, -2) = 1
Store f0(0, -2) = 1
Store f0(1, -2) = 1
Store f0(2, -2) = 1
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Store f0(-2, 1) = 1
Store f0(-1, 1) = 1
Store f0(0, 1) = 1
Store f0(1, 1) = 1
Store f0(2, 1) = 1
Load f0(2, 1) = 1
Load f0(0, -1) = 1
Store f1(1, 0) = 4
Store f0(-2, -1) = 1
Store f0(-1, -1) = 1
Store f0(0, -1) = 1
Store f0(1, -1) = 1
Store f0(2, -1) = 1
Store f0(-2, 0) = 1
Store f0(-1, 0) = 1
Store f0(0, 0) = 1
Store f0(1, 0) = 1
Store f0(2, 0) = 1
Store f0(-2, 1) = 1
Store f0(-1, 1) = 1
Store f0(0, 1) = 1
Store f0(1, 1) = 1
Store f0(2, 1) = 1
Store f0(-2, 2) = 1
Store f0(-1, 2) = 1
Store f0(0, 2) = 1
Store f0(1, 2) = 1
Store f0(2, 2) = 1
Load f0(2, 2) = 1
Load f0(0, 0) = 1
Store f1(1, 1) = 4
Load f0(-1, -1) = 1
Load f0(1, 1) = 1
Load f1(1, 1) = 4
Load f1(-1, -1) = 0
Store f2(0, 0) = 0

f1(-1, -1) is loaded without ever being stored to. Even given that the output should be 6, not 0, but maybe llvm has decided this is UB so it can store whatever it feels like. Indeed if you look at the .ll for the version without tracing, llvm has simplified the entire pipeline to ret i32 0.

Reading the IR more closely, f0 is computed over a 5x5 box (good), but f1 is only computed over a 1x3 box when x = 1. It needs to be computed over a 3x3 box. The if condition that surrounds it is wrong. That condition is: f2$1.s0.x.loop_min.orig <= f2$1.s0.x.$n.$n

It looks like two passes of sliding over x have been applied to that condition, when only one should have? At this point I'm lost. I'll call it here, and hopefully this log will be useful to me tomorrow.

abadams · 2024-03-07T19:45:09Z

Slightly simpler: It still fails if you only slide in x and don't have a boundary condition:

Var x{"x"}, y{"y"}, yo{"yo"}, yi{"yi"};
Func f[3];
f[0](x, y) = input(x, y);
f[1](x, y) = ((f[0](x + 1, y) + f[0](x + -1, y))*2);
f[2](x, y) = ((f[0](x + 1, y) + f[0](x + -1, y)) + (f[1](x + 1, y) + f[1](x + -1, y)));
f[2].never_partition_all().split(y, yo, yi, 1, TailStrategy::RoundUp);
output_func.output_buffer().dim(0).set_bounds(0, 1);
output_func.output_buffer().dim(1).set_bounds(0, 1);
f[0].never_partition_all().store_at(f[2], yo).compute_at(f[2], x);
f[1].never_partition_all().store_root().compute_at(f[2], x);

 produce f2$1 {
  let t107 = (p1.min.1*p1.stride.1) + p1.min.0
  for (f2$1.s0.x.$n.$n.rebased, 0, 5) {
   produce f0$1 {
    f0$1[f2$1.s0.x.$n.$n.rebased] = p1[(f2$1.s0.x.$n.$n.rebased - t107) + -2]
   }
   if (4 <= f2$1.s0.x.$n.$n.rebased) {
    consume f0$1 {
     produce f1$1 {
      f1$1[2] = (f0$1[4] + f0$1[2])*(uint8)2
     }
     consume f1$1 {
      f2$1[0] = f1$1[0] + (f1$1[2] + (f0$1[3] + f0$1[1]))
     }
    }
   }
  }
 }

Works around issue #8140 It would test hoist storage too, but that's disabled due to issue #8141

abadams · 2024-03-08T03:22:43Z

Incidentally I didn't just write this fuzzer for fun. It was an attempt to come up with a simpler repro for a real incorrect-output bug found when writing production code. In that code I had to revert to a slower non-sliding schedule to work around it. So even though the pipelines produced by my fuzzer are weird, this bug absolutely affects real code, which is why I'm so anxious to fix it.

dsharlet · 2024-03-08T03:51:36Z

Does that reproducer need the second dimension? It looks like y is almost unrelated (except for the store_at).

The one thing that is interesting/unique about this pipeline compared to many prior uses is that f[2] consumes f[0] and f[1]. The bug must be related to that somehow.

abadams · 2024-03-08T04:20:58Z

Yeah for that reproducer I think y is unrelated, unless the store_at nesting is important.

I think the production pipeline that failed also had that property. Three funcs f, g, h. g calls f, h calls f and g, f and g are both sliding.

abadams added bug dev_meeting Topic to be discussed at the next dev meeting labels Mar 6, 2024

abadams added a commit that referenced this issue Mar 7, 2024

Add a fuzzer for sliding window schedules

49a14c8

Works around issue #8140 It would test hoist storage too, but that's disabled due to issue #8141

abadams mentioned this issue Mar 7, 2024

Add a fuzzer for sliding window schedules #8144

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sliding window by backing up the loop min is busted #8140

Sliding window by backing up the loop min is busted #8140

abadams commented Mar 6, 2024

abadams commented Mar 7, 2024

abadams commented Mar 7, 2024

abadams commented Mar 8, 2024

dsharlet commented Mar 8, 2024

abadams commented Mar 8, 2024

Sliding window by backing up the loop min is busted #8140

Sliding window by backing up the loop min is busted #8140

Comments

abadams commented Mar 6, 2024

abadams commented Mar 7, 2024

abadams commented Mar 7, 2024

abadams commented Mar 8, 2024

dsharlet commented Mar 8, 2024

abadams commented Mar 8, 2024