Andrew Adams edited this page Feb 1, 2024 · 4 revisions

If I specify a compute_at() but no store_at(), does store_at() always default to matching compute_at()? (vs "unspecified")

Correct. store_at defaults to the same as compute_at.

If I specify parallel() without any explicit split() on the same function, how are splits chosen for parallelism?

Every instance of the variable is run as its own task. There's no splitting done (or the split factor is effectively 1). In the default parallel runtime (posix_thread_pool) I tried running tasks in small batches, but the per-task overhead was low enough that it was better just to treat each instance of the var as its own job. That way different cores more tightly interleave their work and you get better L2/L3 usage. If you do something silly like parallelize across the innermost dimension (usually x), you get tiny jobs and cores thrash each other's caches.

Is split() guaranteed to handle extents that are not an exact multiple of the split factor? E.g.

f.split(y, y, yi, 7); // f.height is not an even multiple of 7

Am I risking a runtime out-of-bounds access error, or will this be handled under the covers? (probably worth documenting)

For a pure function, or a reduction initialization step, f.height must be at least 7, or you get an out of bounds error, but it need not be a multiple of 7. The last group of size 7 gets pushed backwards so that it doesn't compute off the end. I.e., for a split of size 4 computing a region of size 6, we compute the elements 0 1 2 3, and then 2 3 4 5.

For a reduction update step, recomputing the same element multiple times can change the result, so the extent is rounded up to a multiple of 7 instead.

For the purpose of bounds inference, a vectorize is just a split, so this behavior is what lets you vectorize x by four even when the width of the input is not a multiple of four.
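The two tail behaviors described above can be modeled in plain C++ without Halide. This is only a sketch of the index arithmetic, not Halide's actual implementation; the function names split_shift_inwards and split_round_up_extent are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model (not Halide code): indices visited when a loop over
// [0, extent) is split by `factor` and the last group is pushed backwards
// so it stays in bounds. This is the pure-function case.
// Requires extent >= factor, matching the "at least 7" condition above.
std::vector<int> split_shift_inwards(int extent, int factor) {
    std::vector<int> visited;
    int outer_extent = (extent + factor - 1) / factor;  // number of groups
    for (int outer = 0; outer < outer_extent; outer++) {
        // Clamp the start of the group so that base + factor <= extent.
        int base = std::min(outer * factor, extent - factor);
        for (int inner = 0; inner < factor; inner++) {
            visited.push_back(base + inner);
        }
    }
    return visited;
}

// Hypothetical model of the update-step case: the extent that is actually
// computed when the split rounds up to a multiple of `factor`.
int split_round_up_extent(int extent, int factor) {
    return ((extent + factor - 1) / factor) * factor;
}
```

Calling split_shift_inwards(6, 4) visits 0 1 2 3 and then 2 3 4 5, so elements 2 and 3 are computed twice. That is harmless for a pure function, and it is the reason an update step rounds up instead.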

Loop and storage order

A few questions/assertions to hopefully confirm some details I think are accurate, but haven't found explicitly documented...

(1) If I never reorder() or reorder_storage() a Func, both processing order and memory layout will default to the order of the arguments in the Func's definition. For example, given

Func foo;
foo(c, x, y) = whatever();
foo.bound(c, 0, 3);

we'll traverse via

for (y)
  for (x)
    for (c)

with memory arranged in row-major chunky-RGB style (i.e., innermost var clumped most tightly)

(2) If I reorder() but never reorder_storage(), the memory layout is unaffected (see assertion 1). Likewise, reorder_storage() doesn't affect traversal order.

(3) If I call reorder() and/or reorder_storage() multiple times on the same Func, subsequent calls supersede previous ones entirely.

Short answer is yes to all of the above.
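As a concrete picture of the default layout in assertion (1), here is a small sketch of how a flat offset for foo(c, x, y) would be computed when c is the innermost storage dimension. It assumes densely packed storage with no padding and a min of 0 in every dimension; the helper name default_offset and its parameters are hypothetical, not part of Halide.

```cpp
#include <cstddef>

// Hypothetical helper (not a Halide API): flat element offset of
// foo(c, x, y) when storage follows the argument order, so c is innermost.
// Assumes dense storage and that each dimension starts at 0.
std::size_t default_offset(int c, int x, int y, int c_extent, int x_extent) {
    std::size_t c_stride = 1;                                   // channels are adjacent
    std::size_t x_stride = static_cast<std::size_t>(c_extent);  // next pixel in the row
    std::size_t y_stride = x_stride * static_cast<std::size_t>(x_extent);  // next row
    return static_cast<std::size_t>(c) * c_stride +
           static_cast<std::size_t>(x) * x_stride +
           static_cast<std::size_t>(y) * y_stride;
}
```

With c_extent = 3, the three channels of a pixel sit next to each other in memory and stepping x by one moves three elements, which is the chunky-RGB arrangement described in the question.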

Slightly longer answer: You don't need to specify all of the vars in the reorder (or reorder_storage) call. If you don't, it just reorders the specified vars and leaves the other ones in the same places. This means a later reorder that specifies only some of the vars doesn't completely clobber the effect of the earlier reorders. E.g. the following is legal and possibly useful:

foo(x, y, _) = g(x, y, _) + 7;
foo.reorder(y, x); 
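The partial-reorder rule can be modeled as follows: the vars you name are rewritten, in the order you give them, into the slots they already occupy, and every other var stays where it was. This is a sketch of that rule in plain C++ under the assumption that dimensions are listed innermost first; reorder_model is a hypothetical name and not a Halide function.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model (not a Halide API). `dims` lists loop variables
// innermost first. `vars` names a subset of them in the desired new
// relative order, also innermost first. Unnamed vars keep their slots.
std::vector<std::string> reorder_model(std::vector<std::string> dims,
                                       const std::vector<std::string> &vars) {
    std::vector<std::size_t> slots;
    for (const std::string &v : vars) {
        auto it = std::find(dims.begin(), dims.end(), v);
        if (it == dims.end()) {
            return dims;  // unknown var: this sketch leaves the order unchanged
        }
        slots.push_back(static_cast<std::size_t>(it - dims.begin()));
    }
    // The named vars are written back into the same slots, lowest slot first.
    std::sort(slots.begin(), slots.end());
    for (std::size_t i = 0; i < vars.size(); i++) {
        dims[slots[i]] = vars[i];
    }
    return dims;
}
```

Starting from {c, x, y}, naming (y, c) gives {y, x, c}: x does not move. A later call naming only (x, y) gives {x, y, c}, so c stays outermost and the effect of the first call is only partly undone.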

While it can make sense to specify only some of the vars, you probably shouldn't use multiple reorder calls. I can't think of a case where it isn't confusing.