idempotency optimization on initialized outputs #353

Open
kylebd99 opened this issue Dec 13, 2023 · 6 comments
Labels: priority (Do this first)

@kylebd99 (Collaborator)

In my project, I'm using the instance AST to programmatically build a Finch kernel, but I've been running into an issue where it sometimes needlessly densifies fibers. The following code should reproduce the issue: output_fiber comes out fully dense even though it should just be a copy of X, which is sparse. I'm probably just not building the AST correctly, but it seems like an issue of Finch not recognizing the idempotency of overwrite.

using Finch
using Finch.FinchNotation: index_instance, tag_instance, variable_instance, literal_instance,
                            Reader, Updater, access_instance, assign_instance, loop_instance,
                            Dimensionless, declare_instance, block_instance

function test_fill()
    # Build a sparse input and construct the instance AST for output_fiber[i, j] = X[i, j] by hand.
    X = fsprand((1000, 1000), .5)
    indices = [index_instance(:i), index_instance(:j)]
    X_tag = tag_instance(variable_instance(:X), X)
    X_access = access_instance(X_tag, literal_instance(Reader()), indices...)
    X_instance = X_access

    # Sparse output with a fill value of 0.0.
    output_fiber = Fiber(SparseHashLevel{1}(SparseHashLevel{1}(Element(0.0))))
    output_tag = tag_instance(variable_instance(:output_fiber), output_fiber)
    output_access = access_instance(output_tag, literal_instance(Updater()), indices...)
    full_prgm = assign_instance(output_access, overwrite, X_instance)

    # Wrap the assignment in loops over i and j.
    for index in indices
        full_prgm = loop_instance(index, Dimensionless(), full_prgm)
    end

    # Declare (initialize) the output to 0.0 before the loops run.
    initializer = declare_instance(tag_instance(variable_instance(:output_fiber), output_fiber), literal_instance(0.0))
    full_prgm = block_instance(initializer, full_prgm)

    output_fiber = Finch.execute(full_prgm).output_fiber
    println("Input Size: ", countstored(X))
    println("Output Size: ", countstored(output_fiber))
    return output_fiber
end

test_fill()
@willow-ahrens (Owner)

Okay, I took a look:

The first thing to say is that overwrite(a, b) = b, which has no identity. By contrast, initwrite(z)(a, b) = (b == z ? a : b), so z is an identity of initwrite(z). Finch substitutes x[i] = y with x[i] <<initwrite(default(x))>>= y by default. In a sense, = asserts that writing the default to a tensor has no effect. You can get the kernel to behave as expected by using initwrite(0.0).
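
For the reproducer above, a minimal sketch of the one changed line (assuming 0.0 is the output's fill value and that initwrite can be passed to assign_instance the same way overwrite is above; the rest of the program is unchanged):

# Use initwrite(0.0) instead of overwrite, so that writing the fill value 0.0
# is a no-op and the sparse output is not densified.
full_prgm = assign_instance(output_access, initwrite(0.0), X_instance)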

A similar issue arises in this non-reducing kernel when we use max instead of overwrite: Finch doesn't know that 0 is an identity of max, so it doesn't simplify the update. If the right-hand side had a default of -Inf, the kernel would work as expected.

Ultimately, these behaviors are a consequence of deciding never to reason about the current value held in a tensor. We don't do constant propagation through arrays, and we don't even try to reason about whether something is the first write to a tensor after initialization.

@willow-ahrens (Owner) commented Dec 14, 2023

A similar issue arises during reduction: it's difficult to apply idempotent simplifications if the idempotent reduction is not in the innermost loop.

Consider the reducing kernel:

function test_fill()
    X = Fiber!(Dense(Dense(SparseList(Element(0.0)))), fsprand((100, 100, 100), .001))
    indices = [index_instance(:i), index_instance(:j), index_instance(:k)]
    X_tag = tag_instance(variable_instance(:X), X)
    X_access = access_instance(X_tag, literal_instance(Reader()), indices...)
    X_instance = X_access
    output_fiber = Fiber(SparseHashLevel{1}(SparseHashLevel{1}(Element(-Inf))))
    output_tag = tag_instance(variable_instance(:output_fiber), output_fiber)
    output_access = access_instance(output_tag, literal_instance(Updater()), indices[2], indices[3])
    full_prgm = assign_instance(output_access, literal_instance(max), X_instance)

    for index in indices
        full_prgm = loop_instance(index, Dimensionless(), full_prgm)
    end

    initializer = declare_instance(tag_instance(variable_instance(:output_fiber), output_fiber), literal_instance(-Inf))
    full_prgm = block_instance(initializer, full_prgm)

    output_fiber = Finch.execute(full_prgm).output_fiber
    println("Input Size: ", countstored(X))
    println("Output Size: ", countstored(output_fiber))
    Finch.execute_code(:prgm, typeof(full_prgm)) |> Finch.pretty |> Finch.dataflow
end

The above code will always densify the output, because Finch doesn't know whether a negative number has previously been written to a given output entry. Moreover, we can only apply an idempotent simplification when the reduction is in the innermost loop.

For example, if we do:

for i = _
   for j = _
      x[i] <<max>>= y[j, i]
   end
end

then we can eliminate contiguous maxes with 0 in the innermost loop.

However, in

for i = _
   for j = _
      x[j] <<max>>= y[j, i]
   end
end

each update in the inner loop modifies a single, separate entry of x. No idempotency can be identified, because we can't reason about the current value of that entry of x or whether it has already been maxed with 0, so we have to perform every update just to be safe.

I think there is likely no way to get Finch to apply the idempotent simplification unless the reduction is in the innermost loop.

However, I do think you could keep the output from densifying by defining an operation analogous to initwrite for max. You could define initmax(z)(a, b) = (b <= z ? a : max(a, b)), which you could correctly declare to have the property isidentity(initmax(z), z) = true.
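
A minimal sketch of what that might look like in plain Julia; the InitMax struct, the initmax constructor, and the exact Finch.isidentity registration hook are assumptions here, not existing Finch API:

# Hypothetical initwrite-style wrapper for max: updates with values <= z are no-ops.
struct InitMax{Z}
    z::Z
end
(f::InitMax)(a, b) = b <= f.z ? a : max(a, b)
initmax(z) = InitMax(z)

# Declare that z is an identity of initmax(z), so Finch's simplifier can drop
# updates that write the fill value (assumed extension point; the real hook may differ).
Finch.isidentity(::Finch.AbstractAlgebra, f::InitMax, x) = x == f.z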

@willow-ahrens (Owner)

As far as resolving this issue goes, what do you think we need to do here? Some options:

  • Describe this in the docs
  • Issue a warning when the output is getting densified
  • Issue a warning when the operations are densified

On the latter two options, I think that we should just ask the user whether they intended the kernel to be sparse. However, I'm open to suggestions about how to detect that it became dense @wraith1995 @nullplay.

@kylebd99 (Collaborator, Author)

Thanks for looking into this and for the detailed explanation. The behavior makes a lot more sense now.

This may be a bit of a pain or not feasible in the rewrite infrastructure, but I think that reasoning about the initialization value of x would be enough to capture some of the big cases. If x is initialized with default(y) and f is abelian (which seems like a fairly common case), then the idempotency optimization should apply to arbitrary loop nests. Under these assumptions, I think that

x .= default(y) 
for i = _
   for j = _
      x[i] <<f>>= y[j, i]
   end
end

is equivalent to

x .= default(y) 
for i = _
   for j = _
      x[i] <<f>>= f(y[j, i], default(y))
   end
end

So we can always assume that x[i] has already been "maxed with the default value". This requires analyzing initialization values, so I'm not sure whether it's worth it per se, but I think it avoids reasoning about "current values"?

On warnings, I lean towards more warnings whenever the performance is asymptotically worse than might reasonably be expected, especially because the user can always turn on fast mode to suppress them. So whatever version of these warnings is feasible would be helpful. To avoid over-warning for dense kernels, you could detect when the inputs are all dense and screen the warnings out in that case.

On the docs front, this seems like a good addition to the performance section. In general, a small section on the operations and situations that result in densification might be useful, similar to the section on concordant iteration. Also, a small note: the docs on assign here are where I picked up the overwrite function. It might be helpful to describe initwrite there and explain that it results in simple assignment.

Thanks again!

@willow-ahrens willow-ahrens changed the title Manual Program Instance Causing Densification idempotency optimization on outputs Dec 20, 2023
@willow-ahrens willow-ahrens changed the title idempotency optimization on outputs idempotency optimization on initialized outputs Dec 20, 2023
@willow-ahrens willow-ahrens added the priority Do this first label Dec 20, 2023
@wraith1995 (Collaborator) commented Dec 21, 2023

Two cents on dense detection: if a sparse level is used, we could look at all the indices used to write to it and ask whether they correspond to loops in the generated code whose work is at least the size of that level. We could probably write a fairly naive work estimate that assigns while loops O(1) and for loops O(n).
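
For illustration, a minimal sketch of that naive cost model over plain Julia Exprs rather than Finch's actual IR; the function name, the node handling, and the O(1)-per-while-loop charge are assumptions following the comment above, not existing Finch code:

# Naive work estimate: report the polynomial degree in n of a loop nest,
# charging each for loop O(n) iterations and each while loop O(1) total.
work_degree(x) = 0
function work_degree(ex::Expr)
    inner = isempty(ex.args) ? 0 : maximum(work_degree, ex.args)
    if ex.head == :for
        return 1 + inner   # a for loop multiplies its body's work by O(n)
    elseif ex.head == :while
        return 0           # the naive model charges a whole while loop O(1)
    else
        return inner
    end
end

# Two nested for loops around an assignment give degree 2, i.e. O(n^2) work.
work_degree(:(for i = 1:n; for j = 1:n; x[i] += y[j, i]; end; end))  # == 2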

@willow-ahrens (Owner)

At the very least, if all the inputs are sparse, we could replace them with their default and double check that the code is a no-op?

We might also run the same check when any input is sparse, rather than just when they all are.

What happens when we put two statements in a loop, and one of the statements is dense while the other is sparse?

Perhaps the appropriate check is to simply have @finch_checksparse that replaces all of the sparse levels with their defaults and then prints the simplified code, and leave it to the user to interpret what went wrong, if anything.

@willow-ahrens willow-ahrens added this to the v0.7.0 milestone Feb 27, 2024
@willow-ahrens willow-ahrens self-assigned this Apr 22, 2024