
Performance issue with Zygote.jl-generated function within @turbo #489

Open
MilesCranmer opened this issue Apr 30, 2023 · 2 comments
MilesCranmer commented Apr 30, 2023

I am trying to understand a performance issue I am seeing in DynamicExpressions.jl where using @turbo makes the evaluation kernels 4x faster, but makes the derivative kernels 10% slower. See the detailed benchmarks here: SymbolicML/DynamicExpressions.jl#28 (comment)

My derivative kernels look like this:

@maybe_turbo turbo for j in indices((cumulator, dcumulator))
    x = op(cumulator[j])::T
    dx = diff_op(cumulator[j])::T * dcumulator[j]

    cumulator[j] = x
    dcumulator[j] = dx
end

(The `@maybe_turbo turbo ...` expands to `@turbo ...` when `turbo=true`, and to `@inbounds @simd ...` otherwise. It also strips the various type assertions from the loop body.)
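
For concreteness, here is a minimal, runnable sketch of what the `turbo=false` branch of that kernel reduces to. The function name `deriv_kernel!` and the concrete `op`/`diff_op` pair are stand-ins for illustration, not the actual DynamicExpressions.jl code:

```julia
# Sketch of the derivative kernel after @maybe_turbo expands with turbo=false:
# @turbo is replaced by @inbounds @simd and the ::T assertions are stripped.
function deriv_kernel!(cumulator::Vector{T}, dcumulator::Vector{T}, op::F, diff_op::G) where {T,F,G}
    @inbounds @simd for j in eachindex(cumulator, dcumulator)
        # Read both old values before writing, matching the original kernel.
        x = op(cumulator[j])
        dx = diff_op(cumulator[j]) * dcumulator[j]
        cumulator[j] = x
        dcumulator[j] = dx
    end
    return nothing
end
```

For example, `deriv_kernel!(c, dc, cos, x -> -sin(x))` applies `cos` forward while accumulating the chain-rule factor `-sin` into `dc`.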

To create the diff_op, I generate it using Zygote.jl here:

for op in unary_operators
    diff_op(x) = gradient(op, x)[1]
    push!(diff_unary_operators, diff_op)
end
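
A package-free sketch of the same closure-per-operator pattern, where a hypothetical `fd_derivative` (a central finite difference) stands in for `Zygote.gradient` purely for illustration:

```julia
# Stand-in for Zygote.gradient: central finite difference (illustration only).
fd_derivative(op::F, x; h=1e-6) where {F} = (op(x + h) - op(x - h)) / (2h)

unary_operators = (cos, exp)
diff_unary_operators = Function[]
for op in unary_operators
    # Each iteration creates a fresh closure capturing its own `op`.
    diff_op(x) = fd_derivative(op, x)
    push!(diff_unary_operators, diff_op)
end
```

Each stored `diff_op` is its own closure type, so downstream code that specializes on the operator (as `@turbo` kernels do) sees a distinct function per operator.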

I can try to create an MWE for this, but I first wanted to check whether anything obvious in how I am using @turbo here might hurt performance rather than help it. For example, perhaps this diff_op is not being inlined correctly, and therefore not being optimized by @turbo? For the record, I am not seeing any warnings about the derivative operator being incompatible, so I'm not quite sure why this is occurring.

Also, the diff_op in the benchmark is the derivative of one of +, -, *, /, cos, or exp, so nothing too exotic.

@chriselrod (Member) commented:

If you only care about operators like +, -, *, /, cos, and exp, you could instead use ForwardDiff, following the approach I used in SimpleChains.jl.

Because LoopVectorization.@turbo works on Expr at a high level, it only understands primitive operations, and it conservatively assumes that any operation it doesn't recognize is expensive.
It can use the type of an operator to figure out what it is, but it won't introspect a more complicated function.
So it should be possible to do better than the DualEval example above.
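
To illustrate the idea behind that approach (a hedged sketch, not SimpleChains' actual code): evaluate the primal and the derivative in a single pass using a hand-rolled dual number, so the scalar operations visible to the vectorizer are just plain `cos`, `sin`, `*`, and `+` rather than an opaque call:

```julia
# Minimal dual number: a value plus one derivative component.
struct MiniDual{T}
    val::T
    der::T
end

# Product and chain rules for the two ops we need in this sketch.
Base.:*(a::MiniDual, b::MiniDual) = MiniDual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.cos(a::MiniDual) = MiniDual(cos(a.val), -sin(a.val) * a.der)

# Derivative of `op` at x: seed the derivative slot with one(x).
dual_derivative(op::F, x) where {F} = op(MiniDual(x, one(x))).der
```

Since `dual_derivative` bottoms out in ordinary scalar arithmetic, a vectorizer that only recognizes primitive operations can still see through it once everything inlines.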

Note that ChainRules are defined on LoopVectorization.vmap.

I suggest poking around with Cthulhu to make sure there aren't any accidental type instabilities. Turning optimizations on in Cthulhu will also show you what is and isn't inlined.

@MilesCranmer (Contributor, Author) commented:

Thanks for the quick and detailed response!

I've been trying my best to understand the code you sent, but it is a bit outside of my wheelhouse and many of these calls are new to me. ForwardDiff.Dual doesn't seem to have docs available. From what I can guess, the code is trying to expand ForwardDiff.derivative within @turbo so that it sees the exact operators before it tries to vectorize them?

Here's a MWE I have:

function f(x, op::F, turbo) where {F}
    y = similar(x)
    if turbo
        @turbo for i in eachindex(x)
            y[i] = op(x[i])
        end
    else
        @inbounds @simd for i in eachindex(x)
            y[i] = op(x[i])
        end
    end
    return y
end

We can see that turbo=true improves this significantly if I just pass in cos:

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=cos);
  19.250 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=cos);
  117.708 μs (2 allocations: 78.17 KiB)

But when I pass in ForwardDiff.derivative of cos, the performance gains are not as good, despite the fact that op should effectively just be x -> -sin(x):

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=Base.Fix1(ForwardDiff.derivative, cos));
  57.083 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=Base.Fix1(ForwardDiff.derivative, cos));
  143.625 μs (2 allocations: 78.17 KiB)

Maybe this is not possible and I should try Enzyme.jl instead?
