
sum|mapreduce versus unrolled for-loop. performance disparity #20517

Closed
felixrehren opened this issue Feb 8, 2017 · 3 comments
Labels
performance Must go faster

@felixrehren (Contributor) commented Feb 8, 2017

mapreduce can be much slower than the equivalent for-loop. A small (real-life) example:

```julia
function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
    end
    B
end

function dfor(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        d = z
        for k in 1:j
            d += A[j,k]*A[k,j]
        end
        B[j] = d
    end
    B
end
```

```julia
using BenchmarkTools

A = randn(127,127)
time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
```

gives me a performance ratio of about x50 on Julia 0.5 (juliabox.com). I think this could be because the for-loop can be automatically SIMD-vectorized while the mapreduce isn't. For `A = randn(N,N)`, the gap is around x75 at N = 16 and around x25 at N = 10000. Replacing the array access `A[j,k]` with `A[rand(1:size(A,1)),rand(1:size(A,2))]` destroys the performance of both versions, but the ratio then becomes x1.

  1. Is SIMD the reason one version is x50 faster?
  2. Should this be described in Performance Tips? mapreduce underlies sum, so this could be a popular trap that isn't currently mentioned.
  3. Would this be a useful benchmark on nanosoldier?
  4. Could the performance gap be made smaller?

(Benchmarking mapreduce against for-loops without array access, I still see a x2 performance gap, e.g. `mapreduce(identity, +, 0, i for i in 1:n)` versus the equivalent integer-summing for loop. It looks like this gap used to be smaller. Worth another benchmark in CI?)
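For concreteness, a minimal sketch of that integer-summing comparison. The names `sum_mr` and `sum_loop` are illustrative, not from the report, and the four-argument `mapreduce(f, op, v0, itr)` is the 0.5-era signature used throughout this thread:

```julia
# Hypothetical micro-benchmark pair for the parenthetical comparison above.
# 0.5-era four-argument mapreduce, with 0 as the initial value:
sum_mr(n) = mapreduce(identity, +, 0, 1:n)

# The equivalent hand-written integer-summing loop:
function sum_loop(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end

# Both compute the same value (e.g. 500500 for n = 1000);
# only their speed differs.
```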

@KristofferC (Member) commented Feb 8, 2017

Looks like dup of #15276.

Doing:

```julia
function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j::Int in 1:n
        B[j] = _help(A, j, z)
    end
    B
end

_help(A, j, z) = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
```

gives

```julia
julia> time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
1.0013213312412255
```

You can see the problem with `@code_warntype`: look for `Core.Box`.
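For readers following along, a minimal sketch of that diagnosis and of the other common workaround for #15276, rebinding the captured variable with `let`. The name `dsum_let` is illustrative, and the code keeps the 0.5-era constructors used in this thread:

```julia
# Inspecting for the boxed capture (run against the original dsum):
#   @code_warntype dsum(randn(4, 4))
# A closure field typed Core.Box means inference gave up on the captured j.
#
# Besides the function barrier above, a `let j = j` inside the loop gives
# the closure a fresh, never-reassigned binding to capture:
function dsum_let(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)          # 0.5-era uninitialized constructor
    @inbounds for j in 1:n
        B[j] = let j = j              # fresh binding; avoids Core.Box
            mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
        end
    end
    B
end
```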

@kshyatt kshyatt added the performance Must go faster label Feb 8, 2017
@stevengj stevengj added the kind:potential benchmark Could make a good benchmark in BaseBenchmarks label Feb 10, 2017
@stevengj (Member) commented:
+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.

@KristofferC (Member) commented:
Closing as dup of #15276.
