Faster implementations #1

Open

mfalt opened this issue Sep 15, 2016 · 3 comments

Comments

@mfalt
Collaborator

mfalt commented Sep 15, 2016

As we discussed, there are several ways to speed up the implementations of the prox operators. I did some tests that may be useful in rewriting them.
These are the results for the NormL1 prox, for 100 prox operations on a vector of 1,000,000 elements.
Standard implementation

function prox(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  y = sign(x).*max(0.0, abs(x)-gamma*f.lambda)
  return y
end

In-place update to avoid big allocations:

function prox1!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  for i in eachindex(x)
    x[i] = sign(x[i])*max(0.0, abs(x[i])-gamma*f.lambda)
  end
  return x
end

Avoid bounds-checks and use simd:

function prox2!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  gf = gamma*f.lambda
  @inbounds @simd for i in eachindex(x)
    x[i] = sign(x[i])*max(0.0, abs(x[i])-gf)
  end
  return x
end

Avoid function calls and all but one multiplication:

function prox3!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  gf = gamma*f.lambda
  @inbounds @simd for i in eachindex(x)
    x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
  end
  return x
end

Use ifelse so that both branches can be evaluated without actual branching, which keeps the loop SIMD-friendly:

function prox4!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  gf = gamma*f.lambda
  @inbounds @simd for i in eachindex(x)
    x[i] += ifelse(x[i] <= -gf, gf , ifelse(x[i] >= gf , -gf , -x[i]))
  end
  return x
end

Results (each speedup is relative to the previous row):
prox:   1.557599 seconds (1.70 k allocations: 3.725 GB, 8.86% gc time)
prox1!: 0.235333 seconds (100 allocations: 1.563 KB) (6.5x speedup)
prox2!: 0.085790 seconds (100 allocations: 1.563 KB) (2.7x speedup)
prox3!: 0.080576 seconds (100 allocations: 1.563 KB) (1.05x speedup) (total 19x speedup)
prox4!: 0.074251 seconds (100 allocations: 1.563 KB) (1.08x speedup) (total 21x speedup)
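For anyone who wants to reproduce numbers in this style, a minimal self-contained harness along these lines should work (a sketch; the kernel is the same soft-thresholding update as in prox4!, inlined here without the NormL1 type):

```julia
# In-place soft-thresholding: the prox of gf*||x||_1.
function softthresh!(x::Vector{Float64}, gf::Float64)
  @inbounds @simd for i in eachindex(x)
    x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
  end
  return x
end

x = randn(1_000_000)
softthresh!(copy(x), 0.1)   # warm-up call, so compilation is not timed
@time for _ in 1:100        # 100 prox operations, as in the table above
  softthresh!(x, 0.1)
end
```

Timing the loop after a warm-up call matters, since otherwise the first call includes JIT compilation.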

@lostella
Member

lostella commented Sep 15, 2016

What exactly does @simd do? The following remark in the Julia documentation makes me suspicious for some reason:

This feature is experimental and could change or disappear in future versions of Julia.

Thank you for noting down all these tips!

@mfalt
Collaborator Author

mfalt commented Sep 16, 2016

Most processors today are able to perform one operation on a chunk of data at a time, often 8 single-precision or 4 double-precision values, and this macro mainly tells the compiler that it is all right to convert the code to use these "Single Instruction, Multiple Data" (SIMD) instructions. I am not sure how experimental this actually is, and it is easy to remove if they change the syntax for some reason. It was actually also possible to get a substantial speedup without @simd by manually unrolling the loop, something like:

function prox3!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
  gf = gamma*f.lambda
  # main loop: process four elements per iteration
  @inbounds for i in 1:4:(floor(Int64,length(x)/4)*4)
    x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
    x[i+1] += (x[i+1] <= -gf ? gf : (x[i+1] >= gf ? -gf : -x[i+1]))
    x[i+2] += (x[i+2] <= -gf ? gf : (x[i+2] >= gf ? -gf : -x[i+2]))
    x[i+3] += (x[i+3] <= -gf ? gf : (x[i+3] >= gf ? -gf : -x[i+3]))
  end

  # remainder loop: handle the last length(x) mod 4 elements
  @inbounds for i in (floor(Int64,length(x)/4)*4+1):length(x)
    x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
  end
  return x
end

but I think @simd is a better option (and faster).
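As a side note (a sketch, not part of the original discussion): one way to check whether @simd actually vectorized a loop is to inspect the emitted LLVM IR and look for vector operations such as `<4 x double>`:

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

function thresh!(x::Vector{Float64}, gf::Float64)
  @inbounds @simd for i in eachindex(x)
    x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
  end
  return x
end

# Print the LLVM IR for these argument types; on an AVX-capable CPU a
# vectorized loop shows operations on e.g. `<4 x double>` values.
@code_llvm thresh!(ones(8), 0.1)
```

Whether the vector width is 2, 4, or 8 depends on the target CPU, so the exact IR will vary from machine to machine.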

@mfalt
Collaborator Author

mfalt commented Sep 16, 2016

After reading some more about @simd (https://software.intel.com/en-us/articles/vectorization-in-julia) I was able to speed it up a bit more, and I have added the example to the original post.

2 participants