
very slow on apple silicon? #279

Open
AshtonSBradley opened this issue Nov 23, 2023 · 10 comments
@AshtonSBradley

AshtonSBradley commented Nov 23, 2023

I could have sworn this used to be much faster:

using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
using BenchmarkTools
@btime fft(a);
  26.528 ms (98490 allocations: 10.73 MiB)

FFTW.set_num_threads(1)
@btime fft(a);
  5.165 ms (6 allocations: 4.00 MiB)

Compare with FFTW installed via Python (scipy) here https://github.com/andrej5elin/howto_fftw_apple_silicon, where 4 threads take about 500 us for double precision on slightly weaker hardware.
Rosetta with MKL is also significantly (>10x) faster than FFTW.jl according to those benchmarks. Am I missing something?

julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa9485436 (2023-11-03 07:44 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PKG_DEVDIR = /Users/abradley/Dropbox/Julia/Dev
  JULIA_NUM_THREADS = 8
  JULIA_PKG_SERVER = us-west.pkg.julialang.org
  JULIA_EDITOR = code
@stevengj
Member

stevengj commented Nov 23, 2023

To get high performance out of FFTW, you need to create a plan first and then re-use it, ideally with a pre-allocated array; otherwise you pay a lot of overhead re-creating the plan every time. Note also that FFTW shares threads with Julia, so you generally need to start Julia with enough threads (e.g. julia -t 8) if you want to run FFTW multi-threaded.

(Unfortunately, the current FFTW_jll build is missing the cycle counter on Apple silicon, which disables everything but the default FFTW.ESTIMATE plan-creation mode; that should be fixed in the next release.)
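A minimal sketch of the plan-then-reuse pattern described above (variable names are illustrative; `FFTW.ESTIMATE` is the only planner mode effective on current Apple silicon builds, per the note above):

```julia
using FFTW

FFTW.set_num_threads(Threads.nthreads())  # FFTW draws workers from Julia's thread pool

a = randn(ComplexF64, 512, 512)
F = plan_fft(a; flags=FFTW.ESTIMATE)  # pay the planning cost once
b = F * a                             # apply the plan; allocates the output
@assert b ≈ fft(a)                    # same result as the one-shot fft
```

Benchmarking `F * a` (rather than `fft(a)`) measures only the transform, not repeated planning.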

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

Thanks for this

Threads.nthreads()
8
FFTW.set_num_threads(8)
F=plan_fft(a,flags=FFTW.ESTIMATE);
@btime F*a;
715.000 μs (122 allocations: 4.01 MiB)

a vast improvement. Compiled FFTW (https://github.com/andrej5elin/howto_fftw_apple_silicon) seems to manage 350 us without OpenMP on 4 threads with PATIENT, and more gains with OpenMP (for single precision, 210 us drops to 160 us on 4 threads with PATIENT).

Any scope for building with OpenMP using Apple's Clang?

Looking forward to the release!

@giordano
Member

> Any scope for building with OpenMP using Apple's Clang?

Apple Clang doesn't come with OpenMP; the only thing one could do is link an external OpenMP runtime, like LLVM's.
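For reference, a hypothetical build sketch of what "linking an external OpenMP runtime" could look like when compiling FFTW from source on macOS: pointing Apple Clang at Homebrew's `libomp` (the package name, paths, and flags here are assumptions, not a tested recipe):

```shell
# Assumes Homebrew and an FFTW source checkout; paths are illustrative.
brew install libomp
OMP=$(brew --prefix libomp)
./configure --enable-openmp --enable-threads \
    CC=clang \
    CPPFLAGS="-Xpreprocessor -fopenmp -I$OMP/include" \
    LDFLAGS="-L$OMP/lib" LIBS="-lomp"
make -j
```

This is unrelated to how FFTW_jll is actually built for Julia; see the Yggdrasil recipe linked below in the thread.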

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

@giordano
Member

Yes.

@AshtonSBradley
Author

Is there a way to inject that into `pkg> build FFTW`, or can one compile FFTW separately and have Julia find it?

@giordano
Member

The build recipe of FFTW is at https://github.com/JuliaPackaging/Yggdrasil/blob/42d73ea1c9e39c6f63bdfe065caad498257d0c6a/F/FFTW/build_tarballs.jl. At the moment OpenMP isn't used anywhere, as far as I understand; I guess that's a question for @stevengj.

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

Apologies: I realise now that my earlier benchmarks must have been run in low power mode on the laptop.

After a charge the times are more comparable, but I notice that in-place planning gains almost nothing on M1, while it gives significant gains on Intel even without MKL. The slowness relative to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away, but the gap has closed: 446.69 us on 4 threads (without OpenMP) vs FFTW.jl below at 699 us on 8 threads.

using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
F = plan_fft!(a,flags=FFTW.ESTIMATE)
using BenchmarkTools

2021 M1 Max

@btime fft(a);
   699.666 μs (126 allocations: 4.01 MiB)
   
@btime F*a setup = (a = randn(ComplexF64,512,512));
   626.917 μs (120 allocations: 9.44 KiB)

2019 Intel 8-core i9 (no MKL)

@btime fft(a);
   1.110 ms (126 allocations: 4.01 MiB)

@btime F*a setup = (a = randn(ComplexF64,512,512));
  373.056 μs (120 allocations: 8.44 KiB)

2019 Intel 8-core i9 (with MKL)

@btime fft(a);
  528.822 μs (6 allocations: 4.00 MiB)

@btime F*a setup = (a = randn(ComplexF64,512,512));
  261.137 μs (0 allocations: 0 bytes)

@stevengj
Member

stevengj commented Nov 24, 2023

> The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away

In that post they are using FFTW's test/bench program, which (a) defaults to FFTW.MEASURE, (b) precomputes the plans, and (c) pre-allocates the arrays. (b) can be accomplished using p = plan_fft(...), and (c) can be accomplished using mul!(output, p, input). However, (a) requires a new build of FFTW that enables a cycle counter on ARM (otherwise FFTW.MEASURE will be equivalent to FFTW.ESTIMATE).
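Putting (b) and (c) together, a minimal sketch (names are illustrative; note that FFTW.MEASURE planning may overwrite the input array while timing candidate algorithms, so the input is filled only after planning):

```julia
using FFTW, LinearAlgebra, Random

a = Array{ComplexF64}(undef, 512, 512)
out = similar(a)                       # (c) preallocated output
# (b) precompute the plan; MEASURE may scribble over `a` while planning,
#     and currently degrades to ESTIMATE on Apple silicon, as noted above
p = plan_fft(a; flags=FFTW.MEASURE)
randn!(a)                              # fill the input after planning
mul!(out, p, a)                        # allocation-free transform
@assert out ≈ fft(a)
```

Benchmarking `mul!(out, p, a)` is then the closest apples-to-apples comparison with the test/bench numbers.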

@ViralBShah
Member

Not related to this issue, but just as an FYI: Apple silicon has been added to the CI now.
