
very slow on apple silicon? #279

Open
AshtonSBradley opened this issue Nov 23, 2023 · 10 comments
@AshtonSBradley

AshtonSBradley commented Nov 23, 2023

I could have sworn this used to be much faster:

using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
using BenchmarkTools
@btime fft(a);
  26.528 ms (98490 allocations: 10.73 MiB)

FFTW.set_num_threads(1)
@btime fft(a);
  5.165 ms (6 allocations: 4.00 MiB)

Compare with FFTW installed via Python (scipy) here https://github.com/andrej5elin/howto_fftw_apple_silicon, where 4 threads take about 500 us for double precision on slightly weaker hardware.
Rosetta with MKL is also significantly (>10x) faster than FFTW.jl according to those benchmarks. Am I missing something?

julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa9485436 (2023-11-03 07:44 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PKG_DEVDIR = /Users/abradley/Dropbox/Julia/Dev
  JULIA_NUM_THREADS = 8
  JULIA_PKG_SERVER = us-west.pkg.julialang.org
  JULIA_EDITOR = code
@stevengj
Member

stevengj commented Nov 23, 2023

To get high performance out of FFTW, you need to create a plan first and then re-use it, ideally with a pre-allocated array; otherwise you pay a lot of overhead re-creating the plan every time. Note also that FFTW shares threads with Julia, so you generally need to start Julia with enough threads (e.g. julia -t 8) if you want to run FFTW multi-threaded.

(Unfortunately, the current FFTW_jll build is missing the cycle counter on Apple silicon, which disables everything but the default FFTW.ESTIMATE plan-creation mode; that should be fixed in the next release.)
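A minimal sketch of the plan-then-reuse pattern described above (variable names are illustrative; `FFTW.ESTIMATE` is the only planner mode effective on current Apple silicon builds, per the note above):

```julia
using FFTW

FFTW.set_num_threads(Threads.nthreads())  # FFTW draws workers from Julia's thread pool

a = randn(ComplexF64, 512, 512)
F = plan_fft(a; flags=FFTW.ESTIMATE)  # pay the planning cost once
b = F * a                             # apply the plan; allocates the output
@assert b ≈ fft(a)                    # same result as the one-shot fft
```

Benchmarking `F * a` (rather than `fft(a)`) measures only the transform, not repeated planning.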

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

Thanks for this

Threads.nthreads()
8
FFTW.set_num_threads(8)
F=plan_fft(a,flags=FFTW.ESTIMATE);
@btime F*a;
715.000 μs (122 allocations: 4.01 MiB)

a vast improvement. Compiled FFTW (https://github.com/andrej5elin/howto_fftw_apple_silicon) seems to manage 350 us without OpenMP on 4 threads with PATIENT, and more gains with OpenMP (for single precision, 210 us drops to 160 us on 4 threads with PATIENT).

Any scope for building with OpenMP using Apple's Clang?

Looking forward to the release!

@giordano
Member

> Any scope for building with OpenMP using Apple's Clang?

Apple Clang doesn't come with OpenMP; the only thing one could do is link an external OpenMP runtime, like LLVM's.
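For reference, a hypothetical build sketch of what "linking an external OpenMP runtime" could look like when compiling FFTW from source on macOS: pointing Apple Clang at Homebrew's `libomp` (the package name, paths, and flags here are assumptions, not a tested recipe):

```shell
# Assumes Homebrew and an FFTW source checkout; paths are illustrative.
brew install libomp
OMP=$(brew --prefix libomp)
./configure --enable-openmp --enable-threads \
    CC=clang \
    CPPFLAGS="-Xpreprocessor -fopenmp -I$OMP/include" \
    LDFLAGS="-L$OMP/lib" LIBS="-lomp"
make -j
```

This is unrelated to how FFTW_jll is actually built for Julia; see the Yggdrasil recipe linked below in the thread.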

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

@giordano
Member

Yes.

@AshtonSBradley
Author

Is there a way to inject that into `pkg> build FFTW`, or can one compile FFTW separately and have Julia find it?

@giordano
Member

The build recipe of FFTW is at https://github.com/JuliaPackaging/Yggdrasil/blob/42d73ea1c9e39c6f63bdfe065caad498257d0c6a/F/FFTW/build_tarballs.jl. At the moment OpenMP isn't used anywhere, as far as I understand; I guess that's a question for @stevengj.

@AshtonSBradley
Author

AshtonSBradley commented Nov 23, 2023

Apologies: I realise now that my earlier benchmarks must have been run in low power mode on the laptop.

After a charge the times are more comparable, but I notice that in-place planning gains almost nothing on M1, while it gives significant gains on Intel even without MKL. The slowness relative to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away, but the gap has closed: 446.69 us on 4 threads (without OpenMP) vs FFTW.jl below at 699 us on 8 threads.

using FFTW
FFTW.set_num_threads(8)
a = randn(ComplexF64,512,512);
F = plan_fft!(a,flags=FFTW.ESTIMATE)
using BenchmarkTools

2021 M1 Max

@btime fft(a);
   699.666 μs (126 allocations: 4.01 MiB)
   
@btime F*a setup = (a = randn(ComplexF64,512,512));
   626.917 μs (120 allocations: 9.44 KiB)

2019 Intel 8-core i9 (no MKL)

@btime fft(a);
   1.110 ms (126 allocations: 4.01 MiB)

@btime F*a setup = (a = randn(ComplexF64,512,512));
  373.056 μs (120 allocations: 8.44 KiB)

2019 Intel 8-core i9 (with MKL)

@btime fft(a);
  528.822 μs (6 allocations: 4.00 MiB)

@btime F*a setup = (a = randn(ComplexF64,512,512));
  261.137 μs (0 allocations: 0 bytes)

@stevengj
Member

stevengj commented Nov 24, 2023

> The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away

In that post they are using FFTW's test/bench program, which (a) defaults to FFTW.MEASURE, (b) precomputes the plans, and (c) pre-allocates the arrays. (b) can be accomplished using p = plan_fft(...), and (c) can be accomplished using mul!(output, p, input). However, (a) requires a new build of FFTW that enables a cycle counter on ARM (otherwise FFTW.MEASURE will be equivalent to FFTW.ESTIMATE).
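Putting (b) and (c) together, a minimal sketch (names are illustrative; note that FFTW.MEASURE planning may overwrite the input array while timing candidate algorithms, so the input is filled only after planning):

```julia
using FFTW, LinearAlgebra, Random

a = Array{ComplexF64}(undef, 512, 512)
out = similar(a)                       # (c) preallocated output
# (b) precompute the plan; MEASURE may scribble over `a` while planning,
#     and currently degrades to ESTIMATE on Apple silicon, as noted above
p = plan_fft(a; flags=FFTW.MEASURE)
randn!(a)                              # fill the input after planning
mul!(out, p, a)                        # allocation-free transform
@assert out ≈ fft(a)
```

Benchmarking `mul!(out, p, a)` is then the closest apples-to-apples comparison with the test/bench numbers.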

@ViralBShah
Member

Not related to this issue, but just as an FYI: Apple silicon has been added to the CI now.
