Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]Generic Optimizations #79

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

AD2605
Copy link
Contributor

@AD2605 AD2605 commented Sep 25, 2023

Some generic optimizations benefiting subgroup and workgroup DFTs and bringing performance improvements on all devices

Checklist

Tick if relevant:

  • New files have a copyright
  • New headers have an include guards
  • API is documented with Doxygen
  • New functionalities are tested
  • Tests pass locally
  • Files are clang-formatted

@@ -100,12 +101,15 @@ __attribute__((always_inline)) inline void cross_sg_naive_dft(T& real, T& imag,
T res_imag = 0;

unrolled_loop<0, N, 1>([&](int idx_in) __attribute__((always_inline)) {
const T multi_re = twiddle<T>::Re[N][idx_in * idx_out % N];
const T multi_re = static_cast<T>(
sycl::cos(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) / static_cast<float>(N) in a variable, it's a lot to read.

}
return -twiddle<T>::Im[N][idx_in * idx_out % N];
return static_cast<T>(sycl::sin(static_cast<float>(-2 * M_PI) * static_cast<float>(idx_in * idx_out % N) /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negation missing

}
return -twiddle<T>::Im[N * M][k * n];
return static_cast<T>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negation missing

Comment on lines 26 to 27
#include <common/twiddle.hpp>
#include <common/twiddle_calc.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are twiddle.hpp and twiddle_calc.hpp needed?

Comment on lines 118 to 119
const T multi_re = static_cast<T>(sycl::cos(theta));
const T multi_im = static_cast<T>(sycl::sin(theta));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip the multiplication by M_PI (which - fun-fact - is not part of the standard) if we use cospi and sinpi.

Comment on lines 105 to 107
using theta_t = T;
#else
using theta_t = float;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we're doing a double precision DFT? If its acceptable to use single-precision twiddles in a double-precision DFT, do we get a speedup if we use half-precision twiddles in a single-precision DFT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not acceptable. We previously had accuracy issues with twiddles being calculated to lower precision due to fast math. Lower precision twiddles will create the same issues.

Copy link
Contributor Author

@AD2605 AD2605 Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we're doing a double precision DFT? If its acceptable to use single-precision twiddles in a double-precision DFT

Hence the introduction of PORTFFT_USE_FAST_TRIG_APPROX, that is left to the user, if they are okay with some mixed precision compute, at the cost of some accuracy ( I haven't run into it yet until now, for both float and double tests, on Nvidia, AMD and locally, maybe because other twiddles are computed at full precision on host)

do we get a speedup if we use half-precision twiddles in a single-precision DFT

No, the speedup does not come from lowering the precision for computing sin/cos, but rather the idea is to generate very specific instructions and depending on the target device, may or may not be scheduled on a different hardware.
For Example On Nvidia:

  1. With optimization level O3, and when the input to sycl::sin/cos ( I have not tested it using sycl::sinpi yet) is float. this generates sin.approx.ftz, resulting in the SASS MUFU.SIN. MUFU (previously known as SFU on pre Volta ( I Think ?)), is a separate set of ALUs who's function is to just calculate transcendental functions, and some other like square root and inverse etc etc. When sin/cos get scheduled on this (rather than using the long polynomial approximation), they use a bit less precise approximation, which is much faster to compute ( see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions for accuracy ranges). The MUFU takes much less cycles to compute(I remember reading MUFU utilizes just 4 cycles to generate the result per op per warp, but which doc was it, I dont have it at the top of my head). In addition to faster compute, this leads to ILP possibilities as sin/cos compute runs on different hardware, normal ALUs ops / shuffle ops can be performed till the time there are no data dependencies (see this line). As far as my knowledge goes, there are no special units on AMD to compute sin/cos, but this will generate V_SIN_FP32 vector operation, and this is faster than fetching each twiddle from global memory ( see Vega Instruction set for more, especially global fetching mechanisms.), hence getting a speedup there as well.
  2. This being said, faster compute should be able to beat the caching effect. For example, if MUFU.SIN is not generated, that would result in a long polynomial approximation, and is a lot lot slower than our current implementation. I am yet to compare the two approaches on GPUs which have a large amount of L1 (like 512KB) and cache loads aggressively, but lack such special transcendental computing capabilities. If our current approach is preferred on such devices, maybe I will use another CMake variable to guide the compute accordingly
  3. AMD does have V_SIN_FP16, so it could be faster, but I am shying away from fp16 compute at the moment, as it might hamper portability ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants