Subgroup Level Bluestein Algorithm #145

AD2605 · 2024-03-14T07:58:50Z

Adds the Bluestein algorithm at the subgroup level and enables the following support in portFFT:

1D prime-sized DFTs which, after padding, fit in the subgroup level, with the following data layouts:
- Supports SPLIT_COMPLEX and INTERLEAVED_COMPLEX storage schemes
- Supports PACKED and BATCH_INTERLEAVED storage
2D and 3D FFTs where each prime dimension, after padding, must fit in the subgroup level
- Supports SPLIT_COMPLEX and INTERLEAVED_COMPLEX storage schemes
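
For readers unfamiliar with the padding mentioned above, here is a minimal host-side sketch of Bluestein's algorithm. It is not the portFFT kernel: `fft_pow2`, `bluestein_dft` and `naive_dft` are hypothetical names, and the padded length is assumed to be the smallest power of two that is at least 2N-1. The DFT of length N is rewritten as a circular convolution of length M with a chirp, so that only power-of-two FFTs are needed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
constexpr double kPi = 3.14159265358979323846;

// Recursive radix-2 FFT for power-of-two sizes (unnormalised in both directions).
inline void fft_pow2(std::vector<cd>& a, bool inverse) {
  const std::size_t n = a.size();
  if (n <= 1) return;
  std::vector<cd> even(n / 2), odd(n / 2);
  for (std::size_t i = 0; i < n / 2; ++i) {
    even[i] = a[2 * i];
    odd[i] = a[2 * i + 1];
  }
  fft_pow2(even, inverse);
  fft_pow2(odd, inverse);
  const double sign = inverse ? 1.0 : -1.0;
  for (std::size_t k = 0; k < n / 2; ++k) {
    const double angle = sign * 2.0 * kPi * static_cast<double>(k) / static_cast<double>(n);
    const cd t = std::polar(1.0, angle) * odd[k];
    a[k] = even[k] + t;
    a[k + n / 2] = even[k] - t;
  }
}

// Forward DFT of arbitrary length n, computed through a padded length m >= 2n - 1.
inline std::vector<cd> bluestein_dft(const std::vector<cd>& x) {
  const std::size_t n = x.size();
  assert(n > 0);
  std::size_t m = 1;
  while (m < 2 * n - 1) m *= 2;  // padded (power-of-two) length

  // chirp[k] = exp(-i * pi * k^2 / n); k^2 is reduced mod 2n to keep the angle small.
  std::vector<cd> chirp(n);
  for (std::size_t k = 0; k < n; ++k) {
    const std::size_t k2 = (k * k) % (2 * n);
    chirp[k] = std::polar(1.0, -kPi * static_cast<double>(k2) / static_cast<double>(n));
  }

  // a = input times chirp, zero padded; b = conjugate chirp, wrapped around circularly.
  std::vector<cd> a(m, cd(0.0, 0.0)), b(m, cd(0.0, 0.0));
  for (std::size_t k = 0; k < n; ++k) a[k] = x[k] * chirp[k];
  b[0] = std::conj(chirp[0]);
  for (std::size_t k = 1; k < n; ++k) b[k] = b[m - k] = std::conj(chirp[k]);

  // Circular convolution of a and b via the convolution theorem.
  fft_pow2(a, false);
  fft_pow2(b, false);
  for (std::size_t i = 0; i < m; ++i) a[i] *= b[i];
  fft_pow2(a, true);

  // Multiply by the chirp again and apply the 1/m inverse-FFT normalisation.
  std::vector<cd> out(n);
  for (std::size_t k = 0; k < n; ++k) out[k] = chirp[k] * a[k] / static_cast<double>(m);
  return out;
}

// O(n^2) reference DFT, used only to check the sketch.
inline std::vector<cd> naive_dft(const std::vector<cd>& x) {
  const std::size_t n = x.size();
  std::vector<cd> out(n, cd(0.0, 0.0));
  for (std::size_t k = 0; k < n; ++k) {
    for (std::size_t j = 0; j < n; ++j) {
      const double angle = -2.0 * kPi * static_cast<double>((j * k) % n) / static_cast<double>(n);
      out[k] += x[j] * std::polar(1.0, angle);
    }
  }
  return out;
}
```

With N = 7 the sketch pads to M = 16, which is the kind of "prime size that fits after padding" the description refers to; how the subgroup kernels distribute that padded FFT across work-items is not shown here.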

Checklist

Tick if relevant:

…prime tests

Rbiessy · 2024-03-14T09:34:37Z

Thanks for the PR, I won't have time to review it myself until next week

hjabird

I think this would be easier to review if some of the changes to transfer functions were reverted.

.clang-tidy

src/portfft/committed_descriptor_impl.hpp

hjabird · 2024-03-14T10:18:03Z

src/portfft/committed_descriptor_impl.hpp

    Idx used_sg_size;
    Idx num_batches_in_l2;
    Idx num_factors;
+    bool is_prime;


Now that we have length and committed_length, what does this (and e.g. num_factors) refer to?

The 'length' and 'committed_length' fields are used in the calculation of twiddles and in setting some spec constants, such as the FFT size.
Further down the control flow, when we need to calculate twiddles, I would otherwise have needed to repeatedly compare length and committed_length to check whether there is any padding. Instead, this flag is set once in the constructor of dimension_struct itself, indicating that a prime algorithm is used.

I'm lost - I don't think you've really answered my question of whether num_factors is the number of factors of committed_length or length?

Sorry, I forgot to add the line about num_factors:
num_factors is the number of factors of length.
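
To make the distinction concrete, here is a minimal sketch under the assumptions stated in this thread: committed_length is the length the user put in the descriptor, length is the size that is actually computed (padded when a prime algorithm is used), and is_prime is decided once at construction. `dimension_sketch`, `is_prime_number` and `bluestein_padded_length` are hypothetical names, not the portFFT definitions, and the sketch pads every prime, which is a simplification; the real criterion for when padding applies is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Trial-division primality test; adequate for the small sizes in this sketch.
inline bool is_prime_number(std::size_t n) {
  if (n < 2) return false;
  for (std::size_t d = 2; d * d <= n; ++d) {
    if (n % d == 0) return false;
  }
  return true;
}

// Smallest power of two that is at least 2n - 1 (assumed padding rule).
inline std::size_t bluestein_padded_length(std::size_t n) {
  assert(n > 0);
  std::size_t m = 1;
  while (m < 2 * n - 1) m *= 2;
  return m;
}

struct dimension_sketch {
  std::size_t committed_length;  // length requested by the user in the descriptor
  std::size_t length;            // length actually computed; padded if is_prime
  bool is_prime;                 // set once here, so later code need not compare the two lengths

  explicit dimension_sketch(std::size_t user_length)
      : committed_length(user_length), length(user_length), is_prime(is_prime_number(user_length)) {
    if (is_prime) {
      length = bluestein_padded_length(user_length);
    }
  }
};
```

Under these assumptions, code that computes twiddles can branch on is_prime alone instead of checking length != committed_length at every use.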

src/portfft/committed_descriptor_impl.hpp

test/unit_test/fft_test_utils.hpp

src/portfft/utils.hpp

src/portfft/defines.hpp

hjabird · 2024-03-14T16:04:12Z

src/portfft/common/transfers.hpp

+ * @param global_data global_data_struct associated with the kernel launch
+ */
+template <detail::level Group, detail::transfer_direction Direction, Idx SubgroupSize, typename LocView, typename T>
+PORTFFT_INLINE void local_global_packed_copy(T* global_ptr, LocView& loc_view, IdxGlobal global_offset,


Most of these new transfer functions are refactoring that - as far as I can tell - is not required for the Bluestein implementation. I'd prefer it if they could be reverted, so that we could discuss the course we want to take with them in a separate PR in the future.

I could not have done Bluestein without the refactoring, as it would otherwise have been very difficult to debug. The number of lines in subgroup_impl has been reduced significantly; previously I found myself scrolling for some time, only to find that I had traversed a small fraction of the entire kernel!

Also, quite a lot of the refactoring was necessary. For example, the sg_bluestein functions need to load/store from/to local memory, and sg_bluestein_packed_copy uses local_global_packed_copy.
Without the refactor, this would have resulted in a massive amount of duplicated code (repeated view creations).

Also, we can keep improving the refactor whenever we come back to it later, but at least it leaves the code in a usable state. I have applied the same to the workgroup-level Bluestein as well.

I agree. Refactoring should be in a separate PR. Nothing here looks essential for Bluestein.

I still think this should be a separate PR, even if the code might look ugly.

On principle, 1 PR should only do 1 thing.

We used to have lots of transfer functions, do we want that again?

Separating out functions like this makes it harder to unify the implementation later.

On principle, 1 PR should only do 1 thing.

...Ideally, given sufficient time and resources (including ease of implementation) to implement the goal of the PR. This is also why I have always advocated for a refactor-as-you-go approach before putting up the PR.

We used to have lots of transfer functions, do we want that again?

If anything, they make the code more verbose, and you can read the order of our operations as if reading English phrases, i.e. copy_global_to_local --> local_to_private --> compute --> private_to_local --> local_to_global. All the complexity of how many views need to be created is shifted elsewhere. We can keep propagating this to our other sections later on, and adjust if required.

Sorry for taking a firm stance here, but I do not expect to revert this change, given the time crunch, the number of things I would need to change, and the absence of any downside that I can see.

Just catching up on this comment, I think Hugh made good points here. We're also trying to push for splitting the PRs as much as possible and discussing early when a large change is needed. Given that we have already refactored transfer functions multiple times we would have needed long discussions which we don't have time for anymore.

I strongly suggest that we don't bother merging Bluestein in main but instead:

Keep all the Bluestein work in a bluestein branch.

Create and close the other 2 PRs for Bluestein that you had planned, if they are already ready. They could be useful if we try to merge this into main again in the future.

src/portfft/dispatcher/subgroup_dispatcher.hpp

.clang-tidy

t4c1 · 2024-03-14T10:25:27Z

src/portfft/committed_descriptor_impl.hpp

    std::size_t length;
+    // The committed length for the particular dimension, will be different from length in the case of bluestein and
+    // radar fft algorithms
+    std::size_t committed_length;


Suggested change

std::size_t committed_length;

std::size_t padded_length;

The other length is also committed, making this variable name confusing.

See #145 (comment)

I still think committed_length is confusing. How about ct_length (standing for Cooley-Tukey)?

committed_length is not the Cooley-Tukey length. The committed length is the user-provided length (i.e. the length of the dimension as given in the descriptor).

In that case the other one could be ct_length. We just need something to make it obvious which is which.

It cannot be ct_length; we should not tie it to the name of an algorithm, as it isn't always Cooley-Tukey.

I believe length, along with its description above, should be sufficient.

Which other algorithm can it be?

src/portfft/committed_descriptor_impl.hpp

src/portfft/common/subgroup.hpp

t4c1 · 2024-03-15T09:39:03Z

src/portfft/common/transfers.hpp

+                                             Idx local_offset, Idx n_elements_to_copy,
+                                             detail::global_data_struct<1>& global_data) {
+  global_data.log_message(__func__, "storage scheme: INTERLEAVED_COMPLEX");
+  if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {


All this function does is branch on the template parameter. Replace the calls to this function with direct calls to global2local or local2global. Then you can delete this function.

I know this is just an extra step to call local2global / global2local, but the idea here was to abstract away view creation at the call site.

I think views on the call site make code significantly easier to understand.

View creation at the call site adds a lot of extra lines. I have been able to shave ~120 lines off the subgroup_dispatcher function simply by moving the view creation elsewhere. It had become very difficult for me to go through the code and make changes in the required places.

Yes, but that was at a cost of reducing code readability and adding double that number of lines to transpose. Not worth it in my opinion.

The views are still being used, only the construction of views is now shifted to a separate function, and rightfully so

What transposition are you referring to?

Sorry that was a typo. I meant transfers.hpp.
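
For reference, the pattern being debated in this thread can be sketched as follows. This is a simplified illustration, not the code in transfers.hpp: `copy_direction`, `global_to_local_sketch`, `local_to_global_sketch` and `directed_copy` are hypothetical names, and plain pointers stand in for the views and the global_data argument.

```cpp
#include <cassert>
#include <cstddef>

enum class copy_direction { GLOBAL_TO_LOCAL, LOCAL_TO_GLOBAL };

// Stand-ins for the two existing directional copies.
template <typename T>
void global_to_local_sketch(const T* global, T* local, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) local[i] = global[i];
}

template <typename T>
void local_to_global_sketch(const T* local, T* global, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) global[i] = local[i];
}

// The wrapper under discussion: the direction is a template parameter and the
// branch is resolved at compile time, so only one of the two calls is instantiated.
template <copy_direction Direction, typename T>
void directed_copy(T* global, T* local, std::size_t n) {
  assert(global != nullptr && local != nullptr);
  if constexpr (Direction == copy_direction::GLOBAL_TO_LOCAL) {
    global_to_local_sketch(global, local, n);
  } else {
    local_to_global_sketch(local, global, n);
  }
}
```

The reviewer's alternative is to drop `directed_copy` and call the two directional functions directly, keeping any view construction visible at the call site; the author's position is that the wrapper keeps that construction out of the dispatcher.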

t4c1 · 2024-03-15T09:41:18Z

src/portfft/common/transfers.hpp

+                                             IdxGlobal global_offset, Idx local_offset, Idx local_imag_offset,
+                                             Idx n_elements_to_copy, detail::global_data_struct<1>& global_data) {
+  global_data.log_message(__func__, "storage scheme: SPLIT_COMPLEX");
+  if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {


Same comment as for the previous function. Actually the same can be done for the other new functions in this file.

Combining this comment with the one above

t4c1 · 2024-03-15T09:50:46Z

src/portfft/common/transfers.hpp

+ * @param global_data global_data_struct associated with the kernel launch
+ */
+template <detail::level Group, detail::transfer_direction Direction, Idx SubgroupSize, typename LocView, typename T>
+PORTFFT_INLINE void local_global_packed_copy(T* global_ptr, LocView& loc_view, IdxGlobal global_offset,


I agree. Refactoring should be in a separate PR. Nothing here looks essential for Bluestein.

t4c1 · 2024-03-15T10:02:54Z

src/portfft/dispatcher/subgroup_dispatcher.hpp

+    for (Idx i = 0; i < factor_sg; i++) {
+      for (Idx j = 0; j < factor_wi; j++) {
+        // Not using sycl::cospi / sycl::sinpi as std::cos/std::sin provides better accuracy in float and double tests
+        // Also why this was moved to host, this way the tolerance value needs to be bumped up by a smaller value
+        double theta = -2 * M_PI * static_cast<double>(i * j) / static_cast<double>(factor_wi * factor_sg);
+        auto twiddle = std::complex<Scalar>(static_cast<Scalar>(std::cos(theta)), static_cast<Scalar>(std::sin(theta)));
+        host_twiddles[static_cast<std::size_t>(j * factor_sg + i)] = twiddle.real();
+        host_twiddles[static_cast<std::size_t>((j + factor_wi) * factor_sg + i)] = twiddle.imag();
+      }
+    }


Why are you moving calculation of twiddles to host?

Much better precision; I could lower the threshold value as a result.

What threshold?

The threshold which is calculated in the tests.

I am not following - can you point me to the code?
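
For context, the loop in the diff above can be restated as a standalone host function. This is a sketch only: `host_twiddles_sketch` is a hypothetical name, plain int stands in for Idx, and a literal constant replaces M_PI, which is not guaranteed by the C++ standard. The angle is evaluated in double and only the final cosine and sine are narrowed to Scalar, which is the property the author's precision argument relies on; the sketch does not itself measure the accuracy difference against device-side sycl::cospi / sycl::sinpi.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Real parts occupy the first factor_wi * factor_sg entries and imaginary
// parts the second half, matching the indexing used in the diff.
template <typename Scalar>
std::vector<Scalar> host_twiddles_sketch(int factor_wi, int factor_sg) {
  assert(factor_wi > 0 && factor_sg > 0);
  constexpr double pi = 3.14159265358979323846;
  const std::size_t n = static_cast<std::size_t>(factor_wi) * static_cast<std::size_t>(factor_sg);
  std::vector<Scalar> twiddles(2 * n);
  for (int i = 0; i < factor_sg; ++i) {
    for (int j = 0; j < factor_wi; ++j) {
      // Angle computed in double regardless of Scalar.
      const double theta = -2.0 * pi * static_cast<double>(i * j) / static_cast<double>(n);
      twiddles[static_cast<std::size_t>(j * factor_sg + i)] = static_cast<Scalar>(std::cos(theta));
      twiddles[static_cast<std::size_t>((j + factor_wi) * factor_sg + i)] = static_cast<Scalar>(std::sin(theta));
    }
  }
  return twiddles;
}
```

The resulting buffer would then be copied to the device once at commit time; that copy is not shown here.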

t4c1 · 2024-03-19T08:23:57Z

src/portfft/common/subgroup_bluestein.hpp

+
+  // TODO: Currently local memory is being used to load the data back in natural order for the backward phase, as the
+  // result of sg_dft is transposed. However, the ideal way to this is using shuffles. Implement a batched matrix
+  // transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we


Suggested change

// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we

// transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. This we

t4c1 · 2024-03-19T08:24:08Z

src/portfft/common/subgroup_bluestein.hpp

+  // TODO: Currently local memory is being used to load the data back in natural order for the backward phase, as the
+  // result of sg_dft is transposed. However, the ideal way to this is using shuffles. Implement a batched matrix
+  // transpose to transpose a matrix stored in the private memory of workitems of a subgroup using shuffles only. his we
+  // way can even avoid the 2 sg_bluestein functions that we have today


Which 2 functions?

sg_bluestein packed and sg_bluestein batch_interleaved

How does that change let us avoid them?

t4c1 · 2024-03-19T08:26:44Z

src/portfft/common/transfers.hpp

+                                             Idx local_offset, Idx n_elements_to_copy,
+                                             detail::global_data_struct<1>& global_data) {
+  global_data.log_message(__func__, "storage scheme: INTERLEAVED_COMPLEX");
+  if constexpr (Direction == detail::transfer_direction::GLOBAL_TO_LOCAL) {


Yes, but that was at a cost of reducing code readability and adding double that number of lines to transpose. Not worth it in my opinion.

t4c1 · 2024-03-20T09:07:26Z

src/portfft/common/subgroup_bluestein.hpp

+}
+}  // namespace portfft
+
+#endif


Newline at the end of the file is missing.

AD2605 added 17 commits March 8, 2024 10:55

subgroup level batch_interleaved and packed bluestein algorithm working

cef2ad0

resolve warnings and add backward and multi-dim subgroup level small …

8cc05db

…prime tests

further changes

5237a37

modifier offset calculation bugfix

5292f41

initialize local memory with zeros to avoid nans

a3aa5d6

format

5b89d10

not copy in between an aligned pointer

b0d1661

prevent OOB read/writes in packed format

5afe834

prevent OOB read writes in PACKED bluestein condition

8a83350

Merge remote-tracking branch 'origin/main' into atharva/sg_wg_bluestein

eb3291f

refactor

011e780

bugfix after refactor

8c3b40b

doxygens and logging, and lower tolerance value

20e3c78

remove unused shuffle_transpose function

ae929d3

remove unused bluestein header from workgroup_dispatcher.hpp

708893c

remove unused headers from subgroup.hpp

594d224

add missing array header

5f1ab4d

Rbiessy requested a review from t4c1 March 14, 2024 09:34

AD2605 added 2 commits March 14, 2024 10:37

slightly bump tolerance value for tests to pass on Nvidia

99d8cfb

format

243c793

hjabird reviewed Mar 14, 2024


t4c1 reviewed Mar 15, 2024


AD2605 added 3 commits March 18, 2024 14:48

review comments 1

3a77953

updated doxygens

cfd2ab8

addressed missed comments from first round of review

d0e705d

t4c1 reviewed Mar 19, 2024


prevent OOB read from global memory

823b84f

t4c1 reviewed Mar 20, 2024

