Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add complex.hlsl #592

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

add complex.hlsl #592

wants to merge 14 commits into from

Conversation

nahiim
Copy link
Contributor

@nahiim nahiim commented Nov 3, 2023

Description

Start drafting complex.hlsl

@devshgraphicsprogrammingjenkins
Copy link
Contributor

[CI]: Can one of the admins verify this patch?

Comment on lines 17 to 19
// Fast Fourier Transform
template<typename T, uint32_t N>
void fft_radix2(complex::complex_t<T> data[N], bool is_inverse)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the is_inverse a template parameter

assert(N<gl_SubgroupSize)

btw the data need not be data[N], each invocation comes with its own register value and you can cooperatively exchange via the use of subgroupShuffleXor (add it to https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/hlsl/glsl_compat/subgroup_shuffle.hlsl please).

I mean the original GLSL code is here
https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/glsl/subgroup/fft.glsl

// Cooley-Tukey
for (uint32_t stride = 2; stride <= N; stride *= 2)
{
complex::complex_t<T> twiddle_factor = { cos((is_inverse ? 2.0 : -2.0) * 3.14 / stride), sin((is_inverse ? 2.0 : -2.0) * 3.14 / stride) };

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the proper full FP64 value of PI

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btwa nice piece of trivia @Fletterio , only GLSL defines fp64 transcendentals, HLSL doesn't (the sin,cos etc. functions always return float32, and if you try to use the GL.450.std extended instruction set to compute cos(double) this will end up emitting Float64 capability which is a definite DO NOT WANT)

So these definitely need to be tabulated as very precise constants and C-casted into the correct nbl::hlsl::scalar_type_t<T> so we don't accidentally emit Float64 capability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really following - we want to use the full PI value in f64, but not emit f64 capability?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you want to cast the constant into a type you'd be using, see the Gauss Legendre / Quadrature header stuff that Erfan is using for example 62 CAD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm about the tabulation... its not a perf win compared to just calling sin and cos

for now just assert that T for the FFT complex numbers is NOT float64_t because that opens a whole can of worms of std.glsl spir-v intrinsic header

or if you feel like adding sin and cos in doubles, do it here:

[[vk::ext_instruction(GLSLstd450MatrixInverse)]]

@devshgraphicsprogramming
Copy link
Member

devshgraphicsprogramming commented May 14, 2024

I see a certain ladder of features, and each should have their own branch (so I can merge in order without having all comments addressed):

  1. having a complex_t struct to use
  2. subgroup::fft utilities for DIT and DIF
  3. workgroup::fft utilities for DIT and DIF
  4. a way to do multiple input items per-workgroup, i.e. do a 2048 sized FFT with a 512 workgroup while having the option of only using enough shared memory for 512 FFT + auxilary accessor
  5. optimizations for FFT real signals
  6. resurrect the nbl::ext and 1 or both of the old FFT examples to test with

They can branch off each other.

P.S. Let me know if you need "runtime" transcendental/special functions for complex_t then we can look into doing up our tgmath.hlsl

Missing compound operators and template specializations for integer
types. Arithmetic logic is off, different from the STL
Comment on lines +15 to +17
template<typename Scalar>
struct complex_t
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont indent extra because of namespace

Comment on lines +9 to +20
#define COMPOUND_ASSIGN(NAME,OP) template<typename T> struct assign_ ## NAME ## _t { \
void operator()(NBL_REF_ARG(T) lhs, NBL_CONST_REF_ARG(T) rhs) \
{ \
lhs = lhs OP rhs; \
} \
}
COMPOUND_ASSIGN(add,+);
COMPOUND_ASSIGN(subtract,-);
COMPOUND_ASSIGN(mul,*);
COMPOUND_ASSIGN(div,/);

#undef COMPOUND_ASSIGN

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add them here

and make sure they have names plus_assign, multiplies_assign to be inline with the rest.

Also because its just 4 struct defs and not specialization, IMHO you shouldn't use macros because that just makes the code iffy to read. Or if you are to use a macro make sure it has the same form as the ALIAS_STD so you have to put th e }; afterwards to be able to add extra fields (i.e. identity)

Comment on lines +28 to +31
static complex_t create(const complex_t other)
{
return other;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no needed, regular assignment will do

Comment on lines +116 to +120
return !(uint64_t(m_real) ^ uint64_t(rhs.m_real)) && !(uint64_t(m_imag) ^ uint64_t(rhs.m_imag));
}
bool operator!=(const complex_t rhs)
{
return uint64_t(m_real) ^ uint64_t(rhs.m_real) || uint64_t(m_imag) ^ uint64_t(rhs.m_imag);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't cast to uint64_t :

  1. you're not comparing bitpatterns, you're casting to uint64_T which truncatest the floats, etc.
  2. Scalar=float32_t would have been 32bit
  3. != and == work on floats, so use them

Comment on lines +125 to +141
// ---------------------- Compound assignment functors ----------------------
template <typename Scalar>
struct ComplexCompoundHelper{
using add = assign_add_t<complex_t<Scalar> >;
using subtract = assign_subtract_t<complex_t<Scalar> >;
using mul = assign_mul_t<complex_t<Scalar> >;
using div = assign_div_t<complex_t<Scalar> >;
};

template <typename Scalar>
using assign_add_complex = typename ComplexCompoundHelper<Scalar>::add;
template <typename Scalar>
using assign_subtract_complex = typename ComplexCompoundHelper<Scalar>::subtract;
template <typename Scalar>
using assign_mul_complex = typename ComplexCompoundHelper<Scalar>::mul;
template <typename Scalar>
using assign_div_complex = typename ComplexCompoundHelper<Scalar>::div;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this, we don't have them for regular plus<T>

{
namespace hlsl
{
namespace fft::common {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. HLSL doesn't support nested namespace syntax
  2. use detail or impl for things you don't want to make public to others

template<typename Scalar, bool inverse>
struct DIX {
//Assumes I have precomputed twiddles, so just pass the index.
static void radix2Butterfly(NBL_CONST_REF_ARG(complex_t<Scalar>) twiddle, NBL_REF_ARG(complex_t<Scalar>) lo, NBL_REF_ARG(complex_t<Scalar>) hi){

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop Butterfly from the name, just radix2 is fine

Comment on lines +17 to +37
template<typename Scalar>
complex_t<Scalar> getHigh(uint32_t highIdx);

template<typename Scalar>
complex_t<Scalar> getMid(uint32_t midIdx);

template<typename Scalar>
complex_t<Scalar> getLow(uint32_t lowIdx);

// Define twiddles and getters for each Scalar template specialization
#define float_t float32_t
#define TYPED_NUMBER(N) NBL_CONCATENATE(N, f) // to add f after floating point numbers and avoid casting warnings and emitting ShaderFloat64 Caps
#include <nbl/builtin/hlsl/fft/common/twiddles_impl.hlsl>
#undef TYPED_NUMBER
#undef float_t

#define float_t float64_t
#define TYPED_NUMBER(N) N
#include <nbl/builtin/hlsl/fft/common/twiddles_impl.hlsl>
#undef TYPED_NUMBER
#undef float_t

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't bother precomputing twiddles, did this benchmark in shadertoy

const float _cos[32] = float[32](
1.0,0.980785,0.92388,0.83147,0.707107,0.55557,0.382683,0.19509,-4.37114e-08,-0.19509,-0.382684,-0.55557,-0.707107,-0.83147,-0.92388,-0.980785,-1.0,-0.980785,-0.92388,-0.83147,-0.707107,-0.55557,-0.382683,-0.19509,1.19249e-08,0.19509,0.382684,0.55557,0.707107,0.83147,0.92388,0.980785
);

void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    // Normalized pixel coordinates (from 0 to 1)
    vec2 uv = fragCoord/iResolution.xy;

    // Time varying pixel color
    vec3 col = vec3(0.0);

    const int spins = 2000;
    for (int i=0; i<spins; i++)
    {
        int iarg = i;//+int(fragCoord.x)+int(fragCoord.y);
        float arg = float(iarg)/float(spins);
        float cs = cos(arg*6.2831853);
        col.rg += vec2(_cos[iarg]);
    }
    // Output to screen
    fragColor = vec4(col,1.0);
}

perf dropped 50%, it gets worse the bigger the array gets

namespace hlsl
{

const float64_t PI = 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make a numbers.hlsl a counterpart to https://en.cppreference.com/w/cpp/numeric/constants

Comment on lines +18 to +22
// Fast Fourier Transform
template<typename T, uint32_t N, bool is_inverse>
void fft_radix2(complex_t<T> data[N])
{
const uint32_t log2N = uint32_t(log2(N));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you make this function?

I explicitly left behind a comment #592 (comment)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this supposed to be a workgroup FFT? a subgroup FFT or what?

Comment on lines +24 to +37
// Bit reversal permutation
for (uint32_t i = 0; i < N; ++i)
{
uint32_t j = 0;
for (uint32_t bit = 0; bit < log2N; ++bit)
j |= ((i >> bit) & 1) << (log2N - 1 - bit);

if (j > i)
{
complex_t<T> temp = data[i];
data[i] = data[j];
data[j] = temp;
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't do the bit reversal for me, its often not needed

Comment on lines +45 to +52
if(is_inverse)
{
twiddle_factor = { cos(2.0 * PI / stride), sin(2.0 * PI / stride) };
}
else
{
twiddle_factor = { cos(-2.0 * PI / stride), sin(-2.0 * PI / stride) };
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the cosine is the same, because its an even function
  2. sine is an odd function

TL;DR its enough for you to do

const float arg = pi_v<T>*2.f;
twiddle_factor = {cos(arg),sin(arg)};
if (is_inverse)
    twiddle_factor.imag = -twiddle_factor.imag; // compiler should optimize to a bitflip

// Cooley-Tukey
assign_mul_complex<T> mul_assign;
assign_div_complex<T> div_assign;
for (uint32_t stride = 2; stride <= N; stride *= 2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do a loop from stride=1 to stride=N/2 instead of 2->N because that's the actual stride between your butterfly elements, what you have right now is a width

@@ -30,6 +30,14 @@ T subgroupShuffleDown(T value, uint32_t delta)
return spirv::groupShuffleDown<T>(spv::ScopeSubgroup, value, delta);
}


template<typename T>
T subgroupShuffleXor(T value, uint32_t delta)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the spec call it a delta? I'd call it a mask

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants