Buffer Device Address in HLSL #690
base: master
Conversation
uint32_t dataElementCount;
uint32_t elementsPerWT;
uint32_t workGroupIndex;
Key minimum;
Key maximum;
it's a little weird to shove non-uniform parameters (ones that change between workgroups) together with uniform ones
sdata.workgroupExecutionAndMemoryBarrier();

sdata.set(tid % GroupSize, sum);
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

sdata.workgroupExecutionAndMemoryBarrier();
argh, crap, I forgot you'd need to store the scan result back to smem to be able to use it afterwards
ok, let's go back to the original simple histogram.atomicAdd(tid, sdata.get(tid));
I forgot the prefix sum is not already in smem, so the extra barriers will just make us slower.
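A CPU-side sketch of the "simple" merge the review falls back to (names like `mergeLocalHistogram` are illustrative, not Nabla's API): each thread folds its workgroup-local bucket count straight into the global histogram with one atomic add, no extra barriers or smem round-trips.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative stand-in for histogram.atomicAdd(tid, sdata.get(tid)):
// tid walks the buckets the way the shader loop does, one add per bucket.
void mergeLocalHistogram(std::atomic<uint32_t>* histogram,
                         const uint32_t* sdata,
                         uint32_t keyBucketCount)
{
    for (uint32_t tid = 0u; tid < keyBucketCount; ++tid)
        histogram[tid].fetch_add(sdata[tid]); // global += workgroup-local count
}
```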
for (; tid < KeyBucketCount; tid += GroupSize) {
    sdata.set(tid, 0);
this permanently leaves your tid changed; did you want to do this?
look, my snippet sometimes spawns a new temp so you leave tid unchanged after you exit the loop
#690 (comment)
}

sdata.workgroupExecutionAndMemoryBarrier();

for (int i = 0; i < data.elementsPerWT; i++)
    uint32_t index = params.workGroupIndex * GroupSize * params.elementsPerWT + tid % GroupSize;
why are you using the % operator? This is literally the most expensive operator on the GPU (unless the rhs is a Power of Two)
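A hypothetical helper illustrating the reviewer's point (not code from the PR): when the divisor is a compile-time power of two, `x % N` reduces to a single bitwise AND, which is what you want instead of an integer divide on the GPU.

```cpp
#include <cstdint>

// fastModPow2<N>(x) == x % N, but as one AND instruction; the static_assert
// rejects any N that is not a power of two.
template<uint32_t N>
uint32_t fastModPow2(uint32_t x)
{
    static_assert(N != 0u && (N & (N - 1u)) == 0u, "N must be a power of two");
    return x & (N - 1u);
}
```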
template<
    uint16_t GroupSize,
    uint16_t KeyBucketCount,
    typename Key,
try adding a
using key_t = decltype(declval<KeyAccessor>().get(0));
and using that throughout instead of typename Key
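A minimal C++ sketch of the suggested deduction (`DemoAccessor` is a stand-in, since the real KeyAccessor lives in the shader): the key type is derived from the accessor's `get()` return type instead of being threaded through as a separate `typename Key` parameter.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Stand-in accessor: its get() return type is what we want to deduce.
struct DemoAccessor
{
    uint32_t get(uint32_t index) const { return index; }
};

template<typename KeyAccessor>
struct CounterDemo
{
    // mirrors the review's suggestion: deduce key_t from the accessor
    using key_t = decltype(std::declval<KeyAccessor>().get(0u));
};
```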
const uint16_t adjusted_key_bucket_count = ((KeyBucketCount - 1) / GroupSize + 1) * GroupSize;

for (tid += GroupSize; tid < adjusted_key_bucket_count; tid += GroupSize)
this is actually one of the times you should use the
const uint32_t RoundedUpBucketsPerThread = (KeyBucketCount-1)/GroupSize+1;
for (uint32_t i=1; i<RoundedUpBucketsPerThread; i++)
formulation (you're making it clear every thread hits the loop in uniform control flow, which inclusive_scan requires)
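A sketch of the rounded-up trip count from the comment above (names taken from the review): every thread executes the same number of iterations, so control flow stays uniform even when KeyBucketCount is not a multiple of GroupSize.

```cpp
#include <cstdint>

// Ceiling division: the per-thread iteration count that covers all
// KeyBucketCount buckets when GroupSize threads each handle one per pass.
uint32_t roundedUpBucketsPerThread(uint32_t keyBucketCount, uint32_t groupSize)
{
    return (keyBucketCount - 1u) / groupSize + 1u; // ceil(keyBucketCount / groupSize)
}
```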
{
    if (is_last_wg_invocation)
    {
        uint32_t startIndex = tid - tid % GroupSize;
I don't get why you calculate the index with %.
Anyway, if you go back to a uniform loop, you can compute it as i*GroupSize
        sdata.set(startIndex, sdata.get(startIndex) + sum);
    }

    sum = inclusive_scan(sdata.get(tid), sdata);
you might do a little bit of "KeyBucketCount not a multiple of GroupSize" handling here with
const uint32_t val = tid < KeyBucketCount ? sdata.get(tid) : 0;
sum = inclusive_scan(val, sdata);
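A CPU sketch of the suggested padding (assumed names, serial stand-in for the workgroup scan): tids past KeyBucketCount feed a neutral 0 into the scan, so the rounded-up thread range never reads out-of-range smem and the prefix sums stay correct.

```cpp
#include <cstdint>
#include <vector>

// Serial model of an inclusive scan over a padded thread range:
// `paddedSize` plays the role of RoundedUpBucketsPerThread * GroupSize.
std::vector<uint32_t> paddedInclusiveScan(const std::vector<uint32_t>& sdata,
                                          uint32_t keyBucketCount,
                                          uint32_t paddedSize)
{
    std::vector<uint32_t> out(paddedSize);
    uint32_t sum = 0u;
    for (uint32_t tid = 0u; tid < paddedSize; ++tid)
    {
        // the guard from the review: out-of-range tids contribute 0
        const uint32_t val = tid < keyBucketCount ? sdata[tid] : 0u;
        sum += val;
        out[tid] = sum;
    }
    return out;
}
```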
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

for (; shifted_tid < KeyBucketCount; shifted_tid += GroupSize)
{
    uint32_t bucket_value = sdata.get(shifted_tid);
    uint32_t exclusive_value = histogram.atomicSub(shifted_tid, bucket_value) - bucket_value;

    sdata.set(shifted_tid, exclusive_value);
}
ok, this still makes sense to do with a toroidal shift in spite of https://github.com/Devsh-Graphics-Programming/Nabla/pull/690/files#r1621277804
but there's a different way to write this:
const uint32_t shift = glsl::gl_SubgroupSize()*params.workGroupIndex;
for (uint32_t vtid=tid; vtid<KeyBucketCount; vtid+=GroupSize)
{
// have to use modulo operator in case `KeyBucketCount<=2*GroupSize`, better hope KeyBucketCount is Power of Two
const uint32_t shifted_tid = (vtid+shift)%KeyBucketCount;
const uint32_t bucket_value = sdata.get(shifted_tid);
const uint32_t firstOutputIndex = histogram.atomicSub(shifted_tid,bucket_value)-bucket_value;
sdata.set(shifted_tid,firstOutputIndex);
}
doing a loop conditioned on shifted_tid would break unrolling during optimization, plus your %GroupSize wraps around differently than a %KeyBucketCount would (you get stuck within the [0,GroupSize) block, not really easing contention).
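A CPU sketch of the scatter-offset trick underlying the snippet above (names assumed): each histogram entry holds a bucket's inclusive end offset, and `fetch_sub` (the CPU analogue of `histogram.atomicSub`) returns the value *before* the subtraction, so subtracting the bucket count once more yields the bucket's first output index, i.e. its exclusive prefix.

```cpp
#include <atomic>
#include <cstdint>

// Models: histogram.atomicSub(shifted_tid, bucket_value) - bucket_value
// fetch_sub returns the pre-subtraction value; subtracting bucketValue
// again converts the inclusive end offset into the exclusive start offset.
uint32_t claimBucketStart(std::atomic<uint32_t>& histogramEntry, uint32_t bucketValue)
{
    return histogramEntry.fetch_sub(bucketValue) - bucketValue;
}
```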
Description
Testing
TODO list: