
Work on property pool HLSL impl #649

Draft · wants to merge 19 commits into master
Conversation

deprilula28 (Contributor)

Description

Implements CPropertyPoolHandler and CPropertyPool in HLSL, using direct buffer addresses (BDA) instead of allocating descriptor sets for buffers.
Notes about impl:

-> Currently uses descriptor pools (needs to allocate every time)
    -> Use BDA and root constants with the addresses instead
-> Device capabilities traits 
    -> Example version: https://github.com/Devsh-Graphics-Programming/Nabla/blob/vulkan_1_3/include/nbl/builtin/hlsl/device_capabilities_traits.hlsl
    -> maxOptimallyResidentWorkgroupInvocations
    -> Can use nbl::hlsl::jit::device_capabilities struct with JIT generated "constexpr" variables for maximally optimal workgroup invocations
    https://github.com/Devsh-Graphics-Programming/Nabla-Examples-and-Tests/blob/master/23_ArithmeticUnitTest/app_resources/shaderCommon.hlsl#L9
    https://github.com/microsoft/DirectXShaderCompiler/issues/6144


=== tasks ===

-> Port https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/glsl/property_pool/copy.comp to HLSL
    -> Persistently resident threads, looping over the work, maximally using the GPU workgroup size
    -> Dispatch: 2D (x: DWORD index within the property, y: property ID)
        -> property ID: which buffer you're touching / analogous to draw ID
            -> indexes into transferData
            -> new version: use null pointer as invalid pointer
    -> transferData: List of copy "commands"
        -> new version: Replaced by push constant with BDA address
    -> addresses: "Index buffer"
        -> invalid pointer: IOTA (analogous to not using an index buffer, use iteration index as the fetching index)
    -> Use shorts (uint16) instead of DWORDs (uint32)
        -> Transfer data struct uses bytes for future proofing
    -> Specialize on:
        -> Whether or not source is a fill
        -> Type of index (uint8, uint16, uint32, uint64)
        -> Src index is IOTA
        -> Dst index is IOTA
    -> Keep optimization for modulos (line 38 & 52)

-> CPU Code
    -> CPropertyPoolHandler
        -> Nuke m_maxPropertiesPerPass, getMaxScratchSize (not relevant with BDA version)
    -> TransferRequest on CPU keeps reference to the buffer and places it in the command buffer for lifetime tracking
        -> Have a custom command that just keeps track of a **variable number** of reference counted objects for preserving lifetimes (LinkedPreservedLifetimes?)
            -> Take a span of IGPUReferenceCounted
            -> Example: https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/IGPUCommandBuffer.cpp#L104C54-L104C54
            -> For variable amount of stuff: https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/IGPUCommandBuffer.cpp#L403C90-L403C90
            -> Signature example: `IGPUCommandBuffer::preserveLifetime(std::span<const core::IReferenceCounted>)`
    -> New transfer property signature
        -> make pipeline barriers more robust (or require everything to be done properly outside the function)
        -> First parameter: SIntendedSubmitInfo (IUtilities-independent submit info struct thing for handling overflows)
            -> Source for it: https://github.com/Devsh-Graphics-Programming/Nabla/blob/vulkan_1_3/include/nbl/video/utilities/SIntendedSubmitInfo.h
            -> Move IUtilities::autoSubmit and IUtilities::autoSubmitAndBlock to SIntendedSubmitInfo as static method (no more relation to IUtilities)
        -> Second parameter: struct with parameters
            `const asset::SBufferBinding<video::IGPUBuffer>& scratch, system::logger_opt_ptr logger, const size_t baseDWORD=0ull, const size_t endDWORD=~0u`
            -> Additional parameters that are optional including additional pipeline barrier values
            bitfield/boolean [pre|post]ScratchBarrier = true
    -> let's keep MaxPropertiesPerDispatch and have it equal to 64kB/sizeof(nbl::hlsl::property_pools::TransferRequest)
        -> instead of copy lambda logic at https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/utilities/CPropertyPoolHandler.cpp#L172, fail if over MaxPropertiesPerDispatch
    -> leave upstreaming thing & contiguous buffers for later (#ifdef 0 it out)
        -> transferProperties with upstreaming & freeProperties
    -> IPropertyPool
        -> allocateProperties: use span instead of begin & end
            -> (behaviour) 
                -> goes through indices to find empty ones and allocate them
                -> if it's contiguous: add mapping from index to addr and addr to index
        -> nuke descriptor set stuff (line 198 -> 211)
        -> validateBlocks: change offset check (https://github.com/Devsh-Graphics-Programming/Nabla/blob/64cbb652e39acf0239a61bcee7fc26d70ab8d089/src/nbl/video/utilities/IPropertyPool.cpp#L38) to BDA
            -> check usages & non null address
    -> CPropertyPool: don't change anything, just make sure the indentation is right

    -> MegaDescriptorSet (Descriptor set sub-allocate)
        -> Have a multi-timeline event functor with IFuture await
        ```cpp
            MultiTimelineEventHandlerST<DeferredFreeFunctor> deferredFrees;
            deferredFrees.latch(futureWait,std::move(functor));
        ```
            -> Also have it on IPropertyPool
            -> Solve synchronization issues

    -> create example testing downloads, uploads of properties
        -> with IB, without IB, fills, etc etc
        -> use regular buffer for everything
        -> later test the streaming buffers (ifdef them back in)

Testing

TODO list:

  • Investigate why data isn't being written correctly
  • Implement address buffer handling
  • Baseline test
  • Test with IOTA
  • Test with fill buffers
  • Test with different element sizes
  • Test with different element counts
  • Test with different transfer amounts

uint64_t endOffset;
};

NBL_CONSTEXPR uint32_t MaxPropertiesPerDispatch = 128;


is there any reason to keep this around anymore?

Comment on lines +43 to +47
// Define the range of invocations (X axis) that will be transfered over in this dispatch
// May be sectioned off in the case of overflow or any other situation that doesn't allow
// for a full transfer
uint64_t beginOffset;
uint64_t endOffset;


would be useful to make it clear we're counting in DWORDs or shorts (if you want to do 16bit transfer atoms instead)

Comment on lines 72 to 97
template<bool Fill, bool SrcIndexIota, bool DstIndexIota>
struct TransferLoopPermutationDstIota
{
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize)
{
if (transferRequest.srcIndexSizeLog2 == 0) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 0> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else if (transferRequest.srcIndexSizeLog2 == 1) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 1> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else if (transferRequest.srcIndexSizeLog2 == 2) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 2> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else /*if (transferRequest.srcIndexSizeLog2 == 3)*/ { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 3> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
}
};

template<bool Fill, bool SrcIndexIota>
struct TransferLoopPermutationSrcIota
{
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize)
{
bool dstIota = transferRequest.dstIndexAddr == 0;
if (dstIota) { TransferLoopPermutationDstIota<Fill, SrcIndexIota, true> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else { TransferLoopPermutationDstIota<Fill, SrcIndexIota, false> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
}
};

template<bool Fill>
struct TransferLoopPermutationFill
{


only use structs instead of templated functions when you need partial specialization

deprilula28 (Contributor, Author)


I'm not sure what you mean


The struct functor only makes sense if:

  • you do a partial specialization, i.e. template<typename ArbitraryType> struct MyFunctor<true,ArbitraryType> because functions can be only fully specialized
  • you need to pass the functor as a template arg/lambda because HLSL202x doesn't allow function pointers/references or you want a stateful functor
template<typename Accessor, typename Compare>
uint32_t find_first(inout Accessor accessor, const Compare comparator);

if neither of the above applies, just use a templated function

deprilula28 (Contributor, Author)


ah ok, I think your original comment was the wrong way around


ah ok, I think your original comment was the wrong way around

"only use structs instead of templated functions when you need partial specialization"

  1. you don't need partial specialization
  2. you're using structs
  3. but you only use structs when you have partial specialization
  4. don't use structs here.

Comment on lines -24 to +29
static inline constexpr auto invalid = PropertyAddressAllocator::invalid_address;

static inline constexpr uint64_t invalid = 0;


when did we agree to change the invalid from the AddressAllocator invalid (0xdeadbeefu) to 0?

Comment on lines +12 to +26
// TODO: Reuse asset manager from elsewhere?
auto assetManager = core::make_smart_refctd_ptr<asset::IAssetManager>(core::smart_refctd_ptr<system::ISystem>(system));

auto loadShader = [&](const char* path)
{
asset::IAssetLoader::SAssetLoadParams params = {};
auto assetBundle = assetManager->getAsset(path, params);
auto assets = assetBundle.getContents();
assert(!assets.empty());

auto cpuShader = asset::IAsset::castDown<asset::ICPUShader>(assets[0]);
auto shader = m_device->createShader(cpuShader.get());
return shader;
};
auto shader = loadShader("../../../include/nbl/builtin/hlsl/property_pool/copy.comp.hlsl");


I'd rather use system->openFile with the nbl/builtin/hlsl/property_pool/copy.comp.hlsl path directly to open a memory-mapped file and create the IGPUShader from it directly


look at the old loadBuiltinData code

Comment on lines 133 to 134
transferRequest.srcIndexAddr = srcRequest->srcAddressesOffset ? addressBufferDeviceAddr + srcRequest->srcAddressesOffset : 0;
transferRequest.dstIndexAddr = srcRequest->dstAddressesOffset ? addressBufferDeviceAddr + srcRequest->dstAddressesOffset : 0;


I think I told you explicitly to get rid of the addresses buffer, and just have each transferRequest supply the correctly offsetted BDA or SBufferBinding instead


given how you're doing srcAddr and dstAddr, the SBufferBinding route is preferable

Comment on lines +107 to +112
uint32_t maxScratchSize = MaxPropertiesPerDispatch * sizeof(nbl::hlsl::property_pools::TransferRequest);
if (scratch.offset + maxScratchSize > scratch.buffer->getSize())
logger.log("CPropertyPoolHandler: The scratch buffer binding provided might not be big enough in the worst case! (Scratch buffer size: %i Max scratch size: %i)",
system::ILogger::ELL_WARNING,
scratch.buffer->getSize() - scratch.offset,
maxScratchSize);


I'd validate better. Also, right now your code will choke/crash on a scratch buffer that's too small, since you're still using MaxPropertiesPerDispatch to figure out the numberOfPasses

Comment on lines 152 to 153
pushConstants.beginOffset = baseDWORD;
pushConstants.endOffset = endDWORD;


this is why naming your variables matters, Offset == ByteOffset, but we're clearly counting in DWORDs?

m_indexToAddr = reinterpret_cast<uint32_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity));
m_indexToAddr = reinterpret_cast<uint64_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity));


you can't touch this up without making the reserved mem bigger!

Comment on lines 157 to 158
// TODO: instead use some sort of replace function for getting optimal size?
[numthreads(512,1,1)]


I already wrote on Discord: codegen a 5-line compute shader with main in the C++ and leave copy.hlsl stage-agnostic

Comment on lines +118 to +121
transferRequest.srcAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr,8);
transferRequest.dstAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t),8);
transferRequest.srcIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 2,8);
transferRequest.dstIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 3,8);


make a wrapper for vk::RawBufferLoad and Store in nbl::hlsl::legacy which uses our type traits to default the alignment

Comment on lines +108 to +111
// Loading transfer request from the pointer (can't use struct
// with BDA on HLSL SPIRV)
static TransferRequest TransferRequest::newFromAddress(const uint64_t transferCmdAddr)
{


keep it with the struct

deprilula28 (Contributor, Author)


The struct is shared with C++ code, so I wouldn't be able to use vk::RawBufferLoad; I could take the 64-bit value though


you can use #ifndef __HLSL_VERSION in the impl of the method

Comment on lines +55 to +56
// TODO: instead use some sort of replace function for getting optimal size?
NBL_CONSTEXPR uint32_t OptimalDispatchSize = 256;


you can use the device JIT to query the max compute dispatch size; I'd round it down to the nearest PoT though, so the divisions aren't expensive

Base automatically changed from vulkan_1_3 to master March 11, 2024 17:24