
Work on property pool HLSL impl #649

Draft · wants to merge 19 commits into master
Conversation

deprilula28 (Contributor)

Description

Implements CPropertyPoolHandler and CPropertyPool in HLSL, using direct buffer addresses (BDA) instead of allocating descriptor sets for buffers.
Notes about impl:

-> Currently uses descriptor pools (needs to allocate every time)
    -> Use BDA and root constants with the addresses instead
-> Device capabilities traits 
    -> Example version: https://github.com/Devsh-Graphics-Programming/Nabla/blob/vulkan_1_3/include/nbl/builtin/hlsl/device_capabilities_traits.hlsl
    -> maxOptimallyResidentWorkgroupInvocations
    -> Can use nbl::hlsl::jit::device_capabilities struct with JIT generated "constexpr" variables for maximally optimal workgroup invocations
    https://github.com/Devsh-Graphics-Programming/Nabla-Examples-and-Tests/blob/master/23_ArithmeticUnitTest/app_resources/shaderCommon.hlsl#L9
    https://github.com/microsoft/DirectXShaderCompiler/issues/6144


=== tasks ===

-> Port https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/glsl/property_pool/copy.comp to HLSL
    -> Persistently resident threads, looping over the work, maximally using the GPU workgroup size
    -> Dispatch: 2D (x: DWORD index within the property, y: property ID)
        -> property ID: which buffer you're touching / analogous to draw ID
            -> indexes into transferData
            -> new version: use null pointer as invalid pointer
    -> transferData: List of copy "commands"
        -> new version: Replaced by push constant with BDA address
    -> addresses: "Index buffer"
        -> invalid pointer: IOTA (analogous to not using an index buffer, use iteration index as the fetching index)
    -> Use shorts (uint16) instead of DWORDs (uint32)
        -> Transfer data struct uses bytes for future proofing
    -> Specialize on:
        -> Whether or not source is a fill
        -> Type of index (uint8, uint16, uint32, uint64)
        -> Src index is IOTA
        -> Dst index is IOTA
    -> Keep optimization for modulos (line 38 & 52)

-> CPU Code
    -> CPropertyPoolHandler
        -> Nuke m_maxPropertiesPerPass, getMaxScratchSize (not relevant with BDA version)
    -> TransferRequest on CPU keeps reference to the buffer and places it in the command buffer for lifetime tracking
        -> Have a custom command that just keeps track of a **variable number** of reference counted objects for preserving lifetimes (LinkedPreservedLifetimes?)
            -> Take a span of IGPUReferenceCounted
            -> Example: https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/IGPUCommandBuffer.cpp#L104C54-L104C54
            -> For variable amount of stuff: https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/IGPUCommandBuffer.cpp#L403C90-L403C90
            -> Signature example: `IGPUCommandBuffer::preserveLifetime(std::span<const core::IReferenceCounted>)`
    -> New transfer property signature
        -> make pipeline barriers more robust (or require everything to be done properly outside the function)
        -> First parameter: SIntendedSubmitInfo (IUtilities-independent submit info struct thing for handling overflows)
            -> Source for it: https://github.com/Devsh-Graphics-Programming/Nabla/blob/vulkan_1_3/include/nbl/video/utilities/SIntendedSubmitInfo.h
            -> Move IUtilities::autoSubmit and IUtilities::autoSubmitAndBlock to SIntendedSubmitInfo as static method (no more relation to IUtilities)
        -> Second parameter: struct with parameters
            `const asset::SBufferBinding<video::IGPUBuffer>& scratch, system::logger_opt_ptr logger, const size_t baseDWORD=0ull, const size_t endDWORD=~0u`
            -> Additional parameters that are optional including additional pipeline barrier values
            bitfield/boolean [pre|post]ScratchBarrier = true
    -> let's keep MaxPropertiesPerDispatch and have it equal to 64kB/sizeof(nbl::hlsl::property_pools::TransferRequest)
        -> instead of copy lambda logic at https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/utilities/CPropertyPoolHandler.cpp#L172, fail if over MaxPropertiesPerDispatch
    -> leave upstreaming thing & contiguous buffers for later (#ifdef 0 it out)
        -> transferProperties with upstreaming & freeProperties
    -> IPropertyPool
        -> allocateProperties: use span instead of begin & end
            -> (behaviour) 
                -> goes through indices to find empty ones and allocate them
                -> if it's contiguous: add mapping from index to addr and addr to index
        -> nuke descriptor set stuff (line 198 -> 211)
        -> validateBlocks: change offset check (https://github.com/Devsh-Graphics-Programming/Nabla/blob/64cbb652e39acf0239a61bcee7fc26d70ab8d089/src/nbl/video/utilities/IPropertyPool.cpp#L38) to BDA
            -> check usages & non null address
    -> CPropertyPool: don't change anything, just make sure the indentation is right

    -> MegaDescriptorSet (Descriptor set sub-allocate)
        -> Have a multi-timeline event functor with IFuture await
        ```cpp
            MultiTimelineEventHandlerST<DeferredFreeFunctor> deferredFrees;
            deferredFrees.latch(futureWait,std::move(functor));
        ```
            -> Also have it on IPropertyPool
            -> Solve synchronization issues

    -> create example testing downloads, uploads of properties
        -> with IB, without IB, fills, etc etc
        -> use regular buffer for everything
        -> later test the streaming buffers (ifdef them back in)

Testing

TODO list:

  • Investigate why data isn't being written correctly
  • Implement address buffer handling
  • Baseline test
  • Test with IOTA
  • Test with fill buffers
  • Test with different element sizes
  • Test with different element counts
  • Test with different transfer amounts

uint64_t endOffset;
};

NBL_CONSTEXPR uint32_t MaxPropertiesPerDispatch = 128;


is there any reason to keep this around anymore?

Comment on lines +43 to +47
// Define the range of invocations (X axis) that will be transfered over in this dispatch
// May be sectioned off in the case of overflow or any other situation that doesn't allow
// for a full transfer
uint64_t beginOffset;
uint64_t endOffset;


would be useful to make it clear we're counting in DWORDs or shorts (if you want to do 16bit transfer atoms instead)

Comment on lines 72 to 97
template<bool Fill, bool SrcIndexIota, bool DstIndexIota>
struct TransferLoopPermutationDstIota
{
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize)
{
if (transferRequest.srcIndexSizeLog2 == 0) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 0> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else if (transferRequest.srcIndexSizeLog2 == 1) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 1> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else if (transferRequest.srcIndexSizeLog2 == 2) { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 2> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else /*if (transferRequest.srcIndexSizeLog2 == 3)*/ { TransferLoopPermutationSrcIndexSizeLog<Fill, SrcIndexIota, DstIndexIota, 3> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
}
};

template<bool Fill, bool SrcIndexIota>
struct TransferLoopPermutationSrcIota
{
void copyLoop(uint baseInvocationIndex, uint propertyId, TransferRequest transferRequest, uint dispatchSize)
{
bool dstIota = transferRequest.dstIndexAddr == 0;
if (dstIota) { TransferLoopPermutationDstIota<Fill, SrcIndexIota, true> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
else { TransferLoopPermutationDstIota<Fill, SrcIndexIota, false> loop; loop.copyLoop(baseInvocationIndex, propertyId, transferRequest, dispatchSize); }
}
};

template<bool Fill>
struct TransferLoopPermutationFill
{


only use structs instead of templated functions when you need partial specialization

deprilula28 (Contributor, Author)


I'm not sure what you mean


The struct functor only makes sense if:

  • you do a partial specialization, i.e. template<typename ArbitraryType> struct MyFunctor<true,ArbitraryType> because functions can be only fully specialized
  • you need to pass the functor as a template arg/lambda because HLSL202x doesn't allow function pointers/references or you want a stateful functor
template<typename Accessor, typename Compare>
uint32_t find_first(inout Accessor accessor, const Compare comparator);

if neither of the above applies, just use a templated function

deprilula28 (Contributor, Author)


ah ok, I think your original comment was the wrong way around


ah ok, I think your original comment was the wrong way around

"only use structs instead of templated functions when you need partial specialization"

  1. you don't need partial specialization
  2. you're using structs
  3. but you only use structs when you have partial specialization
  4. don't use structs here.

Comment on lines -24 to +29
static inline constexpr auto invalid = PropertyAddressAllocator::invalid_address;

static inline constexpr uint64_t invalid = 0;


when did we agree to change the invalid from the AddressAllocator invalid (0xdeadbeefu) to 0?

Comment on lines +12 to +26
// TODO: Reuse asset manager from elsewhere?
auto assetManager = core::make_smart_refctd_ptr<asset::IAssetManager>(core::smart_refctd_ptr<system::ISystem>(system));

auto loadShader = [&](const char* path)
{
asset::IAssetLoader::SAssetLoadParams params = {};
auto assetBundle = assetManager->getAsset(path, params);
auto assets = assetBundle.getContents();
assert(!assets.empty());

auto cpuShader = asset::IAsset::castDown<asset::ICPUShader>(assets[0]);
auto shader = m_device->createShader(cpuShader.get());
return shader;
};
auto shader = loadShader("../../../include/nbl/builtin/hlsl/property_pool/copy.comp.hlsl");


I'd rather use system->openFile with the nbl/builtin/hlsl/property_pool/copy.comp.hlsl path directly to open a memory-mapped file and create the IGPUShader from it directly


look at the old loadBuiltinData code

Comment on lines 133 to 134
transferRequest.srcIndexAddr = srcRequest->srcAddressesOffset ? addressBufferDeviceAddr + srcRequest->srcAddressesOffset : 0;
transferRequest.dstIndexAddr = srcRequest->dstAddressesOffset ? addressBufferDeviceAddr + srcRequest->dstAddressesOffset : 0;


I think I told you explicitly to get rid of the addresses buffer, and just have each transferRequest supply the correctly offsetted BDA or SBufferBinding instead


given how you're doing srcAddr and dstAddr, the SBufferBinding route is preferable

Comment on lines +107 to +112
uint32_t maxScratchSize = MaxPropertiesPerDispatch * sizeof(nbl::hlsl::property_pools::TransferRequest);
if (scratch.offset + maxScratchSize > scratch.buffer->getSize())
logger.log("CPropertyPoolHandler: The scratch buffer binding provided might not be big enough in the worst case! (Scratch buffer size: %i Max scratch size: %i)",
system::ILogger::ELL_WARNING,
scratch.buffer->getSize() - scratch.offset,
maxScratchSize);


I'd validate better. Also, right now your code will choke/crash on a scratch buffer that's too small, since you're still using MaxPropertiesPerDispatch to figure out the numberOfPasses

Comment on lines 152 to 153
pushConstants.beginOffset = baseDWORD;
pushConstants.endOffset = endDWORD;


this is why naming your variables matters, Offset == ByteOffset, but we're clearly counting in DWORDs?

m_indexToAddr = reinterpret_cast<uint32_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity));
m_indexToAddr = reinterpret_cast<uint64_t*>(reinterpret_cast<uint8_t*>(reserved)+getReservedSize(capacity));


you can't touch this up without making the reserved mem bigger!

Comment on lines 157 to 158
// TODO: instead use some sort of replace function for getting optimal size?
[numthreads(512,1,1)]


I already wrote on Discord: codegen a 5-line compute shader with main in the C++ and leave copy.hlsl stage-agnostic

Comment on lines +118 to +121
transferRequest.srcAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr,8);
transferRequest.dstAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t),8);
transferRequest.srcIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 2,8);
transferRequest.dstIndexAddr = vk::RawBufferLoad<uint64_t>(transferCmdAddr + sizeof(uint64_t) * 3,8);


make a wrapper for vk::RawBufferLoad and Store in nbl::hlsl::legacy which uses our type traits to default the alignment

Comment on lines +108 to +111
// Loading transfer request from the pointer (can't use struct
// with BDA on HLSL SPIRV)
static TransferRequest TransferRequest::newFromAddress(const uint64_t transferCmdAddr)
{


keep it with the struct

deprilula28 (Contributor, Author)


The struct is shared with C++ code, so I wouldn't be able to use vk::RawBufferLoad; I could take the 64-bit value though


you can use #ifndef __HLSL_VERSION in the impl of the method

Comment on lines +55 to +56
// TODO: instead use some sort of replace function for getting optimal size?
NBL_CONSTEXPR uint32_t OptimalDispatchSize = 256;


you can use the device JIT to query the max compute dispatch size; I'd round it down to the nearest PoT though, so the divisions aren't expensive

Base automatically changed from vulkan_1_3 to master March 11, 2024 17:24