Buffer Device Address in HLSL #690

Open

wants to merge 71 commits into base: master
Conversation

devshgraphicsprogramming
Member

Description

Testing

TODO list:

Comment on lines 20 to 24
uint32_t dataElementCount;
uint32_t elementsPerWT;
uint32_t workGroupIndex;
Key minimum;
Key maximum;
Member Author

it's a little weird to shove the non-uniform parameters (the ones that change between workgroups) in together with the uniform ones
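For illustration, a minimal sketch of the split being suggested; the struct names here are hypothetical, only the fields come from the snippet above:

struct UniformParams // same for every workgroup in the dispatch
{
    uint32_t dataElementCount;
    uint32_t elementsPerWT;
    Key minimum;
    Key maximum;
};

struct PerWorkgroupParams // changes between workgroups
{
    uint32_t workGroupIndex;
};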

Comment on lines +38 to +43
sdata.workgroupExecutionAndMemoryBarrier();

sdata.set(tid % GroupSize, sum);
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

sdata.workgroupExecutionAndMemoryBarrier();
Member Author

argh, crap, I forgot you'd need to store the scan result back to smem to be able to…

Member Author

ok, let's go back to the original simple histogram.atomicAdd(tid, sdata.get(tid));

I forgot the prefix sum is not already in smem, so the extra barriers will just make us slower.
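For context, a minimal sketch of what going back to that looks like; the barrier placement is an assumption, the accessors behave as in the snippet above:

// one barrier after the local histogram is finalized in smem,
// then a single global atomicAdd per bucket, no extra smem round-trip
sdata.workgroupExecutionAndMemoryBarrier();
histogram.atomicAdd(tid, sdata.get(tid));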

Comment on lines +52 to +53
for (; tid < KeyBucketCount; tid += GroupSize) {
sdata.set(tid, 0);
Member Author

this permanently leaves your tid changed; did you want to do that?

Member Author

look, my snippet spawns a new temp, so you leave tid unchanged after you exit the loop
#690 (comment)
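A minimal sketch of that temp-variable formulation (the vtid name is illustrative):

// iterate over a copy so tid still holds its original value afterwards
for (uint32_t vtid = tid; vtid < KeyBucketCount; vtid += GroupSize)
    sdata.set(vtid, 0);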

}

sdata.workgroupExecutionAndMemoryBarrier();

for (int i = 0; i < data.elementsPerWT; i++)
uint32_t index = params.workGroupIndex * GroupSize * params.elementsPerWT + tid % GroupSize;
Member Author

why are you using the % operator? It is literally the most expensive operator on the GPU (unless the rhs is a power of two)
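A minimal sketch of the usual workarounds; which one applies depends on the actual value ranges, so treat these as assumption-laden illustrations:

// if GroupSize is a compile-time power of two, % lowers to a cheap bitmask
const uint32_t lane = tid & (GroupSize - 1);

// if tid is known to stay below 2*GroupSize, a compare+subtract
// avoids the integer division that a general % lowers to
const uint32_t lane2 = tid < GroupSize ? tid : tid - GroupSize;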

template<
uint16_t GroupSize,
uint16_t KeyBucketCount,
typename Key,
Member Author

try adding a

using key_t = decltype(declval<KeyAccessor>().get(0));

and using that throughout instead of typename Key
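A minimal sketch of what that could look like; the KeyAccessor parameter, the struct, and its name are illustrative, with declval used as in the comment:

template<uint16_t GroupSize, uint16_t KeyBucketCount, typename KeyAccessor>
struct counting_sort_params // hypothetical name
{
    // deduce the key type from the accessor instead of passing it separately
    using key_t = decltype(declval<KeyAccessor>().get(0));

    key_t minimum;
    key_t maximum;
};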

Comment on lines +85 to +87
const uint16_t adjusted_key_bucket_count = ((KeyBucketCount - 1) / GroupSize + 1) * GroupSize;

for (tid += GroupSize; tid < adjusted_key_bucket_count; tid += GroupSize)
Member Author

this is actually the one of the times you should use

const uint32_t RoundedUpBucketsPerThread = (KeyBucketCount-1)/GroupSize+1;
for (uint32_t i=1; i<RoundedUpBucketsPerThread; i++)

formulation (you're making it clear every thread hits the loop in uniform control flow, which inclusive_scan requires)
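Putting the pieces together, a minimal sketch of the uniform-loop formulation; the vtid name and the guarded read are illustrative, the guard echoing the suggestion further down:

const uint32_t RoundedUpBucketsPerThread = (KeyBucketCount - 1) / GroupSize + 1;
for (uint32_t i = 1; i < RoundedUpBucketsPerThread; i++)
{
    // every thread runs the same trip count, so control flow stays
    // uniform for the inclusive_scan inside the body
    const uint32_t vtid = i * GroupSize + tid;
    // threads past the last bucket contribute 0 but still participate
    const uint32_t val = vtid < KeyBucketCount ? sdata.get(vtid) : 0;
    sum = inclusive_scan(val, sdata);
}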

{
if (is_last_wg_invocation)
{
uint32_t startIndex = tid - tid % GroupSize;
Member Author

I don't get why you calculate the index with %.

Anyway, if you go back to a uniform loop, you can compute it as i*GroupSize.
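A minimal sketch under the uniform-loop formulation above; the surrounding condition comes from the diff:

if (is_last_wg_invocation)
{
    // i*GroupSize is the first bucket of this iteration's block, no modulo needed
    const uint32_t startIndex = i * GroupSize;
    sdata.set(startIndex, sdata.get(startIndex) + sum);
}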

sdata.set(startIndex, sdata.get(startIndex) + sum);
}

sum = inclusive_scan(sdata.get(tid), sdata);
Member Author

you might do a little bit of "KeyBucketCount not a multiple of GroupSize" handling here with

const uint32_t val = tid<KeyBucketCount ? sdata.get(tid):0;
sum = inclusive_scan(val,sdata);

Comment on lines +105 to +113
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

for (; shifted_tid < KeyBucketCount; shifted_tid += GroupSize)
{
uint32_t bucket_value = sdata.get(shifted_tid);
uint32_t exclusive_value = histogram.atomicSub(shifted_tid, bucket_value) - bucket_value;

sdata.set(shifted_tid, exclusive_value);
}
Member Author

ok this still makes sense to do with a toroidal shift in spite of https://github.com/Devsh-Graphics-Programming/Nabla/pull/690/files#r1621277804

but there's a different way to write this

const uint32_t shift = glsl::gl_SubgroupSize()*params.workGroupIndex;
for (uint32_t vtid=tid; vtid<KeyBucketCount; vtid+=GroupSize)
{
   // have to use modulo operator in case `KeyBucketCount<=2*GroupSize`, better hope KeyBucketCount is Power of Two
   const uint32_t shifted_tid = (vtid+shift)%KeyBucketCount;

   const uint32_t bucket_value = sdata.get(shifted_tid);
   const uint32_t firstOutputIndex = histogram.atomicSub(shifted_tid,bucket_value)-bucket_value;
   sdata.set(shifted_tid,firstOutputIndex);
}

doing a loop conditioned on shifted_tid would break unrolling during optimization; plus, your %GroupSize wraps around differently than a %KeyBucketCount would (you get stuck within the [0,GroupSize) block, not really easing contention).
