Buffer Device Address in HLSL #690
base: master
Conversation
uint32_t dataElementCount;
uint32_t elementsPerWT;
uint32_t workGroupIndex;
Key minimum;
Key maximum;
it's a little weird to shove non-uniform parameters (ones that change between workgroups) together with uniform ones
sdata.workgroupExecutionAndMemoryBarrier();

sdata.set(tid % GroupSize, sum);
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

sdata.workgroupExecutionAndMemoryBarrier();
argh, crap, I forgot you'd need to store the scan result back to smem to be able to use it afterwards
ok, let's go back to the original simple histogram.atomicAdd(tid, sdata.get(tid));
I forgot the prefix sum is not already in smem, so the extra barriers will just make us slower.
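A CPU-side sketch of the "simple" merge the review falls back to (names like `mergeLocalHistogram` are illustrative, not Nabla's API): each thread folds its workgroup-local bucket count straight into the global histogram with one atomic add, no extra barriers or smem round-trips.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative stand-in for histogram.atomicAdd(tid, sdata.get(tid)):
// tid walks the buckets the way the shader loop does, one add per bucket.
void mergeLocalHistogram(std::atomic<uint32_t>* histogram,
                         const uint32_t* sdata,
                         uint32_t keyBucketCount)
{
    for (uint32_t tid = 0u; tid < keyBucketCount; ++tid)
        histogram[tid].fetch_add(sdata[tid]); // global += workgroup-local count
}
```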
for (; tid < KeyBucketCount; tid += GroupSize) {
    sdata.set(tid, 0);
this permanently leaves your tid changed; did you want to do this?
look, my snippet sometimes spawns a new temp so you leave tid unchanged after you exit the loop
#690 (comment)
}

sdata.workgroupExecutionAndMemoryBarrier();

for (int i = 0; i < data.elementsPerWT; i++)
    uint32_t index = params.workGroupIndex * GroupSize * params.elementsPerWT + tid % GroupSize;
why are you using the % operator? This is literally the most expensive operator on the GPU (unless the rhs is a Power of Two)
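A hypothetical helper illustrating the reviewer's point (not code from the PR): when the divisor is a compile-time power of two, `x % N` reduces to a single bitwise AND, which is what you want instead of an integer divide on the GPU.

```cpp
#include <cstdint>

// fastModPow2<N>(x) == x % N, but as one AND instruction; the static_assert
// rejects any N that is not a power of two.
template<uint32_t N>
uint32_t fastModPow2(uint32_t x)
{
    static_assert(N != 0u && (N & (N - 1u)) == 0u, "N must be a power of two");
    return x & (N - 1u);
}
```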
template<
    uint16_t GroupSize,
    uint16_t KeyBucketCount,
    typename Key,
try adding a
using key_t = decltype(declval<KeyAccessor>().get(0));
and using that throughout instead of typename Key
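A minimal C++ sketch of the suggested deduction (`DemoAccessor` is a stand-in, since the real KeyAccessor lives in the shader): the key type is derived from the accessor's `get()` return type instead of being threaded through as a separate `typename Key` parameter.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Stand-in accessor: its get() return type is what we want to deduce.
struct DemoAccessor
{
    uint32_t get(uint32_t index) const { return index; }
};

template<typename KeyAccessor>
struct CounterDemo
{
    // mirrors the review's suggestion: deduce key_t from the accessor
    using key_t = decltype(std::declval<KeyAccessor>().get(0u));
};
```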
const uint16_t adjusted_key_bucket_count = ((KeyBucketCount - 1) / GroupSize + 1) * GroupSize;

for (tid += GroupSize; tid < adjusted_key_bucket_count; tid += GroupSize)
this is actually one of the times you should use the
const uint32_t RoundedUpBucketsPerThread = (KeyBucketCount-1)/GroupSize+1;
for (uint32_t i=1; i<RoundedUpBucketsPerThread; i++)
formulation (you're making it clear every thread hits the loop in uniform control flow, which inclusive_scan requires)
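A sketch of the rounded-up trip count from the comment above (names taken from the review): every thread executes the same number of iterations, so control flow stays uniform even when KeyBucketCount is not a multiple of GroupSize.

```cpp
#include <cstdint>

// Ceiling division: the per-thread iteration count that covers all
// KeyBucketCount buckets when GroupSize threads each handle one per pass.
uint32_t roundedUpBucketsPerThread(uint32_t keyBucketCount, uint32_t groupSize)
{
    return (keyBucketCount - 1u) / groupSize + 1u; // ceil(keyBucketCount / groupSize)
}
```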
{
    if (is_last_wg_invocation)
    {
        uint32_t startIndex = tid - tid % GroupSize;
I don't get why you calculate the index with %.
Anyway, if you go back to a uniform loop, you can compute it as i*GroupSize
        sdata.set(startIndex, sdata.get(startIndex) + sum);
    }

    sum = inclusive_scan(sdata.get(tid), sdata);
you might do a little bit of "KeyBucketCount not a multiple of GroupSize" handling here with
const uint32_t val = tid < KeyBucketCount ? sdata.get(tid) : 0;
sum = inclusive_scan(val, sdata);
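A CPU sketch of the suggested padding (assumed names, serial stand-in for the workgroup scan): tids past KeyBucketCount feed a neutral 0 into the scan, so the rounded-up thread range never reads out-of-range smem and the prefix sums stay correct.

```cpp
#include <cstdint>
#include <vector>

// Serial model of an inclusive scan over a padded thread range:
// `paddedSize` plays the role of RoundedUpBucketsPerThread * GroupSize.
std::vector<uint32_t> paddedInclusiveScan(const std::vector<uint32_t>& sdata,
                                          uint32_t keyBucketCount,
                                          uint32_t paddedSize)
{
    std::vector<uint32_t> out(paddedSize);
    uint32_t sum = 0u;
    for (uint32_t tid = 0u; tid < paddedSize; ++tid)
    {
        // the guard from the review: out-of-range tids contribute 0
        const uint32_t val = tid < keyBucketCount ? sdata[tid] : 0u;
        sum += val;
        out[tid] = sum;
    }
    return out;
}
```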
uint32_t shifted_tid = (tid + glsl::gl_SubgroupSize() * params.workGroupIndex) % GroupSize;

for (; shifted_tid < KeyBucketCount; shifted_tid += GroupSize)
{
    uint32_t bucket_value = sdata.get(shifted_tid);
    uint32_t exclusive_value = histogram.atomicSub(shifted_tid, bucket_value) - bucket_value;

    sdata.set(shifted_tid, exclusive_value);
}
ok, this still makes sense to do with a toroidal shift in spite of https://github.com/Devsh-Graphics-Programming/Nabla/pull/690/files#r1621277804
but there's a different way to write this:
const uint32_t shift = glsl::gl_SubgroupSize()*params.workGroupIndex;
for (uint32_t vtid=tid; vtid<KeyBucketCount; vtid+=GroupSize)
{
// have to use modulo operator in case `KeyBucketCount<=2*GroupSize`, better hope KeyBucketCount is Power of Two
const uint32_t shifted_tid = (vtid+shift)%KeyBucketCount;
const uint32_t bucket_value = sdata.get(shifted_tid);
const uint32_t firstOutputIndex = histogram.atomicSub(shifted_tid,bucket_value)-bucket_value;
sdata.set(shifted_tid,firstOutputIndex);
}
doing a loop conditioned on shifted_tid would break unrolling during optimization, plus your %GroupSize wraps around differently than a %KeyBucketCount would (you get stuck within the [0,GroupSize) block, not really easing contention).
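A CPU sketch of the scatter-offset trick underlying the snippet above (names assumed): each histogram entry holds a bucket's inclusive end offset, and `fetch_sub` (the CPU analogue of `histogram.atomicSub`) returns the value *before* the subtraction, so subtracting the bucket count once more yields the bucket's first output index, i.e. its exclusive prefix.

```cpp
#include <atomic>
#include <cstdint>

// Models: histogram.atomicSub(shifted_tid, bucket_value) - bucket_value
// fetch_sub returns the pre-subtraction value; subtracting bucketValue
// again converts the inclusive end offset into the exclusive start offset.
uint32_t claimBucketStart(std::atomic<uint32_t>& histogramEntry, uint32_t bucketValue)
{
    return histogramEntry.fetch_sub(bucketValue) - bucketValue;
}
```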
Description
Testing
TODO list: