ArgMin and ArgMax reductions #574

Open · 4 tasks
neworderofjamie opened this issue Mar 2, 2023 · 0 comments

neworderofjamie (Contributor) commented Mar 2, 2023

@FabianSchubert and @tnowotny - our discussion got me thinking: these operations would be pretty simple to implement.

Batch reduction

Max reduction currently generates code something like (thread per synapse/neuron):

scalar lrMax = SCALAR_MIN;
for(unsigned int batch = 0; batch < 512; batch++) {
    const unsigned int batchOffset = size * batch;
    const scalar lValue = group->Value[batchOffset + lid];

    lrMax = fmax(lrMax, lValue);
}
group->Max[lid] = lrMax;
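The generated kernel is CUDA, but the loop's semantics are easy to mirror on the host. Here is a minimal C sketch of the same batch max reduction for one "thread" `lid` (the names `value`, `size`, `batches` and `lid` are placeholders for illustration, not GeNN's generated identifiers):

```c
#include <assert.h>
#include <float.h>

/* Host-side mirror of the batch max loop: for one "thread" lid, scan
   across batches at stride `size` and fold with a max. -FLT_MAX stands
   in for SCALAR_MIN as the identity of the reduction. */
static float batch_max(const float *value, unsigned int size,
                       unsigned int batches, unsigned int lid)
{
    float max = -FLT_MAX;
    for(unsigned int batch = 0; batch < batches; batch++) {
        const float v = value[(size * batch) + lid];
        max = (v > max) ? v : max;
    }
    return max;
}
```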

Argmax would simply look like:

unsigned int lrArgMax = 0;
scalar lrMax = SCALAR_MIN;
for(unsigned int batch = 0; batch < 512; batch++) {
    const unsigned int batchOffset = size * batch;
    const scalar lValue = group->Value[batchOffset + lid];
    
    if(lValue > lrMax) {
        lrMax = lValue;
        lrArgMax = batch;
    }
}
group->ArgMax[lid] = lrArgMax;
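The argmax variant can be checked on the host the same way; this C sketch records the batch index alongside the running max (again, placeholder names rather than GeNN's generated ones):

```c
#include <assert.h>
#include <float.h>

/* Host-side sketch of the batch argmax loop: for one "thread" lid,
   scan across batches and record the batch holding the maximum value.
   On ties, the first (lowest-index) batch wins because the comparison
   is strictly greater-than. */
static unsigned int batch_argmax(const float *value, unsigned int size,
                                 unsigned int batches, unsigned int lid)
{
    unsigned int argMax = 0;
    float max = -FLT_MAX;   /* plays the role of SCALAR_MIN */
    for(unsigned int batch = 0; batch < batches; batch++) {
        const float v = value[(size * batch) + lid];
        if(v > max) {
            max = v;
            argMax = batch;
        }
    }
    return argMax;
}
```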

Neuron reductions

These are a little gnarlier but max currently looks like (32-thread warp per batch):

scalar lrMax = SCALAR_MIN;
for(unsigned int idx = lane; idx < group->size; idx += 32) {
    const scalar lVal = group->Val[batchOffset + idx];
    lrMax = fmax(lrMax, lVal);
}
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 16));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 8));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 4));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 2));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 1));
if(lane == 0) {
    group->Max[batch] = lrMax;
}
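The five `__shfl_down_sync` calls form a tree reduction: at each step a lane folds in the value held `offset` lanes above it, and since an out-of-range shuffle returns the caller's own value, lane i effectively ends up with the max over lanes i..31. The fold can be sketched on the host by treating the warp as a 32-element array (illustrative simulation only, not GeNN code):

```c
#include <assert.h>

/* Simulate the warp-level fmax tree reduction. At each step, lane i
   combines its value with the value `offset` lanes higher, exactly as
   __shfl_down_sync(0xFFFFFFFF, x, offset) would provide it; lanes whose
   source would be out of range keep their own value, matching the
   hardware behavior. Updating in ascending lane order is safe here
   because each lane only reads from higher, not-yet-updated lanes. */
static float warp_max_sim(float lane_val[32])
{
    for(unsigned int offset = 16; offset > 0; offset /= 2) {
        for(unsigned int lane = 0; lane < 32 - offset; lane++) {
            const float other = lane_val[lane + offset];
            lane_val[lane] = (other > lane_val[lane]) ? other : lane_val[lane];
        }
    }
    return lane_val[0];  /* lane 0 holds the warp-wide maximum */
}
```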

Argmax would look like (totally untested):

unsigned int lrLaneArgMax = 0;
scalar lrLaneMax = SCALAR_MIN;
for(unsigned int idx = lane; idx < group->size; idx += 32) {
    const scalar lVal = group->Val[batchOffset + idx];
    if(lVal > lrLaneMax) {
        lrLaneMax = lVal;
        lrLaneArgMax = idx;
    }
}
scalar lrMax = lrLaneMax;
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 16));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 8));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 4));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 2));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 1));

// Find which lane(s) provided max value
const unsigned int ballot = __ballot_sync(0xFFFFFFFF, lrMax == lrLaneMax);

// If this is the first lane (arbitrary decision), use the index of the max within the neurons we've considered
if(lane == (__ffs(ballot) - 1)) {
    group->ArgMax[batch] = lrLaneArgMax;
}
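A note on why the ballot works: after the shuffle cascade, each lane's `lrMax` covers lanes i..31, so `lrMax == lrLaneMax` holds exactly for lanes whose local max is at least as large as everything above them, and the lowest such lane (found via `__ffs`, which returns the 1-based position of the least-significant set bit, or 0 for no bits) is the lowest lane attaining the warp-wide max. The two intrinsics can be simulated on the host like this (a sketch, not CUDA or GeNN code):

```c
#include <assert.h>

/* Simulate __ballot_sync with a full mask: bit i of the result is set
   iff the predicate holds for lane i. */
static unsigned int ballot_sim(const int pred[32])
{
    unsigned int ballot = 0;
    for(unsigned int lane = 0; lane < 32; lane++) {
        if(pred[lane]) {
            ballot |= (1u << lane);
        }
    }
    return ballot;
}

/* Simulate __ffs: 1-based position of the least-significant set bit,
   or 0 if no bits are set. */
static int ffs_sim(unsigned int x)
{
    if(x == 0) {
        return 0;
    }
    int pos = 1;
    while(!(x & 1u)) {
        x >>= 1;
        pos++;
    }
    return pos;
}
```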

Most of this will be trivial and involve:
