delete use of cooperative groups in kernels #292

Open · karpathy opened this issue Apr 29, 2024 · 2 comments

@karpathy (Owner)

We use a lot of cooperative groups functionality in our kernels. The dependency is probably mildly convenient, but it is likely that the code could be written without it, with little added complexity and at the same speed. As a general principle, llm.c should ideally be very careful about the "dependency surface" of its code, which keeps it very portable, easy to skim/read even if slightly longer, and easy to run or port to any hardware: old, new, edge, exotic, or otherwise not yet thought of.

I would accept PRs that develop cooperative-groups-free kernels in dev/cuda that:

  1. aren't too much more complex or too many more LOC
  2. have the same speed

On top of dev/cuda I'd be happy to merge these into "mainline" train_gpt2.cu and the fp32 version train_gpt2_fp32.cu.

@ChrisDryden (Contributor)

Just posting some notes here from my research on how to remove all of the CG-related code and drop the dependency:

    sum = cg::reduce(warp, sum, cg::plus<float>{});

Can be replaced with the following:

    __device__ float warpReduceSum(float val) {
        // butterfly reduction: after 5 xor-shuffle steps, every lane
        // in the warp holds the sum of all 32 lanes' values
        for (int offset = 16; offset > 0; offset /= 2) {
            val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
        }
        return val;
    }

    sum = warpReduceSum(sum);

No explicit thread syncs are needed: __shfl_xor_sync already synchronizes the participating lanes within the warp.
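For completeness, here's a minimal sketch of how warpReduceSum composes into a block-wide sum without cooperative groups; the blockReduceSum name and the shared-memory staging are illustrative, not code from the repo:

    // hypothetical block-level reduction built on warpReduceSum() above;
    // assumes blockDim.x is a multiple of 32 and at most 1024
    __device__ float blockReduceSum(float val) {
        __shared__ float shared[32];            // one partial sum per warp
        int laneId = threadIdx.x % 32;
        int warpId = threadIdx.x / 32;
        int warpsPerBlock = blockDim.x / 32;

        val = warpReduceSum(val);               // step 1: reduce within each warp
        if (laneId == 0) shared[warpId] = val;  // lane 0 publishes its warp's partial
        __syncthreads();                        // cross-warp communication does need a barrier

        // step 2: the first warp reduces the per-warp partials
        val = (threadIdx.x < warpsPerBlock) ? shared[laneId] : 0.0f;
        if (warpId == 0) val = warpReduceSum(val);
        return val;                             // full sum, valid in all lanes of warp 0
    }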

Also, the other cooperative-groups variables can be replaced with plain index arithmetic (note that warpSize is already a built-in variable in CUDA device code, so it's cleaner to use a literal 32 or a differently named constant than to redeclare it):

    int laneId = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;
    int warpsPerBlock = blockDim.x / 32;

which map to the cooperative-groups accessors as:

    warp.thread_rank()     == laneId
    warp.size()            == 32
    warp.meta_group_size() == warpsPerBlock
    warp.meta_group_rank() == warpId
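As a concrete sketch of the substitution, a kernel prologue in the cooperative-groups style (the exact pattern varies per kernel; this one is illustrative) becomes plain index math:

    // cooperative-groups version (illustrative):
    //   namespace cg = cooperative_groups;
    //   cg::thread_block block = cg::this_thread_block();
    //   cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    //   int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();

    // CG-free equivalent:
    int laneId = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;
    int warpsPerBlock = blockDim.x / 32;
    int idx = blockIdx.x * warpsPerBlock + warpId;  // one data row per warp, as before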

I have replaced most of the kernels to test for performance impact, and I was not able to see any noticeable change from removing the cooperative groups.

@ngc92 (Contributor) commented May 2, 2024

In many cases, I also find it quite convenient to just use a block size of 32 in the x direction and put the rest in the y direction. Then threadIdx.x corresponds to laneId and threadIdx.y to warpId. This doesn't work when the kernel already needs the other block dimensions for something else.
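A minimal sketch of that layout (the kernel, a row-wise copy, is just a placeholder, and the dim3(32, 8) launch shape is an assumption):

    // one warp per row: threadIdx.x is the lane, threadIdx.y the warp
    __global__ void kernel2d(float* out, const float* in, int rows, int cols) {
        int laneId = threadIdx.x;                       // 0..31, since blockDim.x == 32
        int warpId = threadIdx.y;                       // which warp within the block
        int warpsPerBlock = blockDim.y;                 // no division or modulo needed
        int row = blockIdx.x * warpsPerBlock + warpId;  // one row per warp
        if (row >= rows) return;
        for (int c = laneId; c < cols; c += 32) {       // lanes stride across the row
            out[row * cols + c] = in[row * cols + c];
        }
    }

launched as, e.g., kernel2d<<<(rows + 7) / 8, dim3(32, 8)>>>(out, in, rows, cols);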
