delete use of cooperative groups in kernels #292
Comments
Just posting some notes here on my research of how to remove all of the CG related code to remove the dependency:
Can be replaced with the following
Without the need for any thread syncs. Also the other variables that are used can be replaced by the following:
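For illustration only (this is not the snippet originally posted in this comment), here is a minimal sketch of what such a replacement might look like. It assumes a 1D thread block whose size is a multiple of 32 and that all 32 lanes of each warp are active. The names `warp_reduce_sum_nocg`, `row_sum_kernel` and `row_sums` are hypothetical and do not come from llm.c.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Warp-level sum using shuffle intrinsics instead of cg::reduce.
// Assumption: all 32 lanes participate, hence the full mask.
__device__ inline float warp_reduce_sum_nocg(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(0xffffffffu, val, offset);
    }
    return val;  // after the butterfly, every lane holds the full sum
}

// Hypothetical kernel: one warp sums one row of inp (N rows, C columns).
__global__ void row_sum_kernel(float* out, const float* inp, int N, int C) {
    // Plain-index stand-ins for the cooperative-groups accessors (1D block):
    int lane_id         = threadIdx.x % 32;  // warp.thread_rank()
    int warp_id         = threadIdx.x / 32;  // warp.meta_group_rank()
    int warps_per_block = blockDim.x / 32;   // warp.meta_group_size()

    int row = blockIdx.x * warps_per_block + warp_id;
    if (row >= N) { return; }  // uniform per warp, so the full mask stays valid

    float sum = 0.0f;
    for (int i = lane_id; i < C; i += 32) {
        sum += inp[(size_t)row * C + i];
    }
    sum = warp_reduce_sum_nocg(sum);
    if (lane_id == 0) { out[row] = sum; }
}

// Host helper: copies data over, launches the kernel, returns per-row sums.
std::vector<float> row_sums(const std::vector<float>& h_inp, int N, int C) {
    float *d_inp = nullptr, *d_out = nullptr;
    cudaMalloc(&d_inp, (size_t)N * C * sizeof(float));
    cudaMalloc(&d_out, (size_t)N * sizeof(float));
    cudaMemcpy(d_inp, h_inp.data(), (size_t)N * C * sizeof(float), cudaMemcpyHostToDevice);

    const int block_size = 128;                  // must be a multiple of 32
    const int warps_per_block = block_size / 32;
    const int grid_size = (N + warps_per_block - 1) / warps_per_block;
    row_sum_kernel<<<grid_size, block_size>>>(d_out, d_inp, N, C);

    std::vector<float> h_out(N);
    cudaMemcpy(h_out.data(), d_out, (size_t)N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_inp);
    cudaFree(d_out);
    return h_out;
}
```

The shuffle loop needs no `__syncthreads()` because the `_sync` intrinsics synchronize the participating lanes themselves. The index arithmetic only matches the cooperative-groups values for a 1D block. For 2D or 3D blocks, the linear thread index would have to be computed first.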
I have replaced most of the kernel to test for a performance improvement, and I was not able to see any noticeable change from removing the cooperative groups.
In many cases, I also find it quite convenient to just have a blockSize of 32 in the x direction, and the rest in the y direction.
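A sketch of that layout, under the assumption that `blockDim.x` is exactly 32: threads are linearized x-first, so each value of `threadIdx.y` maps to exactly one warp, `threadIdx.x` is the lane index, and no `%` or `/` is needed. The kernel and helper names (`row_max_kernel`, `row_maxes`) are hypothetical.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one warp computes the max of one row (N rows, C columns).
// Assumes a launch with blockDim = (32, warps_per_block).
__global__ void row_max_kernel(float* out, const float* inp, int N, int C) {
    int lane_id = threadIdx.x;                         // 0..31, lane within the warp
    int row = blockIdx.x * blockDim.y + threadIdx.y;   // threadIdx.y is the warp index
    if (row >= N) { return; }                          // whole warp exits together

    float m = -FLT_MAX;
    for (int i = lane_id; i < C; i += 32) {
        m = fmaxf(m, inp[(size_t)row * C + i]);
    }
    // Butterfly max-reduction across the warp via shuffles.
    for (int offset = 16; offset > 0; offset /= 2) {
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, offset));
    }
    if (lane_id == 0) { out[row] = m; }
}

// Host helper: launches with a 32 x 4 block and returns the per-row maxima.
std::vector<float> row_maxes(const std::vector<float>& h_inp, int N, int C) {
    float *d_inp = nullptr, *d_out = nullptr;
    cudaMalloc(&d_inp, (size_t)N * C * sizeof(float));
    cudaMalloc(&d_out, (size_t)N * sizeof(float));
    cudaMemcpy(d_inp, h_inp.data(), (size_t)N * C * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 4);                        // 32 lanes in x, 4 warps in y
    int grid_size = (N + block.y - 1) / block.y;
    row_max_kernel<<<grid_size, block>>>(d_out, d_inp, N, C);

    std::vector<float> h_out(N);
    cudaMemcpy(h_out.data(), d_out, (size_t)N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_inp);
    cudaFree(d_out);
    return h_out;
}
```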
We use a lot of cooperative groups functionality in our kernels. This is an additional dependency that is likely mildly convenient, but it is also likely that the code could be written without it, without too much added complexity, and just as fast. As a general principle, llm.c should ideally be very careful about the "dependency surface" of its code, which would make it very portable, easy to skim/read even if slightly longer, and easy to run or port to any hardware: old, new, edge, exotic, or otherwise unthought of.
I would accept PRs that develop cooperative-groups-free kernels in dev/cuda that:
On top of dev/cuda, I'd be happy to merge these into "mainline" train_gpt2.cu and the fp32 version train_gpt2fp32.cu.