Error while running "make train_gpt2fp32cu" on Ubuntu #237
It might well be a case of the CUDA toolkit being too old; most people are using CUDA 12.2 or 12.3, and we're aiming for 12.4 here. NB: your CUDA toolchain version needs to match the kernel driver for this, it seems. Otherwise you get:
Also, this doesn't even work with 12.1:
Does anyone know how to modify the code to make it work with older versions of CUDA? My GPU is quite old and doesn't support CUDA 12; it supports up to CUDA 10.
I also still can't `make train_gpt2fp32cu`.
That's where the educational part of this project comes in...
The correct command to build it is:
make
If you have `nvcc` in your PATH, it will spit out errors on an older toolkit. You will just have to go over them one by one, read the CUDA docs and changelogs... post here, and maybe someone can help or offer advice. If you don't have it in the PATH, make sure you `export PATH=/usr/local/cuda/bin:$PATH`, or wherever your `nvcc` is located (`apt install mlocate; sudo updatedb; locate nvcc`).
You know that `python -m venv .venv; source .venv/bin/activate; pip install -r requirements.txt` will set up a Python environment for you, and that `python train_gpt2.py` will then train the same model in PyTorch, right? So that's a good benchmark, and it should work, because I assume PyTorch supports CUDA 10... if not, you can let it run on the CPU.
I think getting older GPUs to work is worthwhile, because they are still much faster than a CPU, and there are a lot of them out there. But maybe `tinygrad` will already work for your GPU? Try it and let us (and them) know!
As you can see from what I sent, I have CUDA version 12.3, but what you can't see is that CUDA version 12.4 is also installed. I did not mean to install both at once, but the CUDA installation instructions are scattered all over the place and hence not clear. Maybe I should remove 12.3? Advice appreciated.
Yes, just run
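With multiple CUDA toolkits installed side by side, a common way to select one at build time is to put the versioned directory first on the PATH rather than uninstalling anything. A minimal sketch, assuming the default `/usr/local/cuda-*` install locations:

```shell
# Toolkits install under versioned directories; /usr/local/cuda is usually a
# symlink to one of them. List what is actually installed:
ls -d /usr/local/cuda-* 2>/dev/null || echo "no versioned toolkits found"

# Select a specific version (12.4 here) for this shell session by putting its
# binaries and libraries first on the search paths:
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

# Verify which nvcc a subsequent `make` would pick up:
command -v nvcc || echo "nvcc not on PATH"
```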
I encountered a similar issue while compiling with CUDA 11.
/usr/local/cuda/bin/nvcc -O3 --use_fast_math train_gpt2.cu -lcublas -lcublasLt -o train_gpt2cu
train_gpt2.cu(105): error: identifier "__halves2bfloat162" is undefined
train_gpt2.cu(107): error: no instance of overloaded function "atomicAdd" matches the argument list
train_gpt2.cu(407): error: no instance of overloaded function "__ldcs" matches the argument list
train_gpt2.cu(408): error: no instance of overloaded function "__ldcs" matches the argument list
Next step: see if you can rewrite the function to use a different type, since the bfloat16 type doesn't seem to be defined on CUDA 10... so maybe you can make it work if you can create an atomicAdd that works without the native bfloat16 overload.
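If the missing piece is the `atomicAdd(__nv_bfloat162 *, __nv_bfloat162)` overload (as the errors above suggest), one classic workaround is to emulate the 32-bit atomic with `atomicCAS` over the raw bits. A sketch under the assumption that the toolkit at least ships `cuda_bf16.h` (CUDA 11+; CUDA 10 has no bfloat16 header at all), with a hypothetical helper name:

```cuda
#include <cuda_bf16.h>

// Hypothetical fallback for the missing atomicAdd overload on __nv_bfloat162:
// spin on a 32-bit compare-and-swap over the value's raw bits. Requires the
// target to be 4-byte aligned, which __nv_bfloat162 already guarantees.
__device__ void atomicAddBf162(__nv_bfloat162 *addr, __nv_bfloat162 val) {
    unsigned int *base = reinterpret_cast<unsigned int *>(addr);
    unsigned int old = *base;
    unsigned int assumed;
    do {
        assumed = old;
        // Reinterpret the current bits, do the add in fp32, and repack.
        __nv_bfloat162 cur = *reinterpret_cast<__nv_bfloat162 *>(&assumed);
        float2 a = __bfloat1622float2(cur);
        float2 b = __bfloat1622float2(val);
        __nv_bfloat162 sum = __floats2bfloat162_rn(a.x + b.x, a.y + b.y);
        unsigned int desired = *reinterpret_cast<unsigned int *>(&sum);
        // atomicCAS returns the prior value; loop until no other thread raced us.
        old = atomicCAS(base, assumed, desired);
    } while (old != assumed);
}
```

This trades performance for portability: under contention the CAS loop retries, whereas the native overload is a single hardware atomic.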
Try this at line 97:
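For the `__halves2bfloat162` error specifically, the packing it performs is simple enough to write by hand. One possible shape for such a shim (hypothetical helper name; this is an assumption, not necessarily the change suggested here):

```cuda
#include <cuda_bf16.h>

// Hypothetical stand-in for __halves2bfloat162 on toolkits where the
// intrinsic is undefined: pack two bfloat16 values into the low and high
// halves of a __nv_bfloat162 via its public members.
__device__ __nv_bfloat162 halves2bfloat162_compat(__nv_bfloat16 lo,
                                                  __nv_bfloat16 hi) {
    __nv_bfloat162 r;
    r.x = lo;   // low half
    r.y = hi;   // high half
    return r;
}
```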
As for the `__ldcs` errors, from https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html:
Looks like you can just change them all. EDIT: Rename the issue to "Error while running "make" for train_gpt2fp32cu on Cuda 10"
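The `__ldcs` errors have a similar flavor: older toolkits overload `__ldcs` for the built-in integer and float types but not for `__nv_bfloat162`. Since the payload is 32 bits either way, one workaround (a sketch with a hypothetical helper name) is to issue the streaming load on an `unsigned int` and reinterpret:

```cuda
#include <cuda_bf16.h>

// Hypothetical fallback for __ldcs(const __nv_bfloat162 *) on toolkits that
// lack the overload: do the streaming (cache-bypassing) load on the 32-bit
// raw payload, then reinterpret the bits as a bfloat16 pair.
__device__ __nv_bfloat162 ldcs_bf162(const __nv_bfloat162 *p) {
    unsigned int bits = __ldcs(reinterpret_cast<const unsigned int *>(p));
    return *reinterpret_cast<__nv_bfloat162 *>(&bits);
}
```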
@dagelf thanks! I updated CUDA to version 12.3.107, compiled successfully, and it runs normally.
Still awaiting confirmation that the above changes make it run on CUDA 10... which might be useful if there are any older Nvidia cards that don't work with CUDA 12... are there? This issue should be named "Error while running "make" for train_gpt2fp32cu on Cuda 10"
FYI, on Ubuntu 20.04 / CUDA 11.2 / cuDNN 8 / RTX 4090D 24G, I got this error when trying to run `USE_CUDNN=1 make train_gpt2cu`. The error prints as:
---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -O3 -t=0 --use_fast_math -DENABLE_CUDNN -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/root/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include -lmpi -lnccl -lcublas -lcublasLt -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -o train_gpt2cu
/root/cudnn-frontend/include/cudnn_frontend_utils.h(96): error: namespace "std" has no member "variant"
...
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1470): error: identifier "is_inference" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1482): error: identifier "attn_scale_value" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1526): error: identifier "dropout_probability" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: qualified name is not allowed
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: this declaration has no storage class or type specifier
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1552): error: expected a ";"
...
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1584): error: identifier "is_inference" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1596): error: identifier "attn_scale_value" is undefined
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1616): error: qualified name is not allowed
/root/cudnn-frontend/include/cudnn_frontend/node/../graph_properties.h(1616): error: this declaration has no storage class or type specifier
Error limit reached.
100 errors detected in the compilation of "train_gpt2.cu".
Compilation terminated.

From the above, do I have to update CUDA to at least 12.3.107 and update my cuDNN version accordingly? With the same versions of CUDA and cuDNN, but running make without cuDNN support, I still got an error:
make train_gpt2cu
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -O3 -t=0 --use_fast_math -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include -lmpi -lnccl -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -o train_gpt2cu
train_gpt2.cu(212): error: identifier "__ushort_as_bfloat16" is undefined
train_gpt2.cu(212): error: identifier "__halves2bfloat162" is undefined
train_gpt2.cu(214): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (__nv_bfloat162 *, __nv_bfloat162)
train_gpt2.cu(253): error: no operator "+=" matches these operands
operand types are: floatX += float
train_gpt2.cu(267): warning #20012-D: __device__ annotation is ignored on a function("Packed128") that is explicitly defaulted on its first declaration
...
train_gpt2.cu(1348): error: no operator "+=" matches these operands
operand types are: floatX += floatX
train_gpt2.cu(82): warning #177-D: variable "ncclFloatN" was declared but never referenced
20 errors detected in the compilation of "train_gpt2.cu".

I'll try updating CUDA and will update here if its version is to blame.
Same error.
Does it work without USE_CUDNN? |
I am trying to train using CUDA but while running
make train_gpt2fp32cu
i get this error:
OpenMP found, compiling with OpenMP support
nvcc found, including CUDA builds
/usr/bin/nvcc -O3 --use_fast_math train_gpt2_fp32.cu -lcublas -lcublasLt -o train_gpt2fp32cu
nvcc fatal : Path to libdevice library not specified
make: *** [Makefile:94: train_gpt2fp32cu] Error 1
Clearly the error is not in OpenMP, and I have CUDA installed as well; the installed CUDA toolkit is v10.1.243.
Does anyone know how to resolve this?
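The `libdevice` failure above typically means `nvcc` came from the distro package (hence `/usr/bin/nvcc`) rather than from a full toolkit tree, so it cannot locate the bundled `libdevice` bitcode. A hedged sketch of two common fixes; the paths are typical defaults for Ubuntu and may differ on your machine:

```shell
# Fix 1: prefer a full toolkit install, whose nvcc knows where its own
# libdevice lives. Prepend its bin directory so it shadows /usr/bin/nvcc:
export PATH=/usr/local/cuda/bin:$PATH

# Fix 2: keep the packaged nvcc but tell it where the bitcode is
# (typical location for Ubuntu's nvidia-cuda-toolkit package):
#   nvcc --libdevice-directory=/usr/lib/nvidia-cuda-toolkit/libdevice ...

# Check which nvcc `make` will now resolve:
command -v nvcc || echo "nvcc not on PATH"
```

Note that even with this fixed, the thread above suggests a 10.1 toolkit is too old for the default build anyway.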