[Bug]: WSL CUDA out of Memory when Trying to Load GGUF Model #360
Comments
Try disabling CUDA graphs.
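A minimal sketch of what that could look like, assuming a vLLM-style CLI where `--enforce-eager` is the flag that turns off CUDA graph capture (the flag name and model path are assumptions, not taken from the thread):

```bash
# Hypothetical invocation; --enforce-eager (vLLM-style) disables CUDA graph
# capture and runs the model in eager mode, skipping the extra memory
# reserved while graphs are captured.
python api_server.py --model /path/to/model --enforce-eager
```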
The same error occurs. It appears that one of the Ray workers is trying to allocate way too much memory.
You may need to lower your context length by specifying `--max-model-len 4096`.
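Combining this with the earlier suggestion might look like the sketch below (again assuming vLLM-style flags; the model path is a placeholder):

```bash
# Hypothetical launch; --max-model-len caps the context window, which
# shrinks the KV cache each GPU must allocate, and --enforce-eager
# disables CUDA graph capture.
python api_server.py --model /path/to/model \
    --max-model-len 4096 \
    --enforce-eager
```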
Different error after lowering the context length.
I'm getting CUDA OOM; the system has 128GB of system memory available.
The error appears to be happening on line 563 in api_server.py. I'm not sure exactly where it fails, since the debug code I've put in doesn't yield any messages in the console.
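One common reason debug prints never surface is stdout buffering inside Ray worker processes; a generic sketch (a suggestion, not from the thread) of output that is harder to lose:

```python
import sys

# Write to stderr and flush immediately; buffered stdout from Ray worker
# processes is often delayed or dropped, so a plain print() may never
# appear in the console.
print("DEBUG: reached model load", file=sys.stderr, flush=True)
```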
Sorry, I've been away for a while. Have you tried the Docker image? This is probably a WSL issue. GPU Docker on Windows uses WSL too, but who knows...
Not sure how to run a Docker image on Windows. Hopefully official support for Windows will be added in the future.
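For reference, a minimal sketch of running a GPU container through Docker Desktop on Windows, which itself uses the WSL2 backend; the image name and port are placeholders, not the project's actual ones:

```bash
# Run from PowerShell or a WSL shell with Docker Desktop installed.
# --gpus all passes the NVIDIA GPUs through to the container; the image
# name and port below are hypothetical placeholders.
docker run --gpus all -p 8000:8000 example/inference-engine:latest
```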
Your current environment
Please note that the system is WSL on Windows 11, so env.py is not able to gather the correct CUDA version due to a known bug. The CUDA version installed is 12.1.
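As an aside, the installed toolkit version can be confirmed inside WSL with the standard CUDA tools (nothing project-specific):

```bash
# nvcc reports the CUDA toolkit actually installed in the WSL distro;
# nvidia-smi reports the driver's supported CUDA version, which can differ.
nvcc --version
nvidia-smi
```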
🐛 Describe the bug
Trying to run a GGUF model that has been converted to safetensors results in a CUDA out-of-memory error. This occurs after some of the Ray workers have finished loading the model. Four RTX 2080 Ti 22GB cards were used, which should provide 88GB of VRAM in total.
Launch Parameters
Error Log