
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error' #668

Open
jldroid19 opened this issue Apr 11, 2024 · 8 comments
Labels
type/bug Bug in code

Comments

@jldroid19

🐛 Bug

q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

Error
None

Git Version
fatal: not a git repository (or any of the parent directories): .git

To Reproduce

I'm not sure why this is happening; it's hard to reproduce.

LLM Studio version

v1.4.0-dev
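
For context, GPUtil builds its GPU list by parsing nvidia-smi output, and the int() call in the trace fails because nvidia-smi is returning the NVML error string instead of device IDs. Below is a minimal sketch of how the call in get_gpu_usage (the function named in the trace) could be guarded so the home screen degrades instead of crashing; the 0.0 fallback and the averaging are assumptions, not the actual LLM Studio implementation:

```python
import GPUtil


def get_gpu_usage() -> float:
    """Average GPU load in percent; falls back to 0.0 when NVML is unavailable."""
    try:
        gpus = GPUtil.getGPUs()
    except ValueError:
        # GPUtil parses nvidia-smi's CSV output with int(); when the driver answers
        # "Failed to initialize NVML: Unknown Error", that parse raises ValueError.
        return 0.0
    if not gpus:
        return 0.0
    # gpu.load is a 0-1 fraction reported by GPUtil.
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100
```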

@jldroid19 jldroid19 added the type/bug Bug in code label Apr 11, 2024
@psinger
Collaborator

psinger commented Apr 11, 2024

This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
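
For reference, GPUtil shells out to nvidia-smi under the hood, so running the same query by hand inside the container exercises the same code path. A minimal check, assuming nvidia-smi is on the PATH; the query fields are illustrative, not the exact set GPUtil requests:

```python
import subprocess

# If the container has lost access to the driver, this prints
# "Failed to initialize NVML: Unknown Error" instead of one CSV row per GPU.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout or result.stderr)
```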

@jldroid19
Author

(screenshots attached)

What's interesting is that the environment just suddenly drops. It's like the GPUs just disappear after a few hours of training.

@psinger
Collaborator

psinger commented Apr 15, 2024

This seems to be an issue with your environment/system then, unfortunately.

@psinger
Collaborator

psinger commented Apr 22, 2024

@jldroid19 did you figure the issue out?

@jldroid19
Author

@psinger I have not.

@psinger
Collaborator

psinger commented Apr 24, 2024

Are you running this in Docker?

@jldroid19
Author

Are you running this in Docker?

Yes, I am running it using Docker. It's strange: we can run a dataset with an expected finish time of 5 days and it will finish. We then go to start another experiment, and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training that had been going is lost.
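
Since the drop only happens hours into a run, it may help to log exactly when the container loses GPU access. A minimal watcher sketch to run inside the container alongside training; the poll interval and log path are arbitrary choices, not anything LLM Studio provides:

```python
import datetime
import subprocess
import time

LOG_PATH = "/tmp/gpu_watch.log"  # hypothetical location

while True:
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if proc.returncode != 0 or "Failed to initialize NVML" in output:
        # Record the timestamp so it can be correlated with host-side events.
        with open(LOG_PATH, "a") as f:
            f.write(f"{datetime.datetime.now().isoformat()} GPU access lost: {output!r}\n")
    time.sleep(60)  # poll once a minute
```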

@psinger
Collaborator

psinger commented May 3, 2024

I stumbled upon this recently, might be related:
NVIDIA/nvidia-docker#1469

NVIDIA/nvidia-container-toolkit#465 (comment)

There seems to be a known issue with GPUs suddenly disappearing from Docker containers.
