
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error' #668

Open
jldroid19 opened this issue Apr 11, 2024 · 8 comments
Labels
type/bug Bug in code

Comments

@jldroid19

🐛 Bug

q.app
q.user
q.client
report_error: True
q.events
q.args
report_error: True
stacktrace
Traceback (most recent call last):
  File "/workspace/./llm_studio/app_utils/handlers.py", line 78, in handle
    await home(q)
  File "/workspace/./llm_studio/app_utils/sections/home.py", line 66, in home
    stats.append(ui.stat(label="Current GPU load", value=f"{get_gpu_usage():.1f}%"))
  File "/workspace/./llm_studio/app_utils/utils.py", line 1949, in get_gpu_usage
    all_gpus = GPUtil.getGPUs()
  File "/home/llmstudio/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

Error
None

Git Version
fatal: not a git repository (or any of the parent directories): .git

To Reproduce

I'm not sure why this is happening; it's hard to reproduce.

LLM Studio version

v1.4.0-dev
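
For context, GPUtil builds its GPU list by parsing nvidia-smi output, and the int() call in the trace fails because nvidia-smi is returning the NVML error string instead of device IDs. Below is a minimal sketch of how the call in get_gpu_usage (the function named in the trace) could be guarded so the home screen degrades instead of crashing; the 0.0 fallback and the averaging are assumptions, not the actual LLM Studio implementation:

```python
import GPUtil


def get_gpu_usage() -> float:
    """Average GPU load in percent; falls back to 0.0 when NVML is unavailable."""
    try:
        gpus = GPUtil.getGPUs()
    except ValueError:
        # GPUtil parses nvidia-smi's CSV output with int(); when the driver answers
        # "Failed to initialize NVML: Unknown Error", that parse raises ValueError.
        return 0.0
    if not gpus:
        return 0.0
    # gpu.load is a 0-1 fraction reported by GPUtil.
    return sum(gpu.load for gpu in gpus) / len(gpus) * 100
```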

@jldroid19 jldroid19 added the type/bug Bug in code label Apr 11, 2024
@psinger
Collaborator

psinger commented Apr 11, 2024

This means you have no GPUs available. Can you run nvidia-smi to confirm everything is fine?
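
For reference, GPUtil shells out to nvidia-smi under the hood, so running the same query by hand inside the container exercises the same code path. A minimal check, assuming nvidia-smi is on the PATH; the query fields are illustrative, not the exact set GPUtil requests:

```python
import subprocess

# If the container has lost access to the driver, this prints
# "Failed to initialize NVML: Unknown Error" instead of one CSV row per GPU.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,utilization.gpu", "--format=csv,noheader,nounits"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout or result.stderr)
```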

@jldroid19
Author

(screenshots attached)

What's interesting is that the environment just suddenly drops. It's like the GPUs just disappear after a few hours of training.

@psinger
Collaborator

psinger commented Apr 15, 2024

This seems to be an issue with your environment/system then, unfortunately.

@psinger
Collaborator

psinger commented Apr 22, 2024

@jldroid19 did you figure the issue out?

@jldroid19
Author

@psinger I have not.

@psinger
Collaborator

psinger commented Apr 24, 2024

Are you running this in Docker?

@jldroid19
Author

Are you running this in Docker?

Yes, I am running it using Docker. It's strange: we can run a dataset with an expected finish time of 5 days and it will finish. We then go to start another experiment, and 3 hours later the container stops, causing the experiment to fail. With a quick docker restart the app is back up and running, but the training that had been going is lost.
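
Since the drop only happens hours into a run, it may help to log exactly when the container loses GPU access. A minimal watcher sketch to run inside the container alongside training; the poll interval and log path are arbitrary choices, not anything LLM Studio provides:

```python
import datetime
import subprocess
import time

LOG_PATH = "/tmp/gpu_watch.log"  # hypothetical location

while True:
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if proc.returncode != 0 or "Failed to initialize NVML" in output:
        # Record the timestamp so it can be correlated with host-side events.
        with open(LOG_PATH, "a") as f:
            f.write(f"{datetime.datetime.now().isoformat()} GPU access lost: {output!r}\n")
    time.sleep(60)  # poll once a minute
```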

@psinger
Collaborator

psinger commented May 3, 2024

I stumbled upon this recently, might be related:
NVIDIA/nvidia-docker#1469

NVIDIA/nvidia-container-toolkit#465 (comment)

There seems to be a known issue with GPUs suddenly disappearing from Docker containers.
