CUDA error: unknown error #5482

liuhui0401 · 2024-04-30T04:47:37Z

While fine-tuning G-LLaVA on 8 A100s, I have hit this error several times.

The full trace is here
https://drive.google.com/file/d/195PO96uWKnx4LE3BWjxm0DsrQxWbj3QP/view?usp=sharing

The script is here
https://github.com/pipilurj/G-LLaVA/blob/main/scripts

The first stage (run_alignment.sh) fine-tuned without problems, but the second stage (run_qa.sh) fails with the error above. Now, when I run "nvidia-smi" in the terminal, it shows "Unable to determine the device handle for GPU 0000:4F:00.0: Unknown Error". Can anyone help me solve this? Thank you!

loadams · 2024-04-30T21:21:10Z

@liuhui0401 - this seems like a CUDA error, or a bad state that the GPUs are in. If you power cycle the machine, does nvidia-smi work?
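
One way to run that check after a reboot is sketched below. It is only an illustration, not part of DeepSpeed or G-LLaVA: the helper names `failed_gpus` and `check_nvidia_smi` are hypothetical, and the regular expression assumes the message has the same shape as the one quoted in this issue.

```python
import re
import shutil
import subprocess

# Assumption: the error looks like the line quoted in this issue, e.g.
# "Unable to determine the device handle for GPU 0000:4F:00.0: Unknown Error"
HANDLE_ERROR = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def failed_gpus(text):
    """Return the PCI bus IDs of GPUs named in device-handle errors."""
    return HANDLE_ERROR.findall(text)


def check_nvidia_smi():
    """Run nvidia-smi once and return (healthy, details).

    healthy is True only if the command exits with status 0 and no
    device-handle error appears in its output.
    """
    if shutil.which("nvidia-smi") is None:
        return False, ["nvidia-smi not found on PATH"]
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, [f"nvidia-smi did not complete: {exc}"]
    bad = failed_gpus(result.stdout + result.stderr)
    return result.returncode == 0 and not bad, bad


if __name__ == "__main__":
    healthy, details = check_nvidia_smi()
    print("healthy" if healthy else f"unhealthy: {details}")
```

Running it right after the power cycle and again after a failed fine-tuning run would show whether the GPU at 0000:4F:00.0 only drops out once training has started.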

liuhui0401 · 2024-05-01T01:00:32Z

> @liuhui0401 - this seems like a CUDA error, or a bad state that the GPUs are in. If you power cycle the machine, does nvidia-smi work?

Yes. But when I fine-tune again, the same problem comes back. I don't know the cause.

loadams · 2024-05-06T16:36:04Z

I see. Which CUDA version are you currently using, and can you try a newer version as well?
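
For reference, a minimal sketch of one way to collect the versions being asked about. It assumes `nvidia-smi` and `nvcc` may or may not be on PATH and that PyTorch may or may not be installed. The helper names `run` and `collect_versions` are hypothetical and exist only for this example.

```python
import shutil
import subprocess


def run(cmd):
    """Return the stripped stdout of cmd, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_versions():
    """Gather driver, toolkit, and PyTorch CUDA versions where available."""
    info = {
        # One line per GPU; a GPU in a bad state may make this return None.
        "driver": run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # CUDA toolkit version, if the toolkit is installed.
        "nvcc": run(["nvcc", "--version"]),
    }
    try:
        import torch  # optional; only reported if it imports cleanly

        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    except Exception:
        info["torch"] = info["torch_cuda"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed output into the issue gives the driver version, the toolkit version, and the CUDA version PyTorch was built against in one place.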

loadams self-assigned this May 6, 2024
