
"Could not find executable nvidia-smi" for ./configure.py --backend=CUDA #12017

Open
joelberkeley opened this issue May 1, 2024 · 6 comments

Comments


joelberkeley commented May 1, 2024

When I run ./configure.py --backend=CUDA in the TensorFlow Docker container, as specified in the docs, I get

tf-docker /spidr/spidr/backend/xla > ./configure.py --backend=CUDA
INFO:root:Trying to find path to clang...
INFO:root:Found path to clang at /usr/lib/llvm-17/bin/clang
INFO:root:Running echo __clang_major__ | /usr/lib/llvm-17/bin/clang -E -P -
INFO:root:/usr/lib/llvm-17/bin/clang reports major version 17.
INFO:root:Trying to find path to nvidia-smi...
INFO:root:Could not find nvidia-smi, or nvidia-smi command failed. Please pass capabilities directly using --cuda_compute_capabilities.
Traceback (most recent call last):
  File "/spidr/spidr/backend/xla/./configure.py", line 538, in <module>
    raise SystemExit(main())
  File "/spidr/spidr/backend/xla/./configure.py", line 516, in main
    bazelrc_lines = config.to_bazelrc_lines(
  File "/spidr/spidr/backend/xla/./configure.py", line 349, in to_bazelrc_lines
    dpav.get_relevant_paths_and_versions(self)
  File "/spidr/spidr/backend/xla/./configure.py", line 256, in get_relevant_paths_and_versions
    self.cuda_compute_capabilities = _get_cuda_compute_capabilities_or_die()
  File "/spidr/spidr/backend/xla/./configure.py", line 124, in _get_cuda_compute_capabilities_or_die
    raise e
  File "/spidr/spidr/backend/xla/./configure.py", line 107, in _get_cuda_compute_capabilities_or_die
    nvidia_smi = _find_executable_or_die("nvidia-smi")
  File "/spidr/spidr/backend/xla/./configure.py", line 87, in _find_executable_or_die
    raise RuntimeError(
RuntimeError: Could not find executable `nvidia-smi`! Please change your $PATH or pass the path directly like`--nvidia-smi_path=path/to/executable.

I don't see this error if I pass --gpus all in the docker run command. I believe using that option requires the NVIDIA container runtime.

I guess the docs need updating, but I'd like to build XLA targets (specifically the CUDA PJRT plugin) without access to GPUs or the NVIDIA container runtime, so I can build them in GitHub Actions. My minimal understanding of CUDA says this should be possible.

Do I even need to run ./configure.py for that target? I have so far been unable to get the CUDA plugin working, and wonder if this error might be the problem.


beckerhe commented May 2, 2024

My minimal understanding of CUDA says this should be possible.

Yeah, agreed. That should be possible. I would try bypassing the configure script and calling Bazel directly in your container.

I believe bazel build --config release_gpu_linux //path/to/target should do what you want.


joelberkeley commented May 2, 2024

Thanks. That built, but my CUDA tests are still failing.


beckerhe commented May 3, 2024

Can you share the log?

Also, didn't you say you only wanted to build things? For running the tests you will need a GPU.


joelberkeley commented May 3, 2024

Here's the log:

2024-05-03 11:22:21.865095: E xla/stream_executor/cuda/cuda_dnn.cc:536] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2024-05-03 11:22:21.865160: E xla/stream_executor/cuda/cuda_dnn.cc:540] Memory usage: 7272202240 bytes free, 8497594368 bytes total.
2024-05-03 11:22:21.871304: E xla/stream_executor/cuda/cuda_dnn.cc:536] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2024-05-03 11:22:21.871379: E xla/stream_executor/cuda/cuda_dnn.cc:540] Memory usage: 7272202240 bytes free, 8497594368 bytes total.

with PJRT_Error message:

"DNN library initialization failed. Look at the errors above for more details."

It occurs on PJRT_Client_Create.

Also, didn't you say you only wanted to build things? For running the tests you will need a GPU.

Sort of. I am ultimately running it in a GPU environment, but before that I'm building it without a GPU. I have tried several different environments and several different argument sets to PJRT_Client_Create; none work. This GitHub issue was aimed at fixing the build step, but I'm still seeing the error. I think it must be something to do with either the build or the runtime configuration, for which I've raised a question in the Google group.


beckerhe commented May 3, 2024

That should work. Unfortunately, a failing cuDNN initialization can have many possible causes.

I would recommend enabling debug logging for cuDNN. This can be done by setting some environment variables; see https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html.

(Note that the environment variables changed with cuDNN 9.0.0, so if you use a version prior to 9.0.0, check the docs for your version in the NVIDIA docs archive.)

I can't really comment on PJRT and its options. I'm not an expert on that.


joelberkeley commented May 3, 2024

It's working! God, that took too long. Thanks for your help. I ran it in the same container I built it in (with extra stuff installed), so I assume there's a missing package or a version conflict in my original runtime environment. I'll investigate, but seeing it work is extremely promising. I'll leave this ticket open so the docs and/or configure.py can be updated, but I may be able to handle the rest from here.
