
"Could not find executable nvidia-smi" for ./configure.py --backend=CUDA #12017

Open
joelberkeley opened this issue May 1, 2024 · 6 comments

Comments


joelberkeley commented May 1, 2024

When I run ./configure.py --backend=CUDA in the TensorFlow Docker container, as specified in the docs, I get

tf-docker /spidr/spidr/backend/xla > ./configure.py --backend=CUDA
INFO:root:Trying to find path to clang...
INFO:root:Found path to clang at /usr/lib/llvm-17/bin/clang
INFO:root:Running echo __clang_major__ | /usr/lib/llvm-17/bin/clang -E -P -
INFO:root:/usr/lib/llvm-17/bin/clang reports major version 17.
INFO:root:Trying to find path to nvidia-smi...
INFO:root:Could not find nvidia-smi, or nvidia-smi command failed. Please pass capabilities directly using --cuda_compute_capabilities.
Traceback (most recent call last):
  File "/spidr/spidr/backend/xla/./configure.py", line 538, in <module>
    raise SystemExit(main())
  File "/spidr/spidr/backend/xla/./configure.py", line 516, in main
    bazelrc_lines = config.to_bazelrc_lines(
  File "/spidr/spidr/backend/xla/./configure.py", line 349, in to_bazelrc_lines
    dpav.get_relevant_paths_and_versions(self)
  File "/spidr/spidr/backend/xla/./configure.py", line 256, in get_relevant_paths_and_versions
    self.cuda_compute_capabilities = _get_cuda_compute_capabilities_or_die()
  File "/spidr/spidr/backend/xla/./configure.py", line 124, in _get_cuda_compute_capabilities_or_die
    raise e
  File "/spidr/spidr/backend/xla/./configure.py", line 107, in _get_cuda_compute_capabilities_or_die
    nvidia_smi = _find_executable_or_die("nvidia-smi")
  File "/spidr/spidr/backend/xla/./configure.py", line 87, in _find_executable_or_die
    raise RuntimeError(
RuntimeError: Could not find executable `nvidia-smi`! Please change your $PATH or pass the path directly like`--nvidia-smi_path=path/to/executable.

I don't see this error if I pass --gpus all in the docker run command. I believe using that option requires the NVIDIA container runtime.

I guess the docs need updating, but I'd like to build XLA targets (specifically the CUDA PJRT plugin) without access to GPUs or the NVIDIA container runtime, so I can build them in GitHub Actions. My minimal understanding of CUDA says this should be possible.

Do I even need to run ./configure.py for that target? I have so far been unable to get the CUDA plugin working, and wonder if this error might be the problem.


beckerhe commented May 2, 2024

My minimal understanding of CUDA says this should be possible.

Yeah, agreed. That should be possible. I would try bypassing the configure script and calling Bazel directly in your container.

I believe bazel build --config release_gpu_linux //path/to/target should do what you want.


joelberkeley commented May 2, 2024

Thanks. That built, but my CUDA tests are still failing.


beckerhe commented May 3, 2024

Can you share the log?

Also, didn't you say you only wanted to build things? For running the tests you will need a GPU.


joelberkeley commented May 3, 2024

Here's the log:

2024-05-03 11:22:21.865095: E xla/stream_executor/cuda/cuda_dnn.cc:536] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2024-05-03 11:22:21.865160: E xla/stream_executor/cuda/cuda_dnn.cc:540] Memory usage: 7272202240 bytes free, 8497594368 bytes total.
2024-05-03 11:22:21.871304: E xla/stream_executor/cuda/cuda_dnn.cc:536] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2024-05-03 11:22:21.871379: E xla/stream_executor/cuda/cuda_dnn.cc:540] Memory usage: 7272202240 bytes free, 8497594368 bytes total.

with PJRT_Error message:

"DNN library initialization failed. Look at the errors above for more details."

It occurs on PJRT_Client_Create.

Also, didn't you say you only wanted to build things? For running the tests you will need a GPU.

Sort of. I am ultimately running it in a GPU environment, but before that I'm building it without a GPU. I have tried several different environments and several different argument sets to PJRT_Client_Create; none work. This GitHub issue was aimed at fixing the build step, but I'm still seeing the error. I think it must be something to do with either the build or the runtime configuration, for which I've raised a question in the Google group.


beckerhe commented May 3, 2024

That should work. Unfortunately, a failing cuDNN initialization can have many possible causes.

I would recommend enabling debug logging for cuDNN. This can be done by setting some environment variables; see https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html.

(Note that the environment variables changed with cuDNN 9.0.0, so if you use a version prior to 9.0.0, check the docs for your version in the NVIDIA docs archive.)

I can't really comment on PJRT and its options. I'm not an expert on that.


joelberkeley commented May 3, 2024

It's working! God, that took too long. Thanks for your help. I ran it in the same container I built it in (with extra stuff installed), so I assume there's a missing package or a version conflict in my original runtime environment. I'll investigate, but seeing it work is extremely promising. I'll leave this ticket open so the docs and/or configure.py can be updated, but I may be able to handle the rest from here.
