libcudart.so.10.1 (and others) are not in the built docker image #497

Open
gaborvecsei opened this issue Sep 14, 2020 · 6 comments

@gaborvecsei

Steps to reproduce the error:

  • Docker image is built with: python perfzero/lib/setup.py --tensorflow_pip_spec=tensorflow==2.3.0
  • Start image with: docker run -it --gpus all --rm -v $(pwd):/workspace perfzero/tensorflow bash
  • When you are inside the container, execute: python3 /workspace/perfzero/lib/benchmark.py --git_repos="https://github.com/tensorflow/models.git;benchmark" --python_path=models --gcloud_key_file_url="" --benchmark_methods=official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat

The benchmark starts, but it runs only on CPUs because of the following error:

Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2020-09-14 08:21:40.392385: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:40.392413: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-09-14 08:21:41,815 INFO: Adding path models to sys.path
2020-09-14 08:21:41,818 INFO: Checking out repository from https://github.com/tensorflow/models.git to /workspace/perfzero/workspace/site-packages/models
2020-09-14 08:21:43,650 INFO: Checked-out repository from https://github.com/tensorflow/models.git to /workspace/perfzero/workspace/site-packages/models
2020-09-14 08:21:43,698 INFO: The following benchmark methods will be executed: ['official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat']
2020-09-14 08:21:43,698 INFO: The following benchmark methods will be executed: ['official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat']
Setup complete. Running 1 trials
Running trial 1 / 1
2020-09-14 08:21:43,715 INFO: Created directory /workspace/perfzero/workspace/output/2020-09-14-08-21-43-714984
2020-09-14 08:21:43,715 INFO: Created directory /workspace/perfzero/workspace/output/2020-09-14-08-21-43-714984
2020-09-14 08:21:43,767 INFO: root_data_dir: None
2020-09-14 08:21:43,767 INFO: root_data_dir: None
2020-09-14 08:21:43,767 INFO: Started process information tracker.
2020-09-14 08:21:43,767 INFO: Started process information tracker.
2020-09-14 08:21:43,767 INFO: Starting benchmark execution: official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
2020-09-14 08:21:43,767 INFO: Starting benchmark execution: official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
2020-09-14 08:21:43.775096: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-14 08:21:44.235446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.237785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:1b:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.240085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.242330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 3 with properties:
pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.244621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 4 with properties:
pciBusID: 0000:88:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.246886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 5 with properties:
pciBusID: 0000:89:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.249171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 6 with properties:
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.251456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 7 with properties:
pciBusID: 0000:b3:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.72GiB deviceMemoryBandwidth: 836.37GiB/s
2020-09-14 08:21:44.251562: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251688: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251774: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251829: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251885: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.251939: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-09-14 08:21:44.282221: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-14 08:21:44.282238: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Searching for "libcudart.so.*" (find / -name "libcudart.so.*") gives the following results:

root@9c9ea0bde86d:/workspace# find / -name "libcudart.so.*"

/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130

So only the wrong version (10.0) is installed, while TensorFlow 2.3.0 tries to load libcudart.so.10.1.
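
To see at a glance which libraries a log like the one above complains about, here is a minimal Python sketch. It assumes the warning text has the exact "Could not load dynamic library '...'" form shown in this issue; the helper name `missing_libraries` is hypothetical and not part of PerfZero or TensorFlow.

```python
import re

# Pattern for the dso_loader warning as it appears in the log above.
# Assumption: other TensorFlow versions may word this message differently.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the distinct library names reported as unloadable, in first-seen order."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two shortened lines in the same shape as the log in this issue.
sample = (
    "W dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: ...\n"
    "W dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: ...\n"
)
print(missing_libraries(sample))
```

Saving the benchmark output to a file and passing its contents to the function gives the list of names to compare against the `find` results.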

@lindong28
Contributor

@reedwm Toby is the person who added Docker support for PerfZero. I don't have experience with Docker, and I don't know which services in our infra use it. I also don't know who can maintain this feature now that Toby has left the project.

@gaborvecsei
Author

@lindong28 I can look into the code and create a PR with the necessary changes

@lindong28
Contributor

Thank you @gaborvecsei for offering to fix this issue!

If the PR is easy to review (e.g. it just changes a version), that would be great and I can simply approve it. If the PR involves something that requires Docker expertise, I will ask around and see who can help with it.

@TobiasMei

I have the same problem.
I tried different containers, but nothing worked. I also tried installing the version named in the Dockerfile.
Neither got the benchmark to run on the GPU; only the CPU is used.

python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_v2
python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_v2 --tensorflow_pip_spec=tensorflow-gpu==2.1.0
nvidia-docker run -it --rm -v $(pwd):/workspace -v /data:/data perfzero/tensorflow bash

When I then run the benchmark, the error is similar:

root@cb1e8eb587b0:/# python3 /workspace/perfzero/lib/benchmark.py --git_repos="https://github.com/tensorflow/models.git;benchmark" --python_path=models --gcloud_key_file_url="" --benchmark_methods=official.benchmark.keras_cifar_benchmark.Resnet56KerasBenchmarkSynth.benchmark_1_gpu_no_dist_strat
Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2021-04-19 06:35:56.185318: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-19 06:35:56.185416: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-19 06:35:56.185429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

When I search for libcudart*, I get:

root@cb1e8eb587b0:/# find / -name "libcudart.so.*"
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1.243
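
A file being present on disk does not by itself mean the dynamic loader can open it. A rough way to test that from inside the container is to try opening each library by name with Python's `ctypes`. This is only a sketch: the library names are copied from the warnings earlier in this thread and need adjusting for other TensorFlow versions, and the helper names `can_load` and `report` are hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name, else False."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# Names taken from the warnings quoted earlier in this thread (TensorFlow 2.3.0).
# Assumption: a different TensorFlow build may ask for different versions.
REQUIRED = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def report(names=REQUIRED):
    """Map each library name to whether it could be opened."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If a library shows up in `find` but is reported as MISSING here, its directory is probably not on the loader's search path (for example LD_LIBRARY_PATH).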

@gaborvecsei
You mentioned that only the wrong version is installed. Did you find a version that works, and can you share your solution?

Is there any other way to solve the problem?

@Gabriel-Gardin

Having the same issue

@TobiasMei

@gabrielgardin

I found that, for me, the Dockerfile with Ubuntu 18.04 and CUDA 11.0 works when using TensorFlow 2.4.

The command to build the Docker image looks like this:
python3 perfzero/lib/setup.py --dockerfile_path=docker/Dockerfile_ubuntu_1804_tf_cuda_11_0 --tensorflow_pip_spec=tensorflow==2.4
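
The underlying rule seems to be that the CUDA version in the image has to match the one the TensorFlow wheel was built against. A small Python sketch of that lookup follows. The table covers only the releases mentioned in this thread, its values are an assumption based on TensorFlow's published tested build configurations, and the function name `expected_cuda` is hypothetical.

```python
# CUDA toolkit version each TensorFlow release in this thread expects.
# Assumption: values follow TensorFlow's tested build configurations;
# releases not discussed in this issue are deliberately left out.
TF_TO_CUDA = {
    "2.1": "10.1",
    "2.3": "10.1",
    "2.4": "11.0",
}


def expected_cuda(tf_version):
    """Map a version such as '2.3.0' to its expected CUDA version, or None if unknown."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TF_TO_CUDA.get(major_minor)


for version in ("2.1.0", "2.3.0", "2.4"):
    print(version, "->", expected_cuda(version))
```

Checking the value against the CUDA directory present in the image (for example /usr/local/cuda-10.0 in the first report) shows whether the Dockerfile and the --tensorflow_pip_spec argument agree.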

lindong28 removed their assignment May 19, 2021
4 participants