OSError: CUDA_HOME environment variable not set when python setup.py in Dockerfile #95

Open
stevezkw1998 opened this issue May 24, 2023 · 5 comments

@stevezkw1998

My Dockerfile

FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

RUN apt-get update && apt-get install -y git gcc build-essential

RUN mkdir /app
WORKDIR /app

# Install Pytorch Correlation
RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
RUN cd Pytorch-Correlation-extension && python setup.py install
RUN cd -

EXPOSE 5252

CMD ["python", "app.py"]

The build then fails with this error:
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Full error log:

 => ERROR [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install                                                                      2.2s
------
 > [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install:
#0 1.843 Traceback (most recent call last):
#0 1.843   File "/app/Pytorch-Correlation-extension/setup.py", line 57, in <module>
#0 1.843     launch_setup()
#0 1.844   File "/app/Pytorch-Correlation-extension/setup.py", line 36, in launch_setup
#0 1.844     Extension('spatial_correlation_sampler_backend',
#0 1.844   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
#0 1.844     library_dirs += library_paths(cuda=True)
#0 1.844   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
#0 1.845     if (not os.path.exists(_join_cuda_home(lib_dir)) and
#0 1.845   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
#0 1.845     raise EnvironmentError('CUDA_HOME environment variable is not set. '
#0 1.845 OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
------
Dockerfile:33
--------------------
  31 |     # Install Pytorch Correlation
  32 |     RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
  33 | >>> RUN cd Pytorch-Correlation-extension && python setup.py install
  34 |     RUN cd -
  35 |
--------------------
ERROR: failed to solve: process "/bin/sh -c cd Pytorch-Correlation-extension && python setup.py install" did not complete successfully: exit code: 1
@ClementPinard
Owner

Hi, it looks to me like you need to use the devel image rather than the runtime one, since you need to be able to compile against torch and CUDA. So I would try changing the Docker image name from pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime to pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel.
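
A minimal sketch of that change, assuming the rest of the Dockerfile stays as posted above. Only the base image line differs. This addresses the CUDA_HOME lookup only; the extension build can still fail for other reasons.

```dockerfile
# Devel variant of the same image: it includes the CUDA toolkit (nvcc, headers)
# under /usr/local/cuda, which the runtime variant does not ship.
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
```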

@stevezkw1998
Author

Hi @ClementPinard, thank you for your advice.
After I changed the Docker image name from pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime to pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel,
the former issue is fixed, but I now have a new one:

 => ERROR [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install                                                                                                      15.9s 
------
 > [13/14] RUN cd Pytorch-Correlation-extension && python setup.py install:
#0 1.665 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#0 1.689 running install
#0 1.689 /opt/conda/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
#0 1.689   warnings.warn(
#0 1.752 /opt/conda/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
#0 1.752   warnings.warn(
#0 1.818 running bdist_egg
#0 1.830 running egg_info
#0 1.830 creating Correlation_Module/spatial_correlation_sampler.egg-info
#0 1.835 writing Correlation_Module/spatial_correlation_sampler.egg-info/PKG-INFO
#0 1.836 writing dependency_links to Correlation_Module/spatial_correlation_sampler.egg-info/dependency_links.txt
#0 1.836 writing requirements to Correlation_Module/spatial_correlation_sampler.egg-info/requires.txt
#0 1.836 writing top-level names to Correlation_Module/spatial_correlation_sampler.egg-info/top_level.txt
#0 1.836 writing manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.842 /opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
#0 1.842   warnings.warn(msg.format('we could not find ninja.'))
#0 1.846 reading manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.847 adding license file 'LICENSE'
#0 1.847 writing manifest file 'Correlation_Module/spatial_correlation_sampler.egg-info/SOURCES.txt'
#0 1.848 installing library code to build/bdist.linux-x86_64/egg
#0 1.848 running install_lib
#0 1.848 running build_py
#0 1.849 creating build
#0 1.849 creating build/lib.linux-x86_64-cpython-310
#0 1.849 creating build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.849 copying Correlation_Module/spatial_correlation_sampler/spatial_correlation_sampler.py -> build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.850 copying Correlation_Module/spatial_correlation_sampler/__init__.py -> build/lib.linux-x86_64-cpython-310/spatial_correlation_sampler
#0 1.850 running build_ext
#0 1.868 building 'spatial_correlation_sampler_backend' extension
#0 1.868 creating build/temp.linux-x86_64-cpython-310
#0 1.868 creating build/temp.linux-x86_64-cpython-310/Correlation_Module
#0 1.869 gcc -pthread -B /opt/conda/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -DUSE_CUDA -I/opt/conda/lib/python3.10/site-packages/torch/include -I/opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.10 -c Correlation_Module/correlation.cpp -o build/temp.linux-x86_64-cpython-310/Correlation_Module/correlation.o -std=c++14 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=spatial_correlation_sampler_backend -D_GLIBCXX_USE_CXX11_ABI=0
#0 15.65 Traceback (most recent call last):
#0 15.65   File "/app/Pytorch-Correlation-extension/setup.py", line 69, in <module>
#0 15.65     launch_setup()
#0 15.65   File "/app/Pytorch-Correlation-extension/setup.py", line 37, in launch_setup
#0 15.65     setup(
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
#0 15.65     return distutils.core.setup(**attrs)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
#0 15.65     return run_commands(dist)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
#0 15.65     dist.run_commands()
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
#0 15.65     self.run_command(cmd)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 1208, in run_command
#0 15.65     super().run_command(command)
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
#0 15.65     cmd_obj.run()
#0 15.65   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/install.py", line 74, in run
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
#0 15.66     _build_ext.build_extension(self, ext)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
#0 15.66     objects = self.compiler.compile(
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/setuptools/_distutils/ccompiler.py", line 599, in compile
#0 15.66     self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 581, in unix_wrap_single_compile
#0 15.66     cflags = unix_cuda_flags(cflags)
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 548, in unix_cuda_flags
#0 15.66     cflags + _get_cuda_arch_flags(cflags))
#0 15.66   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1773, in _get_cuda_arch_flags
#0 15.66     arch_list[-1] += '+PTX'
#0 15.66 IndexError: list index out of range
------
Dockerfile:33
--------------------
  31 |     # Install Pytorch Correlation
  32 |     RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
  33 | >>> RUN cd Pytorch-Correlation-extension && python setup.py install
  34 |     RUN cd -
  35 |
--------------------
ERROR: failed to solve: process "/bin/sh -c cd Pytorch-Correlation-extension && python setup.py install" did not complete successfully: exit code: 1
Docker build failed with error: Command 'docker build -t sam-track:1.0.0 ..' returned non-zero exit status 1.

@ClementPinard
Owner

See this related issue: #90

The GPU is not available during docker build, so you need to figure out your GPU's compute capability beforehand and set the TORCH_CUDA_ARCH_LIST environment variable accordingly.
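
As a rough sketch of what that could look like in the Dockerfile from this thread: the value 8.6 below is only an assumed example and has to be replaced with the compute capability of the actual target GPU. One way to look it up is to run `python -c "import torch; print(torch.cuda.get_device_capability())"` on a machine that has the GPU.

```dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git gcc build-essential

WORKDIR /app

# No GPU is visible during `docker build`, so the target architecture cannot be
# detected automatically. Declare it explicitly before compiling.
# ASSUMPTION: 8.6 is a placeholder; use the compute capability of your own GPU.
ENV TORCH_CUDA_ARCH_LIST="8.6"

# Install Pytorch Correlation
RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git \
 && cd Pytorch-Correlation-extension \
 && python setup.py install
```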

@stevezkw1998
Author

Hi @ClementPinard, thank you for your solution.
However, I may need to deploy my Docker image to different computers.
Is there a general way to handle the TORCH_CUDA_ARCH_LIST environment variable in that case?

@ClementPinard
Owner

If you don't know what the CUDA compute capabilities of your machine's GPU will be, your best bet is to compile for as many architectures as possible, or to wait until the container is launched to compile the library. Compiled code cannot be generic.
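
Both options could be sketched roughly as follows. The architecture list is an assumption, not a recommendation: it should match the GPUs you expect and what the CUDA 11.7 toolkit in the image can target. The `+PTX` suffix on the last entry also embeds PTX, which the driver can JIT-compile on newer GPUs. The commented-out CMD is a hypothetical way to defer the build to container start, and it assumes the container is started with GPU access.

```dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git gcc build-essential

WORKDIR /app
RUN git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git

# Option 1: build once for several architectures.
# ASSUMPTION: this list is only an example; adjust it to your target GPUs.
ENV TORCH_CUDA_ARCH_LIST="6.1 7.0 7.5 8.0 8.6+PTX"
RUN cd Pytorch-Correlation-extension && python setup.py install

EXPOSE 5252

CMD ["python", "app.py"]

# Option 2 (instead of the ENV and RUN lines of option 1): compile when the
# container starts, where the GPU is visible and the architecture can be detected.
# CMD ["sh", "-c", "cd /app/Pytorch-Correlation-extension && python setup.py install && cd /app && python app.py"]
```

Option 1 makes the image build slower and the compiled library larger with every added architecture. Option 2 keeps the image generic, but every container start pays the compilation time and the image must stay on the devel base.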
