DO NOT MERGE YET Replacing nightly CUDA11.0 builds with 11.1 #47938

Closed
wants to merge 7 commits into from

Conversation

janeyx99
Contributor

@janeyx99 janeyx99 commented Nov 13, 2020

Based on #43366

Testing the CUDA 11.1 build with the split torch_cuda library (#49050). Previously, linking failed because the binary was too large.

Using builder PR #627 to integrate the changes.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@janeyx99 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci

dr-ci bot commented Nov 13, 2020

💊 CI failures summary and remediations

As of commit bb1c18e (more details on the Dr. CI page):



🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build binary_linux_libtorch_3_7m_cu111_gcc5_4_cxx11-abi_nightly_static-with-deps_build (1/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):

-- Adding OpenMP CXX_FLAGS: -fopenmp
-- Will link against OpenMP libraries: /usr/lib/gcc/x86_64-linux-gnu/5/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a  
-- Found cuDNN: v8.0.5  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):
  Cannot specify link libraries for target "caffe2::cudnn" which is not built
  by this project.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1148 (include)
  CMakeLists.txt:556 (include)


-- Configuring incomplete, errors occurred!
See also "/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/pytorch/build/CMakeFiles/CMakeError.log".

See CircleCI build binary_linux_libtorch_3_7m_cu111_gcc5_4_cxx11-abi_nightly_static-without-deps_build (2/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):

-- Adding OpenMP CXX_FLAGS: -fopenmp
-- Will link against OpenMP libraries: /usr/lib/gcc/x86_64-linux-gnu/5/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a  
-- Found cuDNN: v8.0.5  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):
  Cannot specify link libraries for target "caffe2::cudnn" which is not built
  by this project.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1148 (include)
  CMakeLists.txt:556 (include)


-- Configuring incomplete, errors occurred!
See also "/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/pytorch/build/CMakeFiles/CMakeError.log".

See CircleCI build binary_linux_libtorch_3_7m_cu111_gcc5_4_cxx11-abi_nightly_shared-with-deps_build (3/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):

-- Adding OpenMP CXX_FLAGS: -fopenmp
-- Will link against OpenMP libraries: /usr/lib/gcc/x86_64-linux-gnu/5/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a  
-- Found cuDNN: v8.0.5  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):
  Cannot specify link libraries for target "caffe2::cudnn" which is not built
  by this project.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1148 (include)
  CMakeLists.txt:556 (include)


-- Configuring incomplete, errors occurred!
See also "/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/pytorch/build/CMakeFiles/CMakeError.log".

See CircleCI build binary_linux_libtorch_3_7m_cu111_gcc5_4_cxx11-abi_nightly_shared-without-deps_build (4/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):

-- Adding OpenMP CXX_FLAGS: -fopenmp
-- Will link against OpenMP libraries: /usr/lib/gcc/x86_64-linux-gnu/5/libgomp.so;/usr/lib/x86_64-linux-gnu/libpthread.so
-- Found CUDA: /usr/local/cuda (found version "11.1") 
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a  
-- Found cuDNN: v8.0.5  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):
  Cannot specify link libraries for target "caffe2::cudnn" which is not built
  by this project.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1148 (include)
  CMakeLists.txt:556 (include)


-- Configuring incomplete, errors occurred!
See also "/pytorch/build/CMakeFiles/CMakeOutput.log".
See also "/pytorch/build/CMakeFiles/CMakeError.log".
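All four libtorch jobs above stop at the same configure-time error. As a rough sketch (not part of Dr. CI or the PyTorch tooling), the snippet below shows one way to pull `CMake Error at ...` blocks out of a saved build log so that identical failures can be compared across jobs. The function name `extract_cmake_errors` and the `SAMPLE_LOG` excerpt are illustrative assumptions; the parsing relies only on the layout visible in the logs above, where the message body is indented by two spaces and the `Call Stack` line is not.

```python
import re

# Excerpt copied from the log above; used only to demonstrate the parser.
SAMPLE_LOG = """\
-- Found cuDNN: v8.0.5  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):
  Cannot specify link libraries for target "caffe2::cudnn" which is not built
  by this project.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1148 (include)
  CMakeLists.txt:556 (include)
"""

# Matches the header line of a CMake error, e.g.
# "CMake Error at cmake/public/cuda.cmake:279 (target_link_libraries):"
ERROR_HEADER = re.compile(r"CMake Error at (\S+) \((\w+)\):")


def extract_cmake_errors(log_text):
    """Return (location, message) tuples for each CMake error block with a body."""
    errors = []
    lines = log_text.splitlines()
    i = 0
    while i < len(lines):
        match = ERROR_HEADER.match(lines[i])
        if not match:
            i += 1
            continue
        location = match.group(1)
        i += 1
        message_lines = []
        # The message body is indented; stop at the first non-indented, non-blank line.
        while i < len(lines) and (lines[i].startswith("  ") or not lines[i].strip()):
            if lines[i].strip():
                message_lines.append(lines[i].strip())
            i += 1
        # Skip headers that have no indented body (e.g. a summary line quoted alone).
        if message_lines:
            errors.append((location, " ".join(message_lines)))
    return errors


for location, message in extract_cmake_errors(SAMPLE_LOG):
    print(f"{location}: {message}")
```

Running the same function over each job's log and comparing the results would confirm whether the four failures share one root cause, as the excerpts here suggest.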

5 failures not recognized by patterns:

Job Step Action
CircleCI binary_windows_libtorch_3_7_cu111_release_nightly_test Test 🔁 rerun
CircleCI binary_linux_conda_3_9_cu111_devtoolset7_nightly_test Run in docker 🔁 rerun
CircleCI binary_linux_conda_3_7_cu111_devtoolset7_nightly_test Run in docker 🔁 rerun
CircleCI binary_linux_conda_3_8_cu111_devtoolset7_nightly_test Run in docker 🔁 rerun
CircleCI binary_linux_conda_3_6_cu111_devtoolset7_nightly_test Run in docker 🔁 rerun
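The job names in this table encode the platform, package type, Python version, and CUDA version. The sketch below splits such a name into fields so that failures can be grouped, for example to see that four of the five unrecognized failures are Linux conda test jobs. The naming pattern is inferred only from the job names that appear in this thread, not from the CircleCI configuration, and `parse_job_name` and `JOB_NAME` are hypothetical helpers.

```python
import re
from collections import Counter

# Pattern inferred from names such as
# "binary_linux_conda_3_9_cu111_devtoolset7_nightly_test" (an assumption,
# not taken from the actual CI configuration).
JOB_NAME = re.compile(
    r"^binary_(?P<os>[a-z]+)_(?P<package>[a-z]+)_"
    r"(?P<python>\d+_\d+m?)_(?P<cuda>cu\d+)_(?P<rest>.+)$"
)


def parse_job_name(name):
    """Split a binary job name into its fields, or return None if it does not match."""
    match = JOB_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # "3_9" in the job name stands for Python 3.9.
    fields["python"] = fields["python"].replace("_", ".")
    return fields


# The five jobs listed in the table above.
FAILED_JOBS = [
    "binary_windows_libtorch_3_7_cu111_release_nightly_test",
    "binary_linux_conda_3_9_cu111_devtoolset7_nightly_test",
    "binary_linux_conda_3_7_cu111_devtoolset7_nightly_test",
    "binary_linux_conda_3_8_cu111_devtoolset7_nightly_test",
    "binary_linux_conda_3_6_cu111_devtoolset7_nightly_test",
]

# Count failures per (os, package) pair.
by_platform = Counter(
    (job["os"], job["package"])
    for job in map(parse_job_name, FAILED_JOBS)
    if job is not None
)
print(by_platform)
```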

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build binary_windows_libtorch_3_7_cu111_debug_nightly_build (1/1)

Step: "Persisting to workspace" (full log | diagnosis details | 🔁 rerun) ❄️

Error archiving workspace files: Error archiving files to tarball C:\Users\circleci\AppData\Local\Temp\workspace-layer-365ff348-af85-4679-b593-f8e215809d88531392954 : stdout: No space left on device

gzip: /c/Program Files/Git/usr/bin/tar: C:\Users\circleci\AppData\Local\Temp\workspace-layer-365ff348-af85-4679-b593-f8e215809d88531392954: Cannot write: Broken pipe /c/Program Files/Git/usr/bin/tar: Child returned status 1 /c/Program Files/Git/usr/bin/tar: Error is not recoverable: exiting now : exit status 2

Creating workspace archive...


Error archiving workspace files: Error archiving files to tarball C:\Users\circleci\AppData\Local\Temp\workspace-layer-365ff348-af85-4679-b593-f8e215809d88531392954 : stdout: No space left on device
 
 gzip: /c/Program Files/Git/usr/bin/tar: C\:\\Users\\circleci\\AppData\\Local\\Temp\\workspace-layer-365ff348-af85-4679-b593-f8e215809d88531392954: Cannot write: Broken pipe /c/Program Files/Git/usr/bin/tar: Child returned status 1 /c/Program Files/Git/usr/bin/tar: Error is not recoverable: exiting now : exit status 2



Contributor

@facebook-github-bot facebook-github-bot left a comment

@janeyx99 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@walterddr walterddr left a comment

Could you please convert this to a draft if it is intended for testing purposes?

Review thread on CMakeLists.txt (outdated, resolved)
@janeyx99 janeyx99 marked this pull request as draft December 1, 2020 16:11
@janeyx99 janeyx99 force-pushed the nightly-11.1 branch 3 times, most recently from a7b7962 to 8ab61b0 on December 8, 2020 19:29
@peterjc123
Collaborator

peterjc123 commented Jan 28, 2021

As for the Windows jobs, the problems are listed below.

  1. "No space left on device". We will need to remove some large directories. Will do later.
  2. "CUDA driver initialization failed, you might not have a CUDA gpu". driver_update.bat in pytorch/builder needs to be updated.
  3. For conda jobs, you'll need to update bld.bat in pytorch/builder.

@janeyx99
Contributor Author

janeyx99 commented Jan 28, 2021

Hey @peterjc123 thanks so much for looking into the Windows side!

  1. "No space left on device". We will need to remove some large directories. Will do later.

Just curious; how and where is this normally done?

  2. "CUDA driver initialization failed, you might not have a CUDA gpu". driver_update.bat in pytorch/builder needs to be updated.

I thought #574 in builder (pytorch/builder#574) updated the driver here...do we need an even newer version?

  3. For conda jobs, you'll need to update bld.bat in pytorch/builder.

I just checked conda/pytorch-nightly/bld.bat and noticed it's been (partially?) updated for CUDA 11.1. I will look into this more but if you have any ideas on what needs to be updated, let me know!

@peterjc123
Collaborator

peterjc123 commented Jan 29, 2021

  2. "CUDA driver initialization failed, you might not have a CUDA gpu". driver_update.bat in pytorch/builder needs to be updated.

I thought #574 in builder (pytorch/builder#574) updated the driver here...do we need an even newer version?

It should be smoke_test.bat. See pytorch/builder#629 and pytorch/builder#631

  1. "No space left on device". We will need to remove some large directories. Will do later.

Just curious; how and where is this normally done?

We can log on to the CircleCI machine and use something like SpaceSniffer to find those directories.
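For machines where a GUI tool is not convenient, a scripted equivalent is possible. The following is a minimal sketch, assuming Python is available on the CI machine; it is not what the maintainers used, and `directory_sizes` and `largest_directories` are hypothetical helper names. It sums file sizes under each immediate subdirectory of a root and lists the largest ones first.

```python
import os


def directory_sizes(root):
    """Map each immediate subdirectory of root to the total bytes of files beneath it."""
    sizes = {}
    for entry in os.scandir(root):
        # Only descend into real directories, not symlinks to directories.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for filename in filenames:
                try:
                    # lstat does not follow symlinks, so link targets are not counted.
                    total += os.lstat(os.path.join(dirpath, filename)).st_size
                except OSError:
                    # Files may vanish or be unreadable on a busy CI machine; skip them.
                    pass
        sizes[entry.path] = total
    return sizes


def largest_directories(root, count=10):
    """Return up to `count` (path, bytes) pairs, largest first."""
    sizes = directory_sizes(root)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:count]


# Example use (the root path here is only an illustration):
#   for path, size in largest_directories("C:\\"):
#       print(f"{size / 2**30:8.2f} GiB  {path}")
```

Scanning a whole drive this way can take a while, so starting from a narrower root is usually more practical.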

@peterjc123
Collaborator

  1. "No space left on device". We will need to remove some large directories. Will do later.

Will be addressed in #51405.

@janeyx99
Contributor Author

janeyx99 commented Feb 9, 2021

This will not be used, since we now have CUDA 11.2.

@janeyx99 janeyx99 closed this Feb 9, 2021