
Drop Centos7 support #2010

Merged: 9 commits merged into NVIDIA:branch-24.06 on May 14, 2024

Conversation

@NvTimLiu (Collaborator) commented May 2, 2024

To fix: #1991

Drop CentOS 7 support; switch to building in a Rocky 8 Docker image

Update the script to support both amd64 and arm64 CPUs

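For illustration, a minimal sketch of the kind of arch detection such a script performs (the mapping and variable names are assumptions, not the actual contents of build/build-in-docker):

    #!/bin/bash
    set -e

    # Map the machine architecture to the names the build uses (assumed mapping).
    case "$(uname -m)" in
      x86_64)  arch="amd64" ;;
      aarch64) arch="arm64" ;;
      *) echo "Unsupported architecture: $(uname -m)" >&2; exit 1 ;;
    esac

    # On arm64, append an arm64 profile and disable GDS, mirroring the snippet
    # quoted in the review threads below.
    if [ "$arch" == "arm64" ]; then
      profiles="${profiles:+${profiles},}arm64"
      USE_GDS="OFF"
    fi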

Signed-off-by: Tim Liu <timl@nvidia.com>
Review threads: build/build-in-docker (×5, outdated, resolved), build/run-in-docker (outdated, resolved), ci/Dockerfile (resolved)
@sameerz added the build label May 4, 2024
Signed-off-by: Tim Liu <timl@nvidia.com>
Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author) commented May 6, 2024

build

Review threads: build/build-in-docker (×3, outdated, resolved), ci/Dockerfile (outdated, resolved)

One of the build/build-in-docker threads quotes this snippet:
# Set env for arm64 build
if [ "$arch" == "arm64" ]; then
    profiles="${profiles},arm64"
    USE_GDS="OFF"
Member commented:

Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?

@NvTimLiu (Collaborator, Author) replied May 6, 2024:

> Curious why we don't build GDS on arm64? It used to be separate but now is part of the CUDA toolkit. Is it not part of the arm64 CUDA toolkit?

Yes, the GDS cuFile RDMA library is not part of the arm64 CUDA toolkit:


[INFO] [exec] Could NOT find cuFile (missing: cuFile_LIBRARY cuFileRDMA_LIBRARY
[INFO] [exec] cuFile_INCLUDE_DIR)


Also, cufaultinj links against CUDA::cupti_static, which is not found in the arm64 CUDA toolkit:

[INFO] [exec] -- Generating done (0.0s)
[INFO] [exec] CMake Error at faultinj/CMakeLists.txt:34 (target_link_libraries):
[INFO] [exec] Target "cufaultinj" links to:
[INFO] [exec]
[INFO] [exec] CUDA::cupti_static
[INFO] [exec]
[INFO] [exec] but the target was not found. Possible reasons include:
[INFO] [exec]
[INFO] [exec] * There is a typo in the target name.
[INFO] [exec] * A find_package call is missing for an IMPORTED target.
[INFO] [exec] * An ALIAS target is missing.
[INFO] [exec]
[INFO] [exec]
[INFO] [exec]
[INFO] [exec] CMake Generate step failed. Build files cannot be regenerated correctly.
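
For context, a hedged sketch of how a build script could skip these components on arm64 (flag names like BUILD_FAULTINJ are assumptions, not necessarily what this PR does):

    # Hypothetical guards for components missing from the arm64 CUDA toolkit.
    if [ "$arch" == "arm64" ]; then
        USE_GDS="OFF"          # cuFile/cuFileRDMA libraries are absent on arm64
        BUILD_FAULTINJ="OFF"   # cupti_static is absent, so cufaultinj cannot link
    fi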


An rmm OOM issue is reported for the arm64 tests when USE_SANITIZER=OFF, as below, but I have no idea what the root cause is:

[ERROR] There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/nvidia/timl/spark-rapids-jni/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:254: Maximum pool size exceeded
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.ja

Member replied:

We need to fix the tests on arm64, not hack the build script to run the sanitizer to hide the problem. We will not be running the sanitizer in production, so letting this slip through by hacking the build script is not what we want.

Review thread: build/build-in-docker (outdated, resolved)
Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author) commented May 7, 2024

build

Review thread: build/build-in-docker (outdated, resolved)

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author) commented May 8, 2024

@jlowe I've filed an issue for the rmm OOM unit-test failure on arm64 when USE_SANITIZER=OFF: #2022

Should we leave it unfixed for now and follow up in that issue?
Could we merge this PR first to drop CentOS 7 support for the 24.06 release, since there are only 2-3 weeks remaining before the v24.06 release? Thanks!

@NvTimLiu (Collaborator, Author) commented May 8, 2024

build

Review threads: build/build-in-docker (×4, outdated, resolved)
@jlowe (Member) commented May 8, 2024

If tests are failing it's a blocker for this PR. Hacking the script to pass the test is not the right answer. The goal is to produce a good build on rocky8, but the failing tests indicate there may be issues with the build. The tests are there for a reason, to indicate whether the code is working properly. If the tests used to pass when building on centos7 but fail when building on rocky8 then there may be an issue with the code produced by the rocky8 build. We need to investigate the test failures and fix them.

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author) commented May 8, 2024

build

Review threads: build/build-in-docker (resolved), build/run-in-docker (outdated, resolved), ci/Dockerfile (resolved)
@jlowe marked this pull request as draft May 8, 2024 18:34
@jlowe (Member) commented May 8, 2024

Converted this to a draft PR, as we don't want this merged until we get to the bottom of the arm64 test failures after building on rocky8, tracked by #2022.

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author) commented:

Here's the relevant fix for issue #2022: rapidsai/cudf#15706

@NvTimLiu (Collaborator, Author) commented:

build

@NvTimLiu marked this pull request as ready for review May 13, 2024 05:13
@NvTimLiu merged commit f1e5dbb into NVIDIA:branch-24.06 May 14, 2024
3 checks passed
@ttnghia (Collaborator) commented May 15, 2024

Unfortunately this breaks my local build:

[INFO]      [exec] /usr/local/cmake-3.26.4-linux-x86_64/bin/cmake --regenerate-during-build -S/home/nghiat/Devel/jni/1/thirdparty/cudf/cpp -B/home/nghiat/Devel/jni/1/thirdparty/cudf/cpp/build
[INFO]      [exec] CMake Error at CMakeLists.txt:27 (project):
[INFO]      [exec]   The CMAKE_C_COMPILER:
[INFO]      [exec] 
[INFO]      [exec]     /opt/rh/devtoolset-11/root/usr/bin/cc
[INFO]      [exec] 
[INFO]      [exec]   is not a full path to an existing compiler tool.
[INFO]      [exec] 
[INFO]      [exec]   Tell CMake where to find the compiler by setting either the environment
[INFO]      [exec]   variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
[INFO]      [exec]   the compiler, or to the compiler name if it is in the PATH.
[INFO]      [exec] 
[INFO]      [exec] 
[INFO]      [exec] CMake Error at CMakeLists.txt:27 (project):
[INFO]      [exec]   The CMAKE_CXX_COMPILER:
[INFO]      [exec] 
[INFO]      [exec]     /opt/rh/devtoolset-11/root/usr/bin/c++
[INFO]      [exec] 
[INFO]      [exec]   is not a full path to an existing compiler tool.
[INFO]      [exec] 
[INFO]      [exec]   Tell CMake where to find the compiler by setting either the environment
[INFO]      [exec]   variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
[INFO]      [exec]   to the compiler, or to the compiler name if it is in the PATH.
[INFO]      [exec] 
[INFO]      [exec] 
[INFO]      [exec] ninja: error: rebuilding 'build.ninja': subcommand failed

@jlowe (Member) commented May 15, 2024

I suspect if you clean out thirdparty/cudf/cpp/build it will fix that issue. The compiler changed locations between the old and new image, so anything caching the compiler location (such as an existing cmake cache file) will break.
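
A minimal sketch of that cleanup (the path comes from the log above; the CC/CXX alternative follows the hint in the CMake error message):

    # Remove the stale build directory whose CMake cache still points at the
    # old centos7 devtoolset compiler path.
    rm -rf thirdparty/cudf/cpp/build

    # Alternatively, tell CMake where the compilers live in the new image.
    export CC="$(command -v gcc)"
    export CXX="$(command -v g++)"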

@ttnghia (Collaborator) commented May 15, 2024

Oh yes, that fixes it. Thanks.

Successfully merging this pull request may close these issues.

[BUG] Drop CentOS7 support in 24.06
4 participants