HYPRE CUDA examples only work in Debug builds #1051

Open
slapgas opened this issue Jan 15, 2024 · 3 comments

slapgas commented Jan 15, 2024

Issue description

  • HYPRE examples with CUDA only seem to work in Debug builds
  • Behavior is consistent and present in both autoconf and CMake builds
  • Behavior is consistent and present with both CUDA from the cuda-toolkit packages and CUDA from the nvhpc packages

Important note

Steps to reproduce the behavior

Setting CUDA_HOME to your CUDA installation home directory

I have tested this with both cuda-toolkit and nvhpc bundled CUDA versions and with both CUDA 11.8 and 12.3.

I am on Pop!OS 22.04. I have the latest nvhpc-cuda-multi package from the NVHPC repos, as well as the two latest CUDA versions from the CUDA repos.

I use:

export CUDA_HOME=/usr/local/cuda-12/

or

export CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/cuda/12.3/

Building HYPRE either via autoconf or via CMake

For autoconf Debug build:

./configure --with-cuda --enable-unified-memory --with-gpu-arch='86' --enable-gpu-aware-mpi --enable-debug

make -j 4

For CMake:

TARGET=$PWD

cmake -G Ninja \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_INSTALL_PREFIX=$TARGET \
-DHYPRE_WITH_CUDA=ON \
-DHYPRE_CUDA_SM='86' \
-DHYPRE_ENABLE_UNIFIED_MEMORY=ON \
-DHYPRE_WITH_GPU_AWARE_MPI=ON \
../src/

ninja
ninja install

For release builds I remove the --enable-debug option from the autoconf command and change Debug to Release in the CMake commands.

Building examples with make

make use_cuda=1

Actual behavior

mpirun -np 2 ./ex1

Debug build output

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[mtndew:298978] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[mtndew:298979] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
<C*b,b>: 1.800000e+01


Iters       ||r||_C     conv.rate  ||r||_C/||b||_C
-----    ------------    ---------  ------------
    1    2.509980e+00    0.591608    5.916080e-01
    2    9.888265e-01    0.393958    2.330686e-01
    3    4.572262e-01    0.462393    1.077693e-01
    4    1.706474e-01    0.373223    4.022197e-02
    5    7.473022e-02    0.437922    1.761408e-02
    6    3.402624e-02    0.455321    8.020061e-03
    7    1.214929e-02    0.357057    2.863616e-03
    8    3.533113e-03    0.290808    8.327628e-04
    9    1.343893e-03    0.380371    3.167586e-04
   10    2.968745e-04    0.220906    6.997400e-05
   11    5.329671e-05    0.179526    1.256215e-05
   12    7.308483e-06    0.137128    1.722626e-06
   13    7.411552e-07    0.101410    1.746920e-07

I don't know why I get the warnings; however, the results are consistent with what is discussed in issue #845.

Release build output

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[mtndew:298978] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[mtndew:298979] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
<C*b,b>: 0.000000e+00

Expected behavior

Both Debug and Release builds should yield the same results.
As it stands, the Release build does not appear to do anything.

EDIT #1: Fixed hyperlinks

@ruohai0925
Contributor

Same issue here. Just wondering whether there are any more suggestions about this issue.

@liruipeng
Contributor

liruipeng commented Mar 21, 2024

Thank you for reporting this issue. Debug mode works and Release mode doesn't because the examples rely on unified memory to transfer data to the GPU, and Debug mode implicitly forces device synchronization. In principle, we should use device memory, where this wouldn't be an issue, but the goal of the examples is to show the basics of using hypre, so we keep the GPU code as simple as possible. For your own code, you can still follow the example code, but the memory should be allocated and populated on the device; otherwise, add adequate explicit device synchronization.

@liruipeng
Contributor

You can also set CUDA_LAUNCH_BLOCKING=1 to get correct results from the examples.
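
For anyone trying this workaround, here is a minimal sketch of how the variable might be applied to a single run without exporting it into the whole shell session. The helper name run_blocking is hypothetical and not part of hypre; the only thing taken from this thread is the CUDA_LAUNCH_BLOCKING variable and the mpirun command from the report.

```shell
# Hypothetical helper (not part of hypre): run one command with
# CUDA_LAUNCH_BLOCKING=1 in its environment, so that CUDA kernel
# launches block the host until they complete.
run_blocking() {
    env CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Usage, mirroring the command from the report:
#   run_blocking mpirun -np 2 ./ex1
```

This only changes the environment of the launched command, so later runs in the same shell are unaffected. If ranks are started on other nodes, the variable may also need to be forwarded by the launcher (with Open MPI, for example, via mpirun's -x option); that is an assumption about the setup, not something stated in this thread. Blocking launches serialize GPU work, so this is a diagnostic aid rather than a setting to keep for production runs.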
