HYPRE CUDA examples only work in Debug builds #1051

slapgas · 2024-01-15T20:11:57Z

Issue description

HYPRE examples with CUDA only seem to work in Debug builds.
The behavior is consistent across both autoconf and CMake builds.
The behavior is also consistent with CUDA from both the cuda-toolkit packages and the nvhpc packages.

Important note

Building examples via CMake is broken, as discussed in issue #845 (CUDA Error 700, Illegal Memory Access, for a trivial example using Struct interface). As far as I can test, this hasn't been fixed.
The examples need to be built separately via Make.

Steps to reproduce the behavior

Setting CUDA_HOME to your CUDA installation home directory

I have tested this with both cuda-toolkit and nvhpc bundled CUDA versions and with both CUDA 11.8 and 12.3.

I am on Pop!_OS 22.04. I have the latest nvhpc-cuda-multi package from the NVHPC repos, as well as the two latest CUDA versions from the CUDA repos.

I use:

export CUDA_HOME=/usr/local/cuda-12/

or

export CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/cuda/12.3/

Building HYPRE either via autoconf or via CMake

For autoconf Debug build:

./configure --with-cuda --enable-unified-memory --with-gpu-arch='86' --enable-gpu-aware-mpi --enable-debug

make -j 4

For CMake:

TARGET=$PWD

cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_INSTALL_PREFIX=$TARGET \
  -DHYPRE_WITH_CUDA=ON \
  -DHYPRE_CUDA_SM='86' \
  -DHYPRE_ENABLE_UNIFIED_MEMORY=ON \
  -DHYPRE_WITH_GPU_AWARE_MPI=ON \
  ../src/

ninja
ninja install

For release builds I remove the --enable-debug option from the autoconf command and change Debug to Release in the CMake commands.

Building examples with make

make use_cuda=1

Actual behavior

mpirun -np 2 ./ex1

Debug build output

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[mtndew:298978] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[mtndew:298979] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
<C*b,b>: 1.800000e+01


Iters       ||r||_C     conv.rate  ||r||_C/||b||_C
-----    ------------    ---------  ------------
    1    2.509980e+00    0.591608    5.916080e-01
    2    9.888265e-01    0.393958    2.330686e-01
    3    4.572262e-01    0.462393    1.077693e-01
    4    1.706474e-01    0.373223    4.022197e-02
    5    7.473022e-02    0.437922    1.761408e-02
    6    3.402624e-02    0.455321    8.020061e-03
    7    1.214929e-02    0.357057    2.863616e-03
    8    3.533113e-03    0.290808    8.327628e-04
    9    1.343893e-03    0.380371    3.167586e-04
   10    2.968745e-04    0.220906    6.997400e-05
   11    5.329671e-05    0.179526    1.256215e-05
   12    7.308483e-06    0.137128    1.722626e-06
   13    7.411552e-07    0.101410    1.746920e-07

I don't know why I get the warnings; however, the results are consistent with what is discussed in issue #845.

Release build output

[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[mtndew:298978] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[mtndew:298979] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
<C*b,b>: 0.000000e+00

Expected behavior

Both Debug and Release builds should yield the same results.
As it stands, the Release build does not appear to do anything: <C*b,b> is 0 and no iterations are printed.

EDIT #1: Fixed hyperlinks


ruohai0925 · 2024-03-21T15:51:04Z

Same issue here. Just wondering whether there are any further suggestions for this issue.

liruipeng · 2024-03-21T19:38:29Z

Thank you for reporting this issue. The reason why the debug mode works but the release mode doesn't is that we rely on unified memory to transfer data to GPUs in the examples. The debug mode implicitly forces device synchronization. In principle, we should use device memory, where this wouldn't be an issue, but the goal of the examples is to show the basics of using hypre, so we keep the GPU code as simple as possible. For your own code, you can still follow the example code, but the memory should be allocated and populated on the device, or you should add adequate explicit device synchronization.

liruipeng · 2024-03-21T19:40:41Z

You can also set CUDA_LAUNCH_BLOCKING=1 to get correct results from the examples.
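
A minimal sketch of turning on CUDA_LAUNCH_BLOCKING for the ex1 run from the report. It assumes Open MPI (the ompi paths in the log above suggest it) and that ./ex1 is in the current directory; the guard around mpirun is only there so the snippet does not fail where either is missing.

```shell
# Make every CUDA kernel launch block the host until the kernel completes.
export CUDA_LAUNCH_BLOCKING=1

# -x is Open MPI's option for exporting an environment variable to all ranks.
# Assumption: Open MPI's mpirun and the ex1 binary from the report are available.
if command -v mpirun >/dev/null 2>&1 && [ -x ./ex1 ]; then
    mpirun -x CUDA_LAUNCH_BLOCKING -np 2 ./ex1
else
    echo "mpirun or ./ex1 not found; CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
fi
```

Because this setting makes all kernel launches synchronous, it is best treated as a debugging aid rather than something to leave enabled for performance runs.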
