Code stalling in HYPRE_ADSSolve in v2.29.0 with CUDA 11.6.0 #981

Open
v-dobrev opened this issue Oct 7, 2023 · 6 comments

@v-dobrev

v-dobrev commented Oct 7, 2023

While testing MFEM 4.6 (built with CUDA) with hypre 2.29.0 (built with CUDA) using CUDA 11.6.0 on Lassen, I noticed that a few of the MFEM examples stall. This seems to happen inside calls to HYPRE_ADSSolve. If I use either hypre 2.28.0 or an older CUDA (10.1.243), there are no issues.

Digging a little deeper with Totalview, it looks like the issue happens in the function hypre_CSRMatrixTriLowerUpperSolveCusparse, specifically in this call:

HYPRE_CUSPARSE_CALL( cusparseSpSV_analysis(handle, operation,
                                           &alpha, matA, vecF, vecU, data_type,
                                           CUSPARSE_SPSV_ALG_DEFAULT,
                                           hypre_CsrsvDataInfoL(csrsv_data),
                                           hypre_CsrsvDataBufferL(csrsv_data)) );

Steps to reproduce the issue:

  • On Lassen load the modules cuda/11.6.0 and gcc/7.3.1.
  • Build hypre 2.29.0 with CUDA 11.6.0. I'm testing with GCC 7.3.1 and the default MPI on Lassen, however I'm not sure if that is important.
  • Download and build METIS next to the hypre directory: see the METIS section here: https://mfem.org/building/#parallel-mpi-version-of-mfem (METIS download link: https://github.com/mfem/tpls/raw/gh-pages/metis-4.0.3.tar.gz).
  • Clone MFEM next to the hypre directory from https://github.com/mfem/mfem.git (or git@github.com:mfem/mfem.git)
  • Check out the tag v4.6 of MFEM (older versions may have the same issue, but I have not checked).
  • Build MFEM with make pcudebug CUDA_ARCH=sm_70 -j 40
  • In the examples directory, build ex4p: make ex4p.
  • Run the following example: lrun -n 4 ./ex4p -no-vis -m ../data/fichera.mesh -- this should stall indefinitely.
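The MFEM-side steps above could be collected into a script along these lines. This is only a sketch: it assumes hypre 2.29.0 and METIS are already built next to the `mfem` directory, that `module` and `lrun` are available in the script's shell, and that coreutils `timeout` is installed. The `run` helper, the `DRY_RUN` switch, and the 600-second `STALL_LIMIT` are illustrative choices, not part of the original report. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps; helper names and limits are hypothetical.
DRY_RUN="${DRY_RUN:-1}"            # set DRY_RUN=0 to actually execute
STALL_LIMIT="${STALL_LIMIT:-600}"  # seconds before the run is treated as stalled

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

repro() {
  run module load cuda/11.6.0 gcc/7.3.1 &&
  run git clone https://github.com/mfem/mfem.git &&
  run git -C mfem checkout v4.6 &&
  run make -C mfem pcudebug CUDA_ARCH=sm_70 -j 40 &&
  run make -C mfem/examples ex4p &&
  run cd mfem/examples || return 1

  # coreutils `timeout` exits with status 124 when the time limit is hit.
  run timeout "$STALL_LIMIT" lrun -n 4 ./ex4p -no-vis -m ../data/fichera.mesh
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "ex4p still running after ${STALL_LIMIT}s: likely the stall described above"
  fi
  return "$status"
}

repro
```

Wrapping the run in `timeout` turns an indefinite hang into a distinguishable exit status, which is convenient when comparing CUDA or hypre versions in a batch job. Whether `timeout` cleanly terminates every task launched by `lrun` is not something the thread confirms.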
@liruipeng
Contributor

Can you try CUDA 12 and see if this issue persists? We've worked with the CUDA team on this function across several versions and for various issues. Thanks @v-dobrev

@v-dobrev
Author

v-dobrev commented Oct 9, 2023

@liruipeng, CUDA 12.0.0 seems to work fine.

@victorapm
Contributor

Can we close this issue? Thanks!

@v-dobrev
Author

v-dobrev commented Dec 11, 2023

> Can we close this issue? Thanks!

It is up to you guys -- if you want to fix the issue with CUDA 11.6.0, or recommend to users to not use that version. I don't know if other CUDA and hypre versions are affected.

@victorapm
Contributor

> if you want to fix the issue with CUDA 11.6.0, or recommend to users to not use that version

@liruipeng What do you recommend? From this thread, it seems we should go with the second option, correct?

@liruipeng
Contributor

There are always bugs/issues in different versions of TPLs. We can't do "fixes" in hypre. The users just need to try a different version that fixes it.
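For users who want to guard a build script against the combination reported here, a small lookup of the outcomes from this thread might look like the sketch below. The function names are hypothetical, and the table covers only what was reported above for hypre 2.29.0 on Lassen (CUDA 11.6.0 stalls; 10.1.243 and 12.0.0 work); every other version is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Outcomes reported in this thread for hypre 2.29.0; function names are
# hypothetical. Versions not mentioned in the thread map to "unknown".
cuda_status_for_hypre_2_29() {
  case "$1" in
    11.6.0)   echo "stalls" ;;
    10.1.243) echo "ok" ;;
    12.0.0)   echo "ok" ;;
    *)        echo "unknown" ;;
  esac
}

# Return non-zero for the CUDA version reported to stall.
check_cuda() {
  status="$(cuda_status_for_hypre_2_29 "$1")"
  if [ "$status" = "stalls" ]; then
    echo "CUDA $1: reported to stall in HYPRE_ADSSolve with hypre 2.29.0" >&2
    return 1
  fi
  echo "CUDA $1: $status"
}
```

A build script could call `check_cuda 11.6.0 || exit 1` with the version of the loaded CUDA module before configuring hypre.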
