Validation failures in OpenACC variant with GCC and NVHPC #153

Open
jhdavis8 opened this issue Jun 21, 2023 · 4 comments

@jhdavis8

jhdavis8 commented Jun 21, 2023

I'm seeing validation failures in BabelStream's OpenACC version on the main branch that depend on the number of iterations. Specifically, validation fails whenever the number of iterations is less than 723:

$ acc-stream -n 722
BabelStream
Version: 4.0
Implementation: OpenACC
Running kernels 722 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Validation failed on c[]. Average error 2.3104e-14
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        797592.848  0.00067     0.00069     0.00068     
Mul         792595.514  0.00068     0.00068     0.00068     
Add         831047.225  0.00097     0.00098     0.00097     
Triad       831176.744  0.00097     0.00098     0.00097     
Dot         719506.962  0.00075     0.00077     0.00075

compared to

$ acc-stream -n 723
BabelStream
Version: 4.0
Implementation: OpenACC
Running kernels 723 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        796974.794  0.00067     0.00069     0.00067     
Mul         791823.981  0.00068     0.00068     0.00068     
Add         830542.399  0.00097     0.00098     0.00097     
Triad       830553.534  0.00097     0.00098     0.00097     
Dot         719081.005  0.00075     0.00077     0.00075

The average error grows as the number of iterations decreases. Exactly the same behavior appears in all of the following test environments:

  • OLCF Summit system, compiled with NVHPC 21.3 to target NVIDIA V100 GPUs
  • OLCF Summit system, compiled with GCC 12.1.0 to target NVIDIA V100 GPUs
  • NERSC Perlmutter system, compiled with NVHPC 22.7 to target NVIDIA A100 GPUs
  • NERSC Perlmutter system, compiled with GCC 11.2.0 to target NVIDIA A100 GPUs
  • Personal laptop, compiled with NVHPC 23.5 to target an NVIDIA GeForce RTX 3060 Mobile GPU
  • Personal laptop, compiled with GCC 12.1.0 to target an NVIDIA GeForce RTX 3060 Mobile GPU
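
To put the 722/723 threshold in context, here is a minimal sketch of how the expected ("gold") values could be replayed on the host. It assumes BabelStream's usual defaults, none of which are stated in this thread: start values a = 0.1, b = 0.2, c = 0.0, scalar = 0.4, a Copy, Mul, Add, Triad sequence per iteration, and an absolute tolerance of 100 × machine epsilon on the average error. The names `Gold`, `gold_after`, `tolerance` and `report` are hypothetical.

```cpp
#include <cstdio>
#include <limits>

// Assumed BabelStream defaults; not taken from this thread.
constexpr double kStartA = 0.1, kStartB = 0.2, kStartC = 0.0, kScalar = 0.4;

struct Gold { double a, b, c; };

// Replay the four array kernels on scalars to get the expected values.
Gold gold_after(unsigned iterations)
{
  Gold g{kStartA, kStartB, kStartC};
  for (unsigned i = 0; i < iterations; i++) {
    g.c = g.a;                  // Copy
    g.b = kScalar * g.c;        // Mul
    g.c = g.a + g.b;            // Add
    g.a = g.b + kScalar * g.c;  // Triad
  }
  return g;
}

// Assumed absolute tolerance on the average error: 100 * machine epsilon.
double tolerance()
{
  return std::numeric_limits<double>::epsilon() * 100.0;
}

// Print the expected c next to the tolerance for a given iteration count.
void report(unsigned iterations)
{
  const Gold g = gold_after(iterations);
  std::printf("n=%u  gold c=%.4e  tolerance=%.4e\n",
              iterations, g.c, tolerance());
}
```

Under these assumptions each iteration scales the values by 0.96, so `report(722)` gives a gold c of roughly 2.31e-14, just above the tolerance of roughly 2.22e-14, and `report(723)` gives roughly 2.22e-14, just below it. The gold c at 722 is close to the reported average error, which would be consistent with the host copy of c[] holding zeros; that is an observation from the numbers only, not a confirmed diagnosis.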

Tom suggested some possible causes: synchronisations being skipped somewhere (probably around the memory transfers), some bad type punning, or something funny happening with the pointer captures (they're pulled out to local variables because all OpenACC compilers failed to work otherwise).
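
For readers unfamiliar with the pointer-capture pattern mentioned above, the following is a minimal sketch of the idea, not the BabelStream source. The `Stream` class and its members are hypothetical; the point is only that the directive names plain local pointers rather than members reached through `this`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; not BabelStream's ACCStream.
template <typename T>
struct Stream
{
  std::vector<T> a, c;

  explicit Stream(std::size_t n) : a(n, T(0.1)), c(n, T(0)) {}

  void copy()
  {
    // Pull the size and pointers into locals so the data clauses refer to
    // plain variables instead of class members.
    const std::size_t n = a.size();
    const T* a_ptr = a.data();
    T* c_ptr = c.data();

    #pragma acc parallel loop copyin(a_ptr[0:n]) copyout(c_ptr[0:n])
    for (std::size_t i = 0; i < n; i++)
      c_ptr[i] = a_ptr[i];
  }
};
```

Built without OpenACC support, the pragma is ignored and the loop runs on the host, which makes it easy to compare results between the two builds.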

@tomdeakin
Contributor

One more thought: the wording of the wait clause is pretty weird in OpenACC 2.6, so I wonder if this line is missing the wait clause as we copy back to the host.
Does adding the clause fix anything?

Note: if it does, this will be strange, as all the other kernels have the wait clause, so I would have expected all kernels to have finished before the copy back starts...
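
As a rough illustration of the suggestion, this sketch puts a `wait` clause on both a kernel and the copy back to the host. It is not the BabelStream code: the function `copy_and_read_back` and its arguments are hypothetical, and the data directives are only one plausible arrangement.

```cpp
#include <cstddef>

// Hypothetical helper: run one Copy kernel on the device, then bring the
// result back to the host with an explicit wait on the update.
void copy_and_read_back(const double* a, double* c, std::size_t n)
{
  #pragma acc enter data copyin(a[0:n]) create(c[0:n])

  // 'wait' with no argument: do not start until all async queues are done.
  #pragma acc parallel loop present(a[0:n], c[0:n]) wait
  for (std::size_t i = 0; i < n; i++)
    c[i] = a[i];

  // Copy back to the host, again waiting on outstanding async work first.
  #pragma acc update host(c[0:n]) wait

  #pragma acc exit data delete(a[0:n], c[0:n])
}
```

If no directive in the program uses `async`, the extra `wait` should make no difference, which is in line with the expectation that the kernels have already finished before the copy back starts.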

@jhdavis8
Author

I just tried adding the wait clause to that copy back directive. Still seeing the same failures in all the test environments.

@tomdeakin
Contributor

Is this related to #17?

@tom91136
Member

I can reproduce this on AArch64 CPUs with both GCC and NVHPC; it's likely the same on x86 as well.
