[nvidia] resampling primitive fails correctness check #1728

Open
dzarukin opened this issue Sep 15, 2023 · 0 comments
Labels: bug (a confirmed library bug), platform:nvidia-gpu

Summary

oneDNN validation for the NVIDIA backend hits a correctness issue in forward linear resampling for specific shapes under benchdnn.

Version

Latest master.

Environment

Hardware:

NVIDIA A100 80GB PCIe
(A10 should also work for most cases).

Software

SYCL Compiler with Nvidia support.
Any version that compiles without issues, preferably no later than April.
[Optional] TBB
Any version.
[Optional] OpenCL CPU
Latest version is preferable.

  • Optional means that the CPU backend can be enabled if the dependency is satisfied. Otherwise, it should be switched off.

Steps to reproduce

Build

mkdir -p build
cd build
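cmake .. -DCMAKE_BUILD_TYPE=release -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP -DDNNL_GPU_VENDOR=NVIDIA -DONEDNN_BUILD_GRAPH=OFF
cmake --build . --target benchdnn

CMAKE_BUILD_TYPE may be release or debug. DNNL_CPU_RUNTIME may be DPCPP, or NONE when the optional CPU dependencies are not available.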

Run

<env_vars> ./build/tests/benchdnn/benchdnn --resampling --engine=gpu --alg=linear ic32iw151ow300

For a full suite validation, use --batch=test_resampling_gpu instead of a specific test case.

Helper env vars:

CUDA_LOGINFO_DBG=1 CUDA_LOGDEST_DBG=stdout -- enables CUDA API dump
CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout -- enables cuDNN API dump
DNNL_VERBOSE=all (or desired level) -- enables oneDNN execution information
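
For repeated runs it can be convenient to script the reproducer together with the helper variables. The following is only a sketch: the binary path and the test case are copied from the Run step above, while the helper names (build_repro, run_repro, DEBUG_ENV) are hypothetical and not part of benchdnn.

```python
import os
import subprocess

# Path and arguments as given in the Run step of this issue.
BENCHDNN = "./build/tests/benchdnn/benchdnn"
CASE = ["--resampling", "--engine=gpu", "--alg=linear", "ic32iw151ow300"]

# Helper env vars listed in this issue.
DEBUG_ENV = {
    "CUDA_LOGINFO_DBG": "1",
    "CUDA_LOGDEST_DBG": "stdout",
    "CUDNN_LOGINFO_DBG": "1",
    "CUDNN_LOGDEST_DBG": "stdout",
    "DNNL_VERBOSE": "all",
}


def build_repro(extra_env=None):
    """Return (command, environment) for the reproducer without running it."""
    env = dict(os.environ)
    env.update(extra_env or {})
    return [BENCHDNN] + CASE, env


def run_repro(extra_env=None):
    """Run the reproducer and return the completed process with captured output."""
    cmd, env = build_repro(extra_env)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


cmd, env = build_repro(DEBUG_ENV)
print(" ".join(cmd))
```

Calling run_repro(DEBUG_ENV).stdout would then give the benchdnn output mixed with the API dumps, assuming the binary exists at that path relative to the current directory.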

Helper tips:

benchdnn supports verbosity through -vX. Most info is available at v6. It's possible to dump destination with -v99 when really needed.
benchdnn documentation is here: https://github.com/oneapi-src/oneDNN/tree/master/tests/benchdnn (scroll down). Reorder doc and others may be found through links.
The benchdnn binary also supports the --help command, which suggests using the driver-level help (for example, --bnorm --help, or --resampling --help for this driver) to dump all supported options.

Observed behavior

Failures are reproducible within a single run. There are a total of 55 failures of a similar nature.

create: --resampling --engine=gpu --alg=linear ic32iw151ow300
run: --resampling --engine=gpu --alg=linear ic32iw151ow300
[  88][DST][0:0:88] exp_f32:    0.674973 exp:    0.674973 got:     0.67503 diff:5.72205e-05 rdiff:8.47745e-05
[ 122][DST][0:0:122] exp_f32:     2.37499 exp:     2.37499 got:     2.37505 diff:5.72205e-05 rdiff:2.40929e-05
[ 128][DST][0:0:128] exp_f32:     5.17506 exp:     5.17506 got:     5.17494 diff:0.000114441 rdiff:2.21139e-05
[ 243][DST][0:0:243] exp_f32:     2.17503 exp:     2.17503 got:     2.17491 diff:0.000114441 rdiff:5.26159e-05
[ 249][DST][0:0:249] exp_f32:     4.97498 exp:     4.97498 got:     4.97509 diff:0.000114441 rdiff:2.30033e-05
[ 257][DST][0:0:257] exp_f32:     1.62506 exp:     1.62506 got:     1.62483 diff:0.000228882 rdiff:0.000140845
[ 276][DST][0:0:276] exp_f32:     3.30212 exp:     3.30212 got:     3.30202 diff:9.53674e-05 rdiff:2.88807e-05
[ 278][DST][0:0:278] exp_f32:     11.4249 exp:     11.4249 got:     11.4252 diff:0.000228882 rdiff:2.00335e-05
[ 324][DST][0:1:24] exp_f32:     1.05208 exp:     1.05208 got:     1.05211 diff:2.98023e-05 rdiff:2.8327e-05
0:FAILED (errors:629 total:19200) __REPRO: --resampling --engine=gpu --alg=linear ic32iw151ow300
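
To triage a larger dump of such lines, a small parser can help. The sketch below is not part of benchdnn: the regular expression is inferred only from the lines shown above, and the relation rdiff = diff / |exp| is an assumption that happens to match the listed values.

```python
import re

# Pattern inferred from the mismatch lines in this issue; other benchdnn
# versions or drivers may print a different layout.
LINE_RE = re.compile(
    r"\[\s*(?P<idx>\d+)\]\[(?P<kind>\w+)\]\[(?P<pos>[\d:]+)\]\s*"
    r"exp_f32:\s*(?P<exp_f32>\S+)\s+exp:\s*(?P<exp>\S+)\s+got:\s*(?P<got>\S+)\s+"
    r"diff:\s*(?P<diff>\S+)\s+rdiff:\s*(?P<rdiff>\S+)"
)


def parse_mismatches(text):
    """Extract one dict per mismatch line found in the benchdnn output."""
    rows = []
    for m in LINE_RE.finditer(text):
        rows.append({
            "idx": int(m["idx"]),
            "pos": m["pos"],
            "exp": float(m["exp"]),
            "got": float(m["got"]),
            "diff": float(m["diff"]),
            "rdiff": float(m["rdiff"]),
        })
    return rows


# Two lines copied verbatim from the log above.
SAMPLE = """\
[  88][DST][0:0:88] exp_f32:    0.674973 exp:    0.674973 got:     0.67503 diff:5.72205e-05 rdiff:8.47745e-05
[ 257][DST][0:0:257] exp_f32:     1.62506 exp:     1.62506 got:     1.62483 diff:0.000228882 rdiff:0.000140845
"""

rows = parse_mismatches(SAMPLE)
for r in rows:
    # Assumed relation: relative diff is the absolute diff scaled by |exp|.
    recomputed = r["diff"] / abs(r["exp"])
    print(r["pos"], r["rdiff"], recomputed)

worst = max(rows, key=lambda r: r["rdiff"])
print("largest rdiff at", worst["pos"])
```

Sorting the parsed rows by position or by rdiff may show whether the mismatches cluster at particular output coordinates.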

The nature of the mismatch is unknown. Since both diff and rdiff are high for fp32, the cause needs to be understood. My only hypothesis is that the original grid is not aligned between oneDNN and cuDNN. To check this, I suggest filling the resampling source tensor with a single value and seeing how that affects the output. Further investigation can build on the results of that experiment.
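
Why a single-value source isolates a grid problem can be sketched in plain Python. The two coordinate mappings below (half-pixel and align-corners) are common conventions chosen for illustration only; this sketch does not claim which mapping either oneDNN or cuDNN uses, and linear_resample_1d and max_gap are hypothetical helpers.

```python
import math


def linear_resample_1d(src, ow, align_corners=False):
    """Linearly resample a 1-D list `src` to `ow` points."""
    iw = len(src)
    dst = []
    for o in range(ow):
        if align_corners:
            # First and last points of both grids coincide.
            x = o * (iw - 1) / (ow - 1) if ow > 1 else 0.0
        else:
            # Half-pixel mapping: pixel centers are aligned.
            x = (o + 0.5) * iw / ow - 0.5
        x = min(max(x, 0.0), iw - 1.0)  # clamp to the valid source range
        left = int(math.floor(x))
        right = min(left + 1, iw - 1)
        w = x - left
        dst.append((1.0 - w) * src[left] + w * src[right])
    return dst


def max_gap(src, ow):
    """Largest absolute difference between the two grid conventions."""
    a = linear_resample_1d(src, ow, align_corners=False)
    b = linear_resample_1d(src, ow, align_corners=True)
    return max(abs(p - q) for p, q in zip(a, b))


IW, OW = 151, 300  # sizes from the failing case ic32iw151ow300

const_gap = max_gap([3.0] * IW, OW)
ramp_gap = max_gap([float(i) for i in range(IW)], OW)
print(f"constant source gap: {const_gap:.3e}")
print(f"ramp source gap:     {ramp_gap:.3e}")
```

With a constant source, the interpolation weights always sum to one, so the result is independent of where the sample points fall. If the benchdnn mismatches disappear with constant filling, that points toward the grid; if they persist, the cause is more likely elsewhere.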

Expected behavior

The issue does not appear in either the single-run validation or the full batch.

dzarukin added the bug label on Sep 15, 2023.