[nvidia] batch normalization primitive fails correctness check #1725

Open
dzarukin opened this issue Sep 14, 2023 · 2 comments
Labels: bug (A confirmed library bug), platform:nvidia-gpu

Summary

oneDNN validation for the Nvidia backend hits a correctness issue in backward batch normalization with scale and shift under benchdnn.

Version

Latest master.

Environment

Hardware:

NVIDIA A100 80GB PCIe (an A10 should also work for most cases).

Software

SYCL compiler with Nvidia support: any version that compiles without issues, preferably no later than April.
[Optional] TBB: any version.
[Optional] OpenCL CPU: the latest version is preferable.

  • Optional means the CPU backend can be enabled if the dependency is satisfied; otherwise, it should be switched off.

Steps to reproduce

Build

mkdir -p build
cd build
cmake .. -DCMAKE_BUILD_TYPE=release (or debug) -DDNNL_CPU_RUNTIME=DPCPP (or NONE) -DDNNL_GPU_RUNTIME=DPCPP -DDNNL_GPU_VENDOR=NVIDIA -DONEDNN_BUILD_GRAPH=OFF
cmake --build . --target benchdnn

Run

<env_vars> ./build/tests/benchdnn/benchdnn --bnorm --engine=gpu --dir=BWD_DW --flags=CHR mb16ic64ih147

For a full suite validation, use --batch=test_bnorm_gpu instead of a specific test case.

Helper env vars:

CUDA_LOGINFO_DBG=1 CUDA_LOGDEST_DBG=stdout -- enables the CUDA API dump
CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout -- enables the cuDNN API dump
DNNL_VERBOSE=all (or a desired level) -- enables oneDNN execution information
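
For example, to rerun the repro with oneDNN verbose output enabled:

DNNL_VERBOSE=all ./build/tests/benchdnn/benchdnn --bnorm --engine=gpu --dir=BWD_DW --flags=CHR mb16ic64ih147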

Helper tips:

benchdnn supports verbosity through -vX; most info is available at -v6, and it's possible to dump the destination tensor with -v99 when really needed.
benchdnn documentation is here: https://github.com/oneapi-src/oneDNN/tree/master/tests/benchdnn (scroll down). The reorder doc and others can be found through the links there.
The benchdnn binary also supports the --help command, which hints to use --bnorm --help to dump all supported options.

Observed behavior

Failures are reproducible within a single run; there are a total of 8 failures of a similar nature.

run: --bnorm --engine=gpu --dir=BWD_DW --flags=CHR mb4ic16ih147
[SRC][L0] = 14406.4
[SRC][L1] exp:  138595 got:  156490 diff: 17897.3 rel_diff:0.129133
[SRC][L2] exp: 1614.94 got: 1647.45 diff: 325.677 rel_diff:0.201665
[SRC][L8] exp:   57.33 got:   57.33 diff: 7.16227 rel_diff:0.124931
[SH][L0] = 0.416667
[SH][L1] exp:     425 got:     430 diff:       5 rel_diff:0.0117647
[SH][L2] exp: 152.076 got: 152.552 diff:       5 rel_diff:0.0328784
[SH][L8] exp:      87 got:      87 diff:       5 rel_diff:0.0574713
0:FAILED (errors:1 total:16) __REPRO: --bnorm --engine=gpu --dir=BWD_DW --flags=CHR mb4ic16ih147

The output means that diff_shift and diff_src were not computed properly. I suggest checking what's going on with diff_shift first, because it is just a per-channel reduction of diff_dst (summed over all dimensions except the channel one) and should be easier to figure out than the diff_src issue.
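
For reference, a minimal C++ sketch of that reduction, assuming a dense NCHW layout (the function name and signature are illustrative, not the library implementation):

#include <cstddef>
#include <vector>

// diff_shift[c] = sum of diff_dst over all points belonging to channel c.
std::vector<float> diff_shift_ref(const std::vector<float> &diff_dst,
        std::size_t mb, std::size_t ic, std::size_t ih, std::size_t iw) {
    std::vector<float> diff_shift(ic, 0.f);
    for (std::size_t n = 0; n < mb; ++n)
        for (std::size_t c = 0; c < ic; ++c)
            for (std::size_t sp = 0; sp < ih * iw; ++sp)
                // Flat NCHW offset: ((n * ic + c) * ih * iw) + sp.
                diff_shift[c] += diff_dst[(n * ic + c) * ih * iw + sp];
    return diff_shift;
}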

When dumping each point with -v99, the output looks like this:

[COMPARE][SH]: trh=5e-06 zero_trust%=30.00% extra=use_norm:true
[   0][SH][0] exp_f32:           7 exp:           7 got:           7 diff:       0 rdiff:       0
[   1][SH][1] exp_f32:          -8 exp:          -8 got:          -8 diff:       0 rdiff:       0
[   2][SH][2] exp_f32:          51 exp:          51 got:          51 diff:       0 rdiff:       0
[   3][SH][3] exp_f32:          -1 exp:          -1 got:          -1 diff:       0 rdiff:       0
[   4][SH][4] exp_f32:         -11 exp:         -11 got:         -11 diff:       0 rdiff:       0
[   5][SH][5] exp_f32:          44 exp:          44 got:          44 diff:       0 rdiff:       0
[   6][SH][6] exp_f32:           1 exp:           1 got:           1 diff:       0 rdiff:       0
[   7][SH][7] exp_f32:         -42 exp:         -42 got:         -42 diff:       0 rdiff:       0
[   8][SH][8] exp_f32:         -35 exp:         -35 got:         -35 diff:       0 rdiff:       0
[   9][SH][9] exp_f32:           6 exp:           6 got:           6 diff:       0 rdiff:       0
[  10][SH][10] exp_f32:          12 exp:          12 got:          17 diff:       5 rdiff:0.416667
[  11][SH][11] exp_f32:         -84 exp:         -84 got:         -84 diff:       0 rdiff:       0
[  12][SH][12] exp_f32:          12 exp:          12 got:          12 diff:       0 rdiff:       0
[  13][SH][13] exp_f32:           4 exp:           4 got:           4 diff:       0 rdiff:       0
[  14][SH][14] exp_f32:         -87 exp:         -87 got:         -87 diff:       0 rdiff:       0
[  15][SH][15] exp_f32:          20 exp:          20 got:          20 diff:       0 rdiff:       0
[SH][L0] = 0.416667
[SH][L1] exp:     425 got:     430 diff:       5 rel_diff:0.0117647
[SH][L2] exp: 152.076 got: 152.552 diff:       5 rel_diff:0.0328784
[SH][L8] exp:      87 got:      87 diff:       5 rel_diff:0.0574713

Only a single point value (index 10) is incorrect.

Expected behavior

The issue should not appear, either during the single-case run or under the full-batch validation.

@dzarukin dzarukin added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Sep 14, 2023
@dzarukin dzarukin added bug A confirmed library bug and removed sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Sep 14, 2023
@dzarukin dzarukin changed the title [nvidia] batch normalization primitive fails correctness check, case 1 [nvidia] batch normalization primitive fails correctness check Sep 15, 2023
sgeor255 (Contributor) commented Nov 9, 2023

To summarise: the forward pass of batchnorm computes means that are close to the expected values but still off by a near-zero amount. Because the src of the forward batchnorm contains values equal to the expected mean, the forward output at those entries is close to zero but not exactly zero (2.98e-8, to be exact) where exact zeros were expected. These values propagate into the backward pass, where the relu backward step lets the gradient from dy flow into dx at those points, which causes the error that was found.
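
A tiny standalone sketch of this failure mode; the numbers are illustrative, not taken from the actual run (2.98e-8 is one ulp below 0.25 in f32, diff_dst=5 mirrors the diff of 5 in the log above, and scale/inv_std are simplified to 1):

#include <cstdio>

int main() {
    float x = 0.25f;                        // src value equal to the exact mean
    float exact_mean = 0.25f;
    float computed_mean = 0.25f - 2.98e-8f; // mean with a one-ulp rounding error

    float scale = 1.0f, inv_std = 1.0f;     // simplified for illustration
    float y_ref = scale * (x - exact_mean) * inv_std;    // exactly 0
    float y_got = scale * (x - computed_mean) * inv_std; // ~2.98e-8, positive

    // relu backward propagates the gradient only where the forward output
    // was positive, so the tiny positive value flips the mask from 0 to 1.
    float diff_dst = 5.0f;
    float dx_ref = (y_ref > 0.f) ? diff_dst : 0.f; // 0
    float dx_got = (y_got > 0.f) ? diff_dst : 0.f; // 5
    std::printf("dx_ref = %g, dx_got = %g\n", dx_ref, dx_got);
    return 0;
}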

As discussed offline, we won't be making any changes to benchdnn to account for this issue. Is there any further analysis or work needed here, or can we close it?

dzarukin (Contributor, Author) commented Nov 9, 2023

@sgeor255 I'll close the ticket once the code suppressing these failures lands in master. I'll handle it. Thank you.
