Why do the "reorder" operations of the same operator take very different times on the CPU and GPU platforms? #1902

feixuedudiao · 2024-05-08T09:33:59Z

I ran oneDNN code on an i7-10700 @ 2.9 GHz with an HD 630 GPU. Reorders on the GPU are much slower than on the CPU, and the data type is f32 in both cases.
Verbose log converter output for CPU:

prim_kind shapes ncalls time(ms) overall% agg_ncalls(avg) agg_time(ms) agg_overall%
reorder 192x192x1x1 5 0.19 30.82 5.00 0.19 30.82
reorder 256x256x1x1 2 0.15 24.73 3.50 0.35 55.55
reorder 256x192x1x1 1 0.05 8.10 2.67 0.40 63.65
reorder 48x48x3x3 2 0.04 7.20 2.50 0.44 70.85
reorder 64x64x3x3 1 0.04 6.62 2.20 0.48 77.48
reorder 32x32x3x3 3 0.03 4.84 2.33 0.51 82.32
reorder 192x96x1x1 1 0.03 4.49 2.14 0.54 86.80
reorder 96x96x1x1 2 0.03 4.45 2.12 0.57 91.25
reorder 32x32x1x1 9 0.01 1.51 2.89 0.58 92.77
reorder 192x1x1x3x3 6 0.01 1.38 3.20 0.59 94.15
reorder 64x256x1x1 2 0.01 1.37 3.09 0.59 95.51
reorder 256x1x1x3x3 2 0.00 0.72 3.00 0.60 96.24
reorder 32x4x3x3 1 0.00 0.66 2.85 0.60 96.90
reorder 48x192x1x1 1 0.00 0.50 2.71 0.61 97.40
reorder 48x64x1x1 2 0.00 0.45 2.67 0.61 97.85
reorder 48x48x1x1 2 0.00 0.37 2.63 0.61 98.22
reorder 96x1x1x3x3 3 0.00 0.34 2.65 0.61 98.55
reorder 32x1x1x3x3 3 0.00 0.32 2.67 0.61 98.87
reorder 64x64x1x1 1 0.00 0.29 2.58 0.62 99.16
reorder 32x48x1x1 2 0.00 0.27 2.55 0.62 99.44
reorder 96x32x1x1 1 0.00 0.24 2.48 0.62 99.68
reorder 32x96x1x1 1 0.00 0.18 2.41 0.62 99.86
reorder 1x32x3x3 1 0.00 0.14 2.35 0.62 100.00

Verbose log converter output for GPU:
prim_kind shapes ncalls time(ms) overall% agg_ncalls(avg) agg_time(ms) agg_overall%
reorder 32x32x1x1 9 3.57 16.12 9.00 3.57 16.12
reorder 32x32x3x3 3 1.60 7.22 6.00 5.17 23.34
reorder 192x192x1x1 5 1.42 6.41 5.67 6.60 29.75
reorder 192x1x1x3x3 6 1.37 6.17 5.75 7.96 35.91
reorder 32x1x1x3x3 3 1.26 5.70 5.20 9.23 41.61
reorder 96x1x1x3x3 3 1.17 5.26 4.83 10.39 46.87
reorder 32x48x1x1 2 0.97 4.37 4.43 11.36 51.24
reorder 64x256x1x1 2 0.91 4.09 4.12 12.27 55.33
reorder 48x64x1x1 2 0.87 3.94 3.89 13.14 59.27
reorder 256x256x1x1 2 0.87 3.90 3.70 14.01 63.17
reorder 256x1x1x3x3 2 0.86 3.88 3.55 14.87 67.05
reorder 48x48x3x3 2 0.86 3.88 3.42 15.73 70.93
reorder 48x48x1x1 2 0.81 3.66 3.31 16.54 74.59
reorder 48x192x1x1 1 0.75 3.37 3.14 17.29 77.96
reorder 32x4x3x3 1 0.71 3.20 3.00 18.00 81.16
reorder 96x96x1x1 2 0.70 3.17 2.94 18.70 84.33
reorder 96x32x1x1 1 0.58 2.62 2.82 19.28 86.95
reorder 192x96x1x1 1 0.57 2.56 2.72 19.85 89.50
reorder 1x32x3x3 1 0.55 2.47 2.63 20.39 91.98
reorder 64x64x3x3 1 0.51 2.32 2.55 20.91 94.30
reorder 32x96x1x1 1 0.47 2.11 2.48 21.38 96.41
reorder 256x192x1x1 1 0.42 1.90 2.41 21.80 98.31
reorder 64x64x1x1 1 0.37 1.69 2.35 22.17 100.00
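
One way to read the GPU table above is to divide each row's time by its call count. The sketch below does that for three rows copied from the table. The `per_call_ms` helper is a hypothetical name, and the column order (prim_kind, shapes, ncalls, time(ms), ...) is assumed from the header line.

```python
def per_call_ms(row):
    """Return (shape, time per call in ms) for one converter row."""
    fields = row.split()
    # Assumed column order: prim_kind, shapes, ncalls, time(ms), ...
    shape, ncalls, time_ms = fields[1], int(fields[2]), float(fields[3])
    return shape, time_ms / ncalls

# Three rows copied verbatim from the GPU table.
gpu_rows = [
    "reorder 32x32x1x1 9 3.57 16.12 9.00 3.57 16.12",
    "reorder 192x192x1x1 5 1.42 6.41 5.67 6.60 29.75",
    "reorder 64x64x1x1 1 0.37 1.69 2.35 22.17 100.00",
]

for row in gpu_rows:
    shape, ms = per_call_ms(row)
    print(f"{shape}: {ms:.3f} ms per call")
```

For these rows the per-call time stays within the same order of magnitude while the element count differs by a factor of 36 (32x32 versus 192x192). That would be consistent with a fixed per-call cost dominating, though the log alone cannot confirm it.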

init_cpu.log
init_gpu.log

Fei.
Thanks, best wishes.

vpirogov · 2024-05-08T20:30:35Z

@feixuedudiao,

Performance differences are expected on different platforms. One thing to note, though, is that oneDNN verbose mode has non-trivial performance overhead, in particular on GPUs, and cannot be reliably used to measure performance. You can use benchdnn in performance validation mode to get accurate performance measurements.
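
Independently of benchdnn, a common way to keep one-time costs out of a hand-rolled measurement is to discard a few warm-up calls and average over many timed ones. The sketch below shows that pattern in Python with a placeholder workload. `average_time_ms` and `fake_reorder` are hypothetical names, not oneDNN or benchdnn APIs. With a real asynchronous GPU workload, the timed callable would also have to wait for completion before returning, which this sketch does not model.

```python
import time

def average_time_ms(run_once, warmup=5, iters=100):
    """Average wall-clock time of run_once() in ms, excluding warm-up calls."""
    for _ in range(warmup):
        run_once()  # discarded: absorbs one-time setup cost
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / iters

def fake_reorder():
    """Hypothetical stand-in for the operation being measured."""
    sum(range(1000))

print(f"avg: {average_time_ms(fake_reorder):.5f} ms")
```

Timing a single first call, by contrast, folds any setup cost into the reported number.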

feixuedudiao · 2024-05-11T01:12:15Z

@vpirogov, I timed the reorders of the convolution src and weights on the CPU and on the GPU; the code is from the primitives example convolution.cpp. The results for CPU and GPU are 986 microseconds and 999604 microseconds respectively, so the GPU is many times slower than the CPU. Is there a way to improve the performance of the GPU reorder?

shu1chen · 2024-05-11T10:12:32Z

@feixuedudiao First of all, it doesn't make sense to compare the performance of primitives on CPU and GPU without considering the GPU hardware capabilities and configurations.
For your case, if you insist on comparing them, please use the performance testing mode of benchdnn. Here is an example command line to check the 32x32x1x1 reorder:

./benchdnn --reorder --mode=P --reset --allow-enum-tags-only=0 --engine=gpu  --runtime-dim-mask=  --sdt=f32 --ddt=f32  --stag=abcd --dtag=Acdb16a --strides=:   32x32x1x1

I tested this command line on a new laptop with the latest Intel integrated GPU, and it shows that performance on the GPU is better than on the CPU:
Avg. time on CPU: 0.00714332 ms
Avg. time on GPU: 0.00160363 ms
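
To put the figures quoted in this thread side by side, the short sketch below computes the two ratios. The `ratio` helper is a hypothetical name, and the two measurements come from different machines and different measurement methods, so the ratios are not directly comparable to each other.

```python
def ratio(larger, smaller):
    """How many times `larger` exceeds `smaller` (both in the same unit)."""
    return larger / smaller

# Example-based timing reported earlier in the thread (microseconds).
gpu_over_cpu_example = ratio(999604, 986)
# benchdnn averages reported just above (milliseconds).
cpu_over_gpu_benchdnn = ratio(0.00714332, 0.00160363)

print(f"example-based timing: GPU about {gpu_over_cpu_example:.0f}x slower than CPU")
print(f"benchdnn timing: GPU about {cpu_over_gpu_benchdnn:.2f}x faster than CPU")
```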

feixuedudiao added the question label May 8, 2024

yehudaorel assigned onednnsupporttriage May 8, 2024

vpirogov assigned vpirogov and unassigned onednnsupporttriage May 8, 2024

shu1chen added the platform:intel-gpu label May 9, 2024
