Why do the "reorder" operations of the same operator take very different times on the CPU and GPU platforms? #1902

Open
feixuedudiao opened this issue May 8, 2024 · 3 comments


@feixuedudiao

I ran oneDNN code on an i7-10700 @ 2.9 GHz whose GPU is an HD 630. The reorder is much slower on the GPU than on the CPU, and the data type is f32 in both cases.
Verbose log converter output for the CPU:

prim_kind shapes ncalls time(ms) overall% agg_ncalls(avg) agg_time(ms) agg_overall%
reorder 192x192x1x1 5 0.19 30.82 5.00 0.19 30.82
reorder 256x256x1x1 2 0.15 24.73 3.50 0.35 55.55
reorder 256x192x1x1 1 0.05 8.10 2.67 0.40 63.65
reorder 48x48x3x3 2 0.04 7.20 2.50 0.44 70.85
reorder 64x64x3x3 1 0.04 6.62 2.20 0.48 77.48
reorder 32x32x3x3 3 0.03 4.84 2.33 0.51 82.32
reorder 192x96x1x1 1 0.03 4.49 2.14 0.54 86.80
reorder 96x96x1x1 2 0.03 4.45 2.12 0.57 91.25
reorder 32x32x1x1 9 0.01 1.51 2.89 0.58 92.77
reorder 192x1x1x3x3 6 0.01 1.38 3.20 0.59 94.15
reorder 64x256x1x1 2 0.01 1.37 3.09 0.59 95.51
reorder 256x1x1x3x3 2 0.00 0.72 3.00 0.60 96.24
reorder 32x4x3x3 1 0.00 0.66 2.85 0.60 96.90
reorder 48x192x1x1 1 0.00 0.50 2.71 0.61 97.40
reorder 48x64x1x1 2 0.00 0.45 2.67 0.61 97.85
reorder 48x48x1x1 2 0.00 0.37 2.63 0.61 98.22
reorder 96x1x1x3x3 3 0.00 0.34 2.65 0.61 98.55
reorder 32x1x1x3x3 3 0.00 0.32 2.67 0.61 98.87
reorder 64x64x1x1 1 0.00 0.29 2.58 0.62 99.16
reorder 32x48x1x1 2 0.00 0.27 2.55 0.62 99.44
reorder 96x32x1x1 1 0.00 0.24 2.48 0.62 99.68
reorder 32x96x1x1 1 0.00 0.18 2.41 0.62 99.86
reorder 1x32x3x3 1 0.00 0.14 2.35 0.62 100.00

Verbose log converter output for the GPU:
prim_kind shapes ncalls time(ms) overall% agg_ncalls(avg) agg_time(ms) agg_overall%
reorder 32x32x1x1 9 3.57 16.12 9.00 3.57 16.12
reorder 32x32x3x3 3 1.60 7.22 6.00 5.17 23.34
reorder 192x192x1x1 5 1.42 6.41 5.67 6.60 29.75
reorder 192x1x1x3x3 6 1.37 6.17 5.75 7.96 35.91
reorder 32x1x1x3x3 3 1.26 5.70 5.20 9.23 41.61
reorder 96x1x1x3x3 3 1.17 5.26 4.83 10.39 46.87
reorder 32x48x1x1 2 0.97 4.37 4.43 11.36 51.24
reorder 64x256x1x1 2 0.91 4.09 4.12 12.27 55.33
reorder 48x64x1x1 2 0.87 3.94 3.89 13.14 59.27
reorder 256x256x1x1 2 0.87 3.90 3.70 14.01 63.17
reorder 256x1x1x3x3 2 0.86 3.88 3.55 14.87 67.05
reorder 48x48x3x3 2 0.86 3.88 3.42 15.73 70.93
reorder 48x48x1x1 2 0.81 3.66 3.31 16.54 74.59
reorder 48x192x1x1 1 0.75 3.37 3.14 17.29 77.96
reorder 32x4x3x3 1 0.71 3.20 3.00 18.00 81.16
reorder 96x96x1x1 2 0.70 3.17 2.94 18.70 84.33
reorder 96x32x1x1 1 0.58 2.62 2.82 19.28 86.95
reorder 192x96x1x1 1 0.57 2.56 2.72 19.85 89.50
reorder 1x32x3x3 1 0.55 2.47 2.63 20.39 91.98
reorder 64x64x3x3 1 0.51 2.32 2.55 20.91 94.30
reorder 32x96x1x1 1 0.47 2.11 2.48 21.38 96.41
reorder 256x192x1x1 1 0.42 1.90 2.41 21.80 98.31
reorder 64x64x1x1 1 0.37 1.69 2.35 22.17 100.00
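
The two tables list the same shapes in different orders, which makes them awkward to read side by side. A minimal Python sketch, assuming the rows are pasted into strings as shown, pairs the rows by shape and prints the ratio of the `time(ms)` columns. Only three shapes are included for brevity, and the names `parse_rows` and `compare` are illustrative, not part of any oneDNN tool.

```python
# Rows copied from the CPU and GPU tables (subset for brevity).
CPU_ROWS = """\
reorder 192x192x1x1 5 0.19 30.82 5.00 0.19 30.82
reorder 256x256x1x1 2 0.15 24.73 3.50 0.35 55.55
reorder 32x32x1x1 9 0.01 1.51 2.89 0.58 92.77
"""

GPU_ROWS = """\
reorder 32x32x1x1 9 3.57 16.12 9.00 3.57 16.12
reorder 192x192x1x1 5 1.42 6.41 5.67 6.60 29.75
reorder 256x256x1x1 2 0.87 3.90 3.70 14.01 63.17
"""


def parse_rows(text):
    """Map shape -> (ncalls, value of the time(ms) column) for each row."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "reorder":
            continue  # skip the header line and blank lines
        shape, ncalls, time_ms = fields[1], int(fields[2]), float(fields[3])
        result[shape] = (ncalls, time_ms)
    return result


def compare(cpu_text, gpu_text):
    """Return {shape: gpu_time / cpu_time} for shapes found in both tables."""
    cpu, gpu = parse_rows(cpu_text), parse_rows(gpu_text)
    ratios = {}
    for shape, (_, cpu_ms) in cpu.items():
        # CPU times printed as 0.00 cannot be used as a denominator.
        if shape in gpu and cpu_ms > 0:
            ratios[shape] = gpu[shape][1] / cpu_ms
    return ratios


if __name__ == "__main__":
    for shape, ratio in sorted(compare(CPU_ROWS, GPU_ROWS).items()):
        print(f"{shape}: GPU/CPU = {ratio:.1f}x")
```

The table values are rounded to 0.01 ms, so ratios for the smallest CPU times are only rough, and rows shown as 0.00 are skipped.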

init_cpu.log
init_gpu.log

Thanks and best wishes,
Fei

@vpirogov
Member

vpirogov commented May 8, 2024

@feixuedudiao,

Performance differences are expected on different platforms. One thing to note, though, is that oneDNN verbose mode has non-trivial performance overhead, in particular on GPUs, and cannot be reliably used to measure performance. You can use benchdnn in performance validation mode to get accurate performance measurements.

@feixuedudiao
Author

feixuedudiao commented May 11, 2024

@vpirogov, I timed the reorders of the convolution src and weights on the CPU and on the GPU, using code from the convolution.cpp primitive example. The CPU took 986 microseconds and the GPU took 999604 microseconds, so the GPU is many times slower than the CPU. Is there a way to improve the performance of the GPU reorder?
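
The thread does not say how the two figures were measured. One common pitfall, offered here as an assumption rather than a diagnosis, is that the first call is included in the timing and carries one-time setup cost. The Python sketch below separates the first call from the steady-state average. `make_fake_reorder` is a hypothetical stand-in for the real work and does not call oneDNN.

```python
import time


def time_steady_state(fn, warmup=5, iters=100):
    """Return (first_call_s, avg_steady_s) for a zero-argument callable."""
    start = time.perf_counter()
    fn()  # first call: may include one-time setup
    first = time.perf_counter() - start

    for _ in range(warmup):  # untimed warm-up calls
        fn()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    avg = (time.perf_counter() - start) / iters
    return first, avg


def make_fake_reorder(setup_s=0.05):
    """Hypothetical workload: slow on the first call, fast afterwards."""
    state = {"ready": False}

    def run():
        if not state["ready"]:
            time.sleep(setup_s)  # simulated one-time setup cost
            state["ready"] = True

    return run


if __name__ == "__main__":
    first, avg = time_steady_state(make_fake_reorder())
    print(f"first call: {first * 1e6:.0f} us, steady state: {avg * 1e6:.2f} us")
```

When the work is submitted to an asynchronous GPU queue, `fn` has to block until the work has finished; otherwise only the submission time is measured.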

@shu1chen
Contributor

shu1chen commented May 11, 2024

@feixuedudiao First of all, it doesn't make sense to compare the performance of primitives on CPU and GPU without considering the capabilities and configuration of the GPU hardware.
If you still want to compare them, please use the performance testing mode of benchdnn. Here is an example command line that checks the 32x32x1x1 reorder:

./benchdnn --reorder --mode=P --reset --allow-enum-tags-only=0 --engine=gpu --runtime-dim-mask= --sdt=f32 --ddt=f32 --stag=abcd --dtag=Acdb16a --strides=: 32x32x1x1

I ran this command line on a new laptop with the latest Intel integrated GPU hardware, and the GPU performed better than the CPU:
Avg. time on CPU: 0.00714332 ms
Avg. time on GPU: 0.00160363 ms
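
To repeat this comparison for several shapes, the command line can be generated rather than typed by hand. The sketch below is a minimal Python helper that reuses the flags from the quoted command and varies only `--engine` and the shape. The `--engine=cpu` value and the chosen shapes are assumptions taken from the tables earlier in the thread, and `reorder_cmd` is an illustrative name. The script only prints the commands; running them requires a built benchdnn binary at the given path.

```python
import shlex


def reorder_cmd(engine, shape, benchdnn="./benchdnn"):
    """Build the benchdnn argument list for one f32 reorder problem.

    All flags are copied from the command quoted in the thread; only the
    engine and the shape change.
    """
    return [
        benchdnn, "--reorder", "--mode=P", "--reset",
        "--allow-enum-tags-only=0", f"--engine={engine}",
        "--runtime-dim-mask=", "--sdt=f32", "--ddt=f32",
        "--stag=abcd", "--dtag=Acdb16a", "--strides=:", shape,
    ]


if __name__ == "__main__":
    # 4D shapes only: the abcd / Acdb16a tags describe four dimensions.
    for shape in ["32x32x1x1", "192x192x1x1", "256x256x1x1"]:
        for engine in ("cpu", "gpu"):
            print(shlex.join(reorder_cmd(engine, shape)))
```

The 5D shapes in the tables, such as 192x1x1x3x3, would need five-dimensional tags, which this thread does not provide.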


4 participants