
CUDA stencil inefficiency (compared to SYCL) #590

Open · AtlantaPepsi opened this issue Sep 14, 2021 · 3 comments

AtlantaPepsi (Contributor) commented on Sep 14, 2021:

As observed here, the CUDA stencil operation appears to be much slower than DPC++ across all block sizes on NVIDIA devices. I also ran the problem (8000 grid points, 100 iterations) on a V100, and the results are as follows:

| Block size | CUDA Rate (MF/s) | SYCL Rate (MF/s) | CUDA Avg time (s) | SYCL Avg time (s) |
|-----------:|-----------------:|-----------------:|------------------:|------------------:|
| 1          | 12375.8          | 12367.4          | 0.180636          | 0.18076           |
| 2          | 49495.2          | 49439.3          | 0.0451665         | 0.0452175         |
| 4          | 197629           | 197099           | 0.0113117         | 0.0113421         |
| 8          | 487579           | 575118           | 0.00458494        | 0.00388707        |
| 16         | 571705           | 696173           | 0.00391027        | 0.00321116        |
| 32         | 478201           | 684394           | 0.00467486        | 0.00326643        |

Although the difference is not as pronounced as in @jeffhammond's results, which were obtained on a DGX-A100, CUDA is still quite a bit slower than SYCL on either platform. Here are the simple build commands I used:

```
# CUDA 11.1
nvcc -g -O3 -std=c++17 --gpu-architecture=sm_70 -D_X86INTRIN_H_INCLUDED stencil-cuda.cu -o stencil-cuda

# clang 13.0.0, intel/llvm commit f126512
clang++ -g -O3 -std=c++17 -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda-sycldevice stencil-sycl.cc -o stencil-sycl-oneapi
```

Upon a quick inspection with nvprof, there seems to be no additional overhead outside the two computational kernels (add and star).
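For anyone reproducing this, a profile along the lines of `nvprof ./stencil-cuda 100 8000` (assuming the usual PRK `<iterations> <grid size>` argument order) should give the same per-kernel breakdown.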

Furthermore, add and star roughly split the total avg runtime above evenly in CUDA, but not in SYCL (at least on V100). While star takes roughly the same time in both, add on CUDA is about 50% slower around the optimal block sizes (i.e. > 8). Given how simple the add kernel is, I reckon this slowdown probably shouldn't be attributed to problematic memory access patterns.
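For context, a minimal sketch of the shape of the two kernels (my paraphrase for illustration, not the exact PRK source) looks like this:

```cuda
// Sketch only, not the exact PRK source. "in" and "out" are n*n
// row-major grids, "w" is the (2r+1)*(2r+1) weight matrix, and r is
// the stencil radius.

__global__ void add(int n, double * in) {
    // Pointwise increment: one global load and one global store per element.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n) {
        in[i * n + j] += 1.0;
    }
}

__global__ void star(int n, int r, const double * w,
                     const double * in, double * out) {
    // Star-shaped stencil: 4r+1 neighbor loads per interior point.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= r && i < n - r && j >= r && j < n - r) {
        double t = w[r * (2 * r + 1) + r] * in[i * n + j];        // center
        for (int k = 1; k <= r; ++k) {
            t += w[(r - k) * (2 * r + 1) + r] * in[(i - k) * n + j]; // up
            t += w[(r + k) * (2 * r + 1) + r] * in[(i + k) * n + j]; // down
            t += w[r * (2 * r + 1) + (r - k)] * in[i * n + (j - k)]; // left
            t += w[r * (2 * r + 1) + (r + k)] * in[i * n + (j + k)]; // right
        }
        out[i * n + j] += t;
    }
}
```

Since add does close to the minimum possible work per element, a 50% gap there is more suggestive of launch configuration or code generation than of the stencil math itself.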

It'd be interesting to see whether a similar CUDA slowdown shows up in other kernels. For now I will be looking into the PTX binaries; perhaps I can spot the exact instructions that incur this slowdown on CUDA.
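For the CUDA binary, the PTX can be dumped with `cuobjdump -ptx stencil-cuda` or regenerated directly with `nvcc -ptx stencil-cuda.cu`; diffing that against whatever PTX the SYCL toolchain emits for the same kernels should help localize the difference.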

wangzy0327 commented:
@AtlantaPepsi Hello, I'm trying to run the Cxx11 CUDA and SYCL programs on an x86_64 machine, but I'm unfamiliar with the make.defs config. Could you give me a more precise make.defs config for CUDA and SYCL? Thank you very much!

AtlantaPepsi (Contributor, Author) commented:
Hi @wangzy0327, what do you mean by "more precise make.defs config"? Are you having a build error, or does the executable fail to produce similar results with the existing flags for cuda/oneapi?

wangzy0327 commented:
This is the build error; I have the libboost-dev package installed.

```
g++-11 -std=gnu++17 -pthread -O3 -mtune=native -ffast-math -Wall -Wno-ignored-attributes -Wno-deprecated-declarations -DPRKVERSION="2020" stencil-ranges.cc -DUSE_BOOST_IRANGE -I/usr/include/boost/ -DUSE_RANGES -o stencil-ranges
In file included from stencil-ranges.cc:66:
stencil_ranges.hpp: In function ‘void star1(int, prk::vector<double>&, prk::vector<double>&)’:
stencil_ranges.hpp:2:16: error: ‘ranges’ has not been declared
```
