
CUDA stencil inefficiency (compared to SYCL) #590

Open · AtlantaPepsi opened this issue Sep 14, 2021 · 3 comments

AtlantaPepsi (Contributor) commented on Sep 14, 2021:

As observed here, the CUDA stencil operation appears to be much slower than DPC++ across all block sizes on NVIDIA devices. I also ran the problem (8000 grid points, 100 iterations) on a V100, and the results are as follows:

| Block size | CUDA Rate (MF/s) | SYCL Rate (MF/s) | CUDA Avg time (s) | SYCL Avg time (s) |
|-----------:|-----------------:|-----------------:|------------------:|------------------:|
| 1          | 12375.8          | 12367.4          | 0.180636          | 0.18076           |
| 2          | 49495.2          | 49439.3          | 0.0451665         | 0.0452175         |
| 4          | 197629           | 197099           | 0.0113117         | 0.0113421         |
| 8          | 487579           | 575118           | 0.00458494        | 0.00388707        |
| 16         | 571705           | 696173           | 0.00391027        | 0.00321116        |
| 32         | 478201           | 684394           | 0.00467486        | 0.00326643        |

Although the difference is not as pronounced as in @jeffhammond's results, which were obtained on a DGX-A100, CUDA is still quite a bit slower than SYCL on either platform. Here are the simple build commands I used:

```
# CUDA 11.1
nvcc -g -O3 -std=c++17 --gpu-architecture=sm_70 -D_X86INTRIN_H_INCLUDED stencil-cuda.cu -o stencil-cuda

# clang 13.0.0, intel/llvm commit f126512
clang++ -g -O3 -std=c++17 -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda-sycldevice stencil-sycl.cc -o stencil-sycl-oneapi
```

Upon a quick inspection with nvprof, there seems to be no additional overhead outside the two computational kernels (add and star).
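For anyone reproducing this, a profile along the lines of `nvprof ./stencil-cuda 100 8000` (assuming the usual PRK `<iterations> <grid size>` argument order) should give the same per-kernel breakdown.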

Furthermore, add and star roughly split the total avg runtime above evenly in CUDA, but not in SYCL (at least on V100). While star takes roughly the same time in both, add on CUDA is about 50% slower around the optimal block sizes (i.e. > 8). Given how simple the add kernel is, I reckon this slowdown probably shouldn't be attributed to problematic memory access patterns.
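For context, a minimal sketch of the shape of the two kernels (my paraphrase for illustration, not the exact PRK source) looks like this:

```cuda
// Sketch only, not the exact PRK source. "in" and "out" are n*n
// row-major grids, "w" is the (2r+1)*(2r+1) weight matrix, and r is
// the stencil radius.

__global__ void add(int n, double * in) {
    // Pointwise increment: one global load and one global store per element.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n) {
        in[i * n + j] += 1.0;
    }
}

__global__ void star(int n, int r, const double * w,
                     const double * in, double * out) {
    // Star-shaped stencil: 4r+1 neighbor loads per interior point.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= r && i < n - r && j >= r && j < n - r) {
        double t = w[r * (2 * r + 1) + r] * in[i * n + j];        // center
        for (int k = 1; k <= r; ++k) {
            t += w[(r - k) * (2 * r + 1) + r] * in[(i - k) * n + j]; // up
            t += w[(r + k) * (2 * r + 1) + r] * in[(i + k) * n + j]; // down
            t += w[r * (2 * r + 1) + (r - k)] * in[i * n + (j - k)]; // left
            t += w[r * (2 * r + 1) + (r + k)] * in[i * n + (j + k)]; // right
        }
        out[i * n + j] += t;
    }
}
```

Since add does close to the minimum possible work per element, a 50% gap there is more suggestive of launch configuration or code generation than of the stencil math itself.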

It'd be interesting to see whether a similar CUDA slowdown shows up in other kernels. For now I will be looking into the PTX binaries; perhaps I can spot the exact instructions that incur this slowdown on CUDA.
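For the CUDA binary, the PTX can be dumped with `cuobjdump -ptx stencil-cuda` or regenerated directly with `nvcc -ptx stencil-cuda.cu`; diffing that against whatever PTX the SYCL toolchain emits for the same kernels should help localize the difference.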

wangzy0327 commented:
@AtlantaPepsi Hello, I'm trying to run the Cxx11 CUDA and SYCL programs on an x86_64 machine, but I'm unfamiliar with the make.defs config. Could you give me a more precise make.defs config for CUDA and SYCL? Thank you very much!

AtlantaPepsi (Contributor, Author) commented:
Hi @wangzy0327, what do you mean by "more precise make.defs config"? Are you having a build error, or does the executable fail to produce similar results with the existing flags for cuda/oneapi?

wangzy0327 commented:
This is the build error; I have the libboost-dev package installed.

```
g++-11 -std=gnu++17 -pthread -O3 -mtune=native -ffast-math -Wall -Wno-ignored-attributes -Wno-deprecated-declarations -DPRKVERSION="2020" stencil-ranges.cc -DUSE_BOOST_IRANGE -I/usr/include/boost/ -DUSE_RANGES -o stencil-ranges
In file included from stencil-ranges.cc:66:
stencil_ranges.hpp: In function ‘void star1(int, prk::vector<double>&, prk::vector<double>&)’:
stencil_ranges.hpp:2:16: error: ‘ranges’ has not been declared
```
