
Seems OpenCL is faster than CUDA #239

Open

daylight-00 opened this issue Aug 23, 2023 · 5 comments

Comments


daylight-00 commented Aug 23, 2023

I compared the time required per ligand and the time required by the whole job, using AutoDock-GPU built with CUDA and OpenCL respectively, and CUDA consistently took about 30-40% more time than OpenCL. However, the description in the repository says CUDA is faster than OpenCL, contrary to my results. I have obtained similar results under different conditions on the same system, but I have not tried other systems, so I hope others can check whether CUDA is really faster than OpenCL.

System:

  • AMD EPYC 7542 32-Core Processor
  • NVIDIA GeForce RTX 3090 × 1
  • CUDA Toolkit 11.8 (Conda Package)

Docking:

  • ligand batch size: 10K
  • nrun: 10
  • iterations of the entire job: 10
  • random seed: 100

[attached image: benchmark output]

@daylight-00 daylight-00 changed the title OpenCL is faster than CUDA Seems OpenCL is faster than CUDA Aug 23, 2023
@atillack
Collaborator

@daylight-00 Thank you, and yes, OpenCL is about 5-15% faster in our own testing on the same hardware (RTX A5000). Newer versions should narrow the gap a little bit (Cuda now requests a smaller chunk of memory, similar to OpenCL, based on the actual memory needed rather than the maximums), so if this isn't the current develop branch, it may be worthwhile to test again.

I suspect the remaining difference may be caused by shared-memory variables being pre-allocated at compile time (OpenCL) versus dynamically allocated at runtime (Cuda), as otherwise the Cuda and OpenCL paths use exactly the same algorithms and, as much as possible, even the same implementations ...
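For illustration, here is a minimal sketch of the two shared-memory styles (hypothetical code, not AutoDock-GPU's actual kernels; the block size is assumed):

```cuda
// Hypothetical example of the two shared-memory styles, not AutoDock-GPU code.
#define WORKGROUP_SIZE 64  // assumed compile-time block size

// OpenCL-like path: the buffer size is fixed when the kernel is compiled,
// so the compiler knows the exact shared-memory footprint up front.
__global__ void kernel_static(float* out)
{
    __shared__ float scratch[WORKGROUP_SIZE];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = scratch[threadIdx.x];
}

// Cuda path: one unsized extern buffer whose size is only known at launch.
__global__ void kernel_dynamic(float* out)
{
    extern __shared__ float scratch[];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = scratch[threadIdx.x];
}

// The third launch parameter supplies the dynamic shared-memory size,
// which the compiler cannot plan for ahead of time:
//   kernel_static <<<1, WORKGROUP_SIZE>>>(d_out);
//   kernel_dynamic<<<1, WORKGROUP_SIZE, WORKGROUP_SIZE * sizeof(float)>>>(d_out);
```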

Since OpenCL exists on Nvidia and many more devices (all the way to Android), that's ultimately good news though :-)

@atillack
Collaborator

Found the culprit: it looks like I wrote that Cuda was faster about 3 years ago in our README.md. That was probably true at the time, before I merged the integer gradient from Cuda into OpenCL as well. I'll fix README.md by taking this sentence out.

@daylight-00
Author

daylight-00 commented Aug 23, 2023

Thank you for your answer. I did use the develop branch, though.
I'm using AutoDock-GPU on a cluster with varying types and numbers of GPUs (A5000, A6000, 3090, ...), and I wonder if there could be a problem when I run it on a different node than the one I compiled on.

@atillack
Collaborator

For Cuda this should only be an issue if you were to compile with the wrong architecture(s) - for the 3090/A5000/A6000 you want to compile with TARGETS="86".
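If it helps on a heterogeneous cluster, here is a small hypothetical check (plain CUDA runtime API calls, not part of AutoDock-GPU) to confirm a node's GPUs match the compiled TARGETS:

```cuda
// check_sm.cu - print each visible GPU's compute capability; for the
// 3090/A5000/A6000 this should report SM 8.6, matching TARGETS="86".
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s (SM %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```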

One more thing: I would only compare overall runtimes on the same machine, as the kernel runtime performance timers, while placed at the same location in the code, may still contain different tasks depending on what Cuda and OpenCL do at kernel cleanup time.
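To make that concrete, a hypothetical sketch (not AutoDock-GPU's timer code) of why per-kernel event timers and overall wall-clock time can disagree:

```cuda
// Hypothetical timing sketch: event timers bracket only the kernel, while
// wall-clock time also absorbs allocation, transfers, and runtime cleanup -
// costs that can differ between the Cuda and OpenCL paths.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work() { /* stand-in for a docking kernel */ }

int main()
{
    auto wall_start = std::chrono::steady_clock::now();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work<<<1, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float kernel_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaDeviceReset();  // cleanup cost shows up in wall-clock time only

    auto wall_end = std::chrono::steady_clock::now();
    double wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();

    printf("kernel: %.3f ms, overall: %.3f ms\n", kernel_ms, wall_ms);
    return 0;
}
```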

@atillack
Collaborator

I just realized that PR #233 should close the Cuda performance gap a bit more, as it contains the code to allocate the same amount of memory as OpenCL ...
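The idea, roughly (hypothetical names, not the actual PR #233 code): size device allocations from the actual problem instead of compile-time maximums.

```cuda
// Hypothetical sketch of allocating by actual need rather than by maximum;
// the constant and function names here are made up for illustration.
#include <cuda_runtime.h>

#define MAX_NUM_OF_ATOMS 256  // assumed compile-time ceiling

void alloc_coords(float** d_coords, int num_atoms)
{
    // Before: always reserve space for the compile-time maximum.
    // cudaMalloc((void**)d_coords, MAX_NUM_OF_ATOMS * 3 * sizeof(float));

    // After: reserve only what this run actually needs, as OpenCL does.
    cudaMalloc((void**)d_coords, num_atoms * 3 * sizeof(float));
}
```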
