alpaka::getWarpSizes incurs a noticeable overhead #2192

Open
fwyzard opened this issue Nov 21, 2023 · 5 comments

Comments

@fwyzard
Contributor

fwyzard commented Nov 21, 2023

While porting the CMS pixel reconstruction from native CUDA to Alpaka, we noticed that calling alpaka::getWarpSizes(device) incurs a noticeable overhead.

See cms-sw/cmssw#43064 (comment) for the discussion.

A possible workaround is to cache the warp size in our code, instead of querying it for every event.

However, it would seem natural to cache this information within the Alpaka device objects, instead of querying the underlying back-end each time.
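A minimal sketch of the user-side workaround, assuming a hypothetical `MockDevice` in place of a real Alpaka device (its `warpSizes()` member stands in for `alpaka::getWarpSizes(device)`, and `DeviceCache` and `warpsPerBlock` are illustrative names, not part of Alpaka or CMSSW). The query is issued once at setup and the stored value is reused for every event:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an Alpaka device; counts how often the back-end is queried.
struct MockDevice {
    mutable int queries = 0;

    // Stands in for alpaka::getWarpSizes(device); the value 32 is made up.
    std::vector<std::size_t> warpSizes() const {
        ++queries;
        return {32};
    }
};

// Per-device state built once at setup time and reused for every event.
struct DeviceCache {
    explicit DeviceCache(MockDevice const& dev) : warpSizes(dev.warpSizes()) {}
    std::vector<std::size_t> warpSizes;
};

// Hypothetical per-event helper: reads the cached value, no back-end query.
std::size_t warpsPerBlock(DeviceCache const& cache, std::size_t threadsPerBlock) {
    std::size_t const warp = cache.warpSizes.front();
    return (threadsPerBlock + warp - 1) / warp;  // round up
}
```

With this layout the per-event code only touches `DeviceCache`, so the cost of the query is paid once per device rather than once per event.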

@fwyzard
Contributor Author

fwyzard commented Nov 21, 2023

I think that caching the warp sizes inside the device object would require

  • either filling it at construction time
  • or using a mutex to avoid setting the cache concurrently
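The two options can be sketched roughly as follows. This is not Alpaka code: `EagerDevice`, `LazyDevice`, `queryBackendWarpSizes` and the `backendQueries` counter are hypothetical names, and the returned warp size is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical back-end query; the counter only makes the number of calls visible.
int backendQueries = 0;

std::vector<std::size_t> queryBackendWarpSizes() {
    ++backendQueries;
    return {32};  // placeholder value
}

// Option 1: fill the cache at construction time; later reads need no locking.
class EagerDevice {
public:
    EagerDevice() : m_warpSizes(queryBackendWarpSizes()) {}

    std::vector<std::size_t> const& warpSizes() const { return m_warpSizes; }

private:
    std::vector<std::size_t> m_warpSizes;
};

// Option 2: fill the cache on first use; a mutex keeps concurrent callers
// from setting it at the same time.
class LazyDevice {
public:
    std::vector<std::size_t> warpSizes() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_filled) {
            m_warpSizes = queryBackendWarpSizes();
            m_filled = true;
        }
        return m_warpSizes;  // copied while the lock is held
    }

private:
    mutable std::mutex m_mutex;
    mutable bool m_filled = false;
    mutable std::vector<std::size_t> m_warpSizes;
};
```

In this sketch the lazy variant takes the lock on every read and makes the device object non-copyable (because of the `std::mutex` member), while the eager variant pays for the query even if the warp size is never requested.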

@psychocoderHPC
Member

IMO caching makes sense. We should store the value during device creation; then there will be no need for a mutex.

@bernhardmgruber
Member

Is there a CUDA device with a warpSize other than 32? I am almost in favor of hardcoding it ... Otherwise, we could just collect and cache the entire device properties (i.e. cudaDeviceProp), so we can also serve other values faster.
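A rough sketch of caching the whole property set, under stated assumptions: `DeviceProps` is a hypothetical subset that borrows three field names from `cudaDeviceProp`, `queryProps` stands in for the real back-end call (in a CUDA back-end this would presumably be `cudaGetDeviceProperties`), and the values it returns are made up.

```cpp
#include <cassert>

// Hypothetical subset of cudaDeviceProp; the real struct has many more fields.
struct DeviceProps {
    int warpSize;
    int multiProcessorCount;
    int maxThreadsPerBlock;
};

// Counter used only to show how often the back-end is queried.
int propQueries = 0;

// Stand-in for the back-end query; all returned values are placeholders.
DeviceProps queryProps(int /*deviceIndex*/) {
    ++propQueries;
    return DeviceProps{32, 80, 1024};
}

// Device object that queries all properties once, at construction.
class CachedDevice {
public:
    explicit CachedDevice(int index) : m_index(index), m_props(queryProps(index)) {}

    int index() const { return m_index; }
    int warpSize() const { return m_props.warpSize; }
    int multiProcessorCount() const { return m_props.multiProcessorCount; }
    int maxThreadsPerBlock() const { return m_props.maxThreadsPerBlock; }

private:
    int m_index;
    DeviceProps m_props;  // never modified after construction
};
```

Every accessor then reads from the stored copy, so serving additional properties costs no extra back-end calls.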

@fwyzard
Contributor Author

fwyzard commented Nov 21, 2023

Not that I know of.

But HIP devices can have a warp size of 32 or 64, depending on the GPU model and potentially on the environment settings.

@psychocoderHPC
Member

Partly solved by #2246. Nevertheless, we should cache all device properties that are constant over the runtime within the device; then there is no need to query the API multiple times.


3 participants