
Scalability issues #75

Open
csa7fff opened this issue Feb 18, 2015 · 1 comment

csa7fff commented Feb 18, 2015

This is a meta-issue to collect all information relevant to scalability problems in the UFO framework and its plugins.

  1. On the NVIDIA platform, the kernel execution penalty depends on the number of GPUs in the OpenCL context. While the normal penalty is about 16-20 us, it reaches ~100 us with 6 GPUs (ipepdvcompute2) and ~200 us with 9 (ipepdvcompute1). This affects filters with a large number of kernel launches. For instance, SART executes 3 kernels for each projection at each iteration and does not scale beyond the 2nd device on ipepdvcompute2. The AMD platform is not affected. If an individual OpenCL context is used for each device, there is only a marginal growth of execution time. A small test for kernel launch penalties (cl_launch.c) is available from bzr+ssh://ufo.kit.edu/opencl/tools; a minimal version of such a measurement is also sketched after this list.
  2. Another problem affecting the UfoIr filters is the way ufo-basic-ops operate. As I understand from Andrey Shkarin's explanations, originally the kernel was compiled on each execution, which introduced huge latency. Then Matthias Vogelgesang implemented caching. However, even if multiple GPUs are used, the same kernel object is always returned. As a result, the current UfoIr implementation uses mutexes and an operation cannot be executed on multiple devices in parallel. This currently harms the performance of the SIRT implementation. I guess the caching should be done on a per-command-queue or per-thread basis (a rough sketch follows after this list).
  3. For high-speed reconstruction filters like DFI, PCIe transfer becomes an issue, especially if external PCIe enclosures sharing a single x16 link between multiple GPUs are used. As far as I can see, UFO buffers currently only support a synchronous API. I think we should provide an alternative API for asynchronous I/O. Moreover, on the NVIDIA platform it is possible to overlap memory transfers with kernel execution. This is achieved if pinned (page-locked) host memory is used: the buffer is created with clCreateBuffer using the CL_MEM_ALLOC_HOST_PTR flag and then mapped with clEnqueueMapBuffer (see the last sketch below). I guess this should also be supported in UFO buffers.
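
For reference, below is a minimal sketch of how such a launch-penalty measurement could look, along the lines of the cl_launch.c test mentioned in point 1. The empty kernel and the timing loop are illustrative; calling clFinish after every launch measures launch plus completion time rather than pure submission overhead.

```c
#include <stdio.h>
#include <glib.h>
#include <CL/cl.h>

static const char *source = "__kernel void nop (void) {}";

int
main (void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint n_devices;

    clGetPlatformIDs (1, &platform, NULL);
    clGetDeviceIDs (platform, CL_DEVICE_TYPE_GPU, 16, devices, &n_devices);

    /* One context spanning *all* devices, i.e. the configuration in which the
     * launch penalty grows with the number of NVIDIA GPUs. */
    cl_context context = clCreateContext (NULL, n_devices, devices, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue (context, devices[0], 0, NULL);

    cl_program program = clCreateProgramWithSource (context, 1, &source, NULL, NULL);
    clBuildProgram (program, n_devices, devices, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel (program, "nop", NULL);

    const size_t global_size = 1;
    const int iterations = 10000;
    gint64 start = g_get_monotonic_time ();

    for (int i = 0; i < iterations; i++) {
        clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish (queue);
    }

    gint64 elapsed = g_get_monotonic_time () - start;
    printf ("average launch + completion time: %.2f us\n", (double) elapsed / iterations);
    return 0;
}
```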
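
For point 2, this is a hypothetical sketch of what per-command-queue kernel caching could look like; the function and table names are made up for illustration and are not the actual ufo-basic-ops API. The lock only guards the cache lookup, so kernels obtained for different queues can be used in parallel without serializing execution.

```c
#include <glib.h>
#include <CL/cl.h>

/* One table per basic op (a real implementation might key on a
 * (queue, kernel name) pair instead); maps cl_command_queue -> cl_kernel. */
static GHashTable *kernel_cache = NULL;
static GMutex cache_lock;

static cl_kernel
get_cached_kernel (cl_command_queue queue, cl_program program, const char *name)
{
    cl_kernel kernel;

    g_mutex_lock (&cache_lock);

    if (kernel_cache == NULL)
        kernel_cache = g_hash_table_new (g_direct_hash, g_direct_equal);

    kernel = g_hash_table_lookup (kernel_cache, queue);

    if (kernel == NULL) {
        /* Each queue (i.e. each device/thread) gets its own cl_kernel object,
         * so argument setting and enqueueing no longer have to be serialized
         * across devices. */
        kernel = clCreateKernel (program, name, NULL);
        g_hash_table_insert (kernel_cache, queue, kernel);
    }

    g_mutex_unlock (&cache_lock);
    return kernel;
}
```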
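
And for point 3, a rough sketch of the pinned-memory pattern (CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer) as it might be used for a staging area inside UFO buffers; error handling is omitted and the helper names are illustrative.

```c
#include <CL/cl.h>

/* Allocate a pinned host-side staging area and return a mapped host pointer
 * to it. On NVIDIA, CL_MEM_ALLOC_HOST_PTR yields page-locked memory that
 * allows DMA transfers to overlap with kernel execution. */
static void *
alloc_pinned_staging_area (cl_context context, cl_command_queue queue,
                           size_t size, cl_mem *pinned_mem)
{
    *pinned_mem = clCreateBuffer (context,
                                  CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                  size, NULL, NULL);

    /* Mapping exposes the pinned allocation as an ordinary host pointer. */
    return clEnqueueMapBuffer (queue, *pinned_mem, CL_TRUE,
                               CL_MAP_READ | CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, NULL);
}

/* Fill the pinned region on the host, then copy it to a device buffer with a
 * non-blocking write so the transfer can overlap with kernel execution. */
static void
upload_async (cl_command_queue queue, cl_mem device_mem, const void *pinned_ptr,
              size_t size, cl_event *done)
{
    clEnqueueWriteBuffer (queue, device_mem, CL_FALSE, 0, size, pinned_ptr,
                          0, NULL, done);
}
```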
matze changed the title from "Scallability issues" to "Scalability issues" on Feb 19, 2015

tfarago commented Mar 21, 2015

Pinned memory would be nice. Disabling the double-buffered mode could also be beneficial in some situations.
