
Scalability issues #75

Open
csa7fff opened this issue Feb 18, 2015 · 1 comment

csa7fff commented Feb 18, 2015

This is a meta-issue to collect all information relevant to scalability problems in the UFO framework and its plugins.

  1. On the NVIDIA platform, the kernel execution penalty depends on the number of GPUs in the OpenCL context. While the normal penalty is about 16-20 us, it reaches ~100 us with 6 GPUs (ipepdvcompute2) and ~200 us with 9 (ipepdvcompute1). This affects filters with a large number of kernel launches. For instance, SART executes 3 kernels for each projection at each iteration and does not scale beyond the 2nd device on ipepdvcompute2. The AMD platform is not affected. If an individual OpenCL context is used for each device, there is only a marginal growth of execution time. A small test for kernel launch penalties (cl_launch.c) is available from bzr+ssh://ufo.kit.edu/opencl/tools; a minimal version of such a measurement is also sketched after this list.
  2. Another problem affecting the UfoIr filters is the way ufo-basic-ops operate. As I understand from Andrey Shkarin's explanations, originally the kernel was compiled on each execution, which introduced huge latency. Then Matthias Vogelgesang implemented caching. However, even if multiple GPUs are used, the same kernel object is always returned. As a result, the current UfoIr implementation uses mutexes and an operation cannot be executed on multiple devices in parallel. This currently harms the performance of the SIRT implementation. I guess the caching should be done on a per-command-queue or per-thread basis (a rough sketch follows after this list).
  3. For high-speed reconstruction filters like DFI, PCIe transfer becomes an issue, especially if external PCIe enclosures sharing a single x16 link between multiple GPUs are used. As far as I can see, UFO buffers currently only support a synchronous API. I think we should provide an alternative API for asynchronous I/O. Moreover, on the NVIDIA platform it is possible to overlap memory transfers with kernel execution. This is achieved if pinned (page-locked) host memory is used: the buffer is created with clCreateBuffer using the CL_MEM_ALLOC_HOST_PTR flag and then mapped with clEnqueueMapBuffer (see the last sketch below). I guess this should also be supported in UFO buffers.
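
For reference, below is a minimal sketch of how such a launch-penalty measurement could look, along the lines of the cl_launch.c test mentioned in point 1. The empty kernel and the timing loop are illustrative; calling clFinish after every launch measures launch plus completion time rather than pure submission overhead.

```c
#include <stdio.h>
#include <glib.h>
#include <CL/cl.h>

static const char *source = "__kernel void nop (void) {}";

int
main (void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint n_devices;

    clGetPlatformIDs (1, &platform, NULL);
    clGetDeviceIDs (platform, CL_DEVICE_TYPE_GPU, 16, devices, &n_devices);

    /* One context spanning *all* devices, i.e. the configuration in which the
     * launch penalty grows with the number of NVIDIA GPUs. */
    cl_context context = clCreateContext (NULL, n_devices, devices, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue (context, devices[0], 0, NULL);

    cl_program program = clCreateProgramWithSource (context, 1, &source, NULL, NULL);
    clBuildProgram (program, n_devices, devices, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel (program, "nop", NULL);

    const size_t global_size = 1;
    const int iterations = 10000;
    gint64 start = g_get_monotonic_time ();

    for (int i = 0; i < iterations; i++) {
        clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish (queue);
    }

    gint64 elapsed = g_get_monotonic_time () - start;
    printf ("average launch + completion time: %.2f us\n", (double) elapsed / iterations);
    return 0;
}
```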
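
For point 2, this is a hypothetical sketch of what per-command-queue kernel caching could look like; the function and table names are made up for illustration and are not the actual ufo-basic-ops API. The lock only guards the cache lookup, so kernels obtained for different queues can be used in parallel without serializing execution.

```c
#include <glib.h>
#include <CL/cl.h>

/* One table per basic op (a real implementation might key on a
 * (queue, kernel name) pair instead); maps cl_command_queue -> cl_kernel. */
static GHashTable *kernel_cache = NULL;
static GMutex cache_lock;

static cl_kernel
get_cached_kernel (cl_command_queue queue, cl_program program, const char *name)
{
    cl_kernel kernel;

    g_mutex_lock (&cache_lock);

    if (kernel_cache == NULL)
        kernel_cache = g_hash_table_new (g_direct_hash, g_direct_equal);

    kernel = g_hash_table_lookup (kernel_cache, queue);

    if (kernel == NULL) {
        /* Each queue (i.e. each device/thread) gets its own cl_kernel object,
         * so argument setting and enqueueing no longer have to be serialized
         * across devices. */
        kernel = clCreateKernel (program, name, NULL);
        g_hash_table_insert (kernel_cache, queue, kernel);
    }

    g_mutex_unlock (&cache_lock);
    return kernel;
}
```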
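
And for point 3, a rough sketch of the pinned-memory pattern (CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer) as it might be used for a staging area inside UFO buffers; error handling is omitted and the helper names are illustrative.

```c
#include <CL/cl.h>

/* Allocate a pinned host-side staging area and return a mapped host pointer
 * to it. On NVIDIA, CL_MEM_ALLOC_HOST_PTR yields page-locked memory that
 * allows DMA transfers to overlap with kernel execution. */
static void *
alloc_pinned_staging_area (cl_context context, cl_command_queue queue,
                           size_t size, cl_mem *pinned_mem)
{
    *pinned_mem = clCreateBuffer (context,
                                  CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                  size, NULL, NULL);

    /* Mapping exposes the pinned allocation as an ordinary host pointer. */
    return clEnqueueMapBuffer (queue, *pinned_mem, CL_TRUE,
                               CL_MAP_READ | CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, NULL);
}

/* Fill the pinned region on the host, then copy it to a device buffer with a
 * non-blocking write so the transfer can overlap with kernel execution. */
static void
upload_async (cl_command_queue queue, cl_mem device_mem, const void *pinned_ptr,
              size_t size, cl_event *done)
{
    clEnqueueWriteBuffer (queue, device_mem, CL_FALSE, 0, size, pinned_ptr,
                          0, NULL, done);
}
```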
matze changed the title from "Scallability issues" to "Scalability issues" on Feb 19, 2015

tfarago commented Mar 21, 2015

Pinned memory would be nice. Disabling the double-buffered mode could also be beneficial in some situations.
