
GPU support, rotation-based recon, MSVC support

@carterbox released this 23 May 17:33
  • Version is now pulled from git tags instead of from the VERSION file
  • dxchange is no longer a required dependency
  • CMake build system to handle addition of code in two new languages: C++ and CUDA
  • Python bindings to C++/CUDA code still go through the C interface (i.e. no direct binding)
  • SIRT and MLEM have been implemented on the GPU and CPU using a rotation-based algorithm (see the conceptual sketch after this list)
    • GPU support has been validated for Windows and Linux
    • CPU version uses OpenCV for rotation
      • OpenCV distributed via conda + MinGW on Windows does not work; use the MSVC compiler on Windows.
    • GPU version uses NPP for rotation
    • benchmarking on NVIDIA P100: ~11x slower than gridrec but vastly improved reconstruction quality
    • benchmarking on NVIDIA V100: per-slice speed-up over ray-based algorithm is ~650x, e.g. a TomoBank reconstruction (2048p + 1,500 proj angles) formerly requiring ~6.5 hours is completed in ~40 seconds
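
The rotation-based approach replaces per-ray tracing with whole-image rotations: the slice is rotated so that each projection axis aligns with the image grid, line integrals become simple column sums, and the residual is smeared back and counter-rotated. The NumPy/SciPy sketch below illustrates one SIRT-style update built on that idea; it is a conceptual illustration only, not TomoPy's C++/CUDA implementation, and scipy.ndimage.rotate stands in for the OpenCV (CPU) and NPP (GPU) rotation routines.

```python
# Conceptual sketch of a rotation-based SIRT update for a single slice.
# Not TomoPy's implementation; assumes parallel-beam geometry.
import numpy as np
from scipy.ndimage import rotate

def sirt_step(image, sinogram, angles_deg, step=1.0):
    """One rotation-based SIRT update.

    image      : (N, N) current estimate of the slice
    sinogram   : (num_angles, N) measured projections
    angles_deg : projection angles in degrees
    """
    update = np.zeros_like(image)
    for sino_row, angle in zip(sinogram, angles_deg):
        # Forward project: rotate the slice so the projection axis is vertical,
        # then sum along columns to approximate the parallel-beam line integrals.
        rotated = rotate(image, angle, reshape=False, order=1)
        simulated = rotated.sum(axis=0)

        # Residual between measurement and simulation, spread back evenly
        # along each column (backprojection of a parallel-beam residual).
        residual = (sino_row - simulated) / image.shape[0]
        smeared = np.tile(residual, (image.shape[0], 1))

        # Rotate the backprojected residual back to the image frame and accumulate.
        update += rotate(smeared, -angle, reshape=False, order=1)

    return image + step * update / len(angles_deg)
```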
  • Support for Microsoft Visual C++ (MSVC) compiler
    • Implemented gridrec in C++ (uses std::complex) which is enabled by default on Windows
  • To enable the new algorithms, pass accelerated=True to tomopy.recon for SIRT and MLEM (see the example below)
    • Other options are available, but unless you explicitly understand their effects, use the defaults.
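
As a concrete illustration, the call below reconstructs a simulated phantom with the accelerated SIRT implementation. The algorithm, accelerated, and num_iter keywords follow the usage described in these notes; treat the specific values as placeholders and check the tomopy.recon docstring for your installed version.

```python
import tomopy

# Simulated data so the example is self-contained.
obj = tomopy.shepp3d(size=128)     # 3D Shepp-Logan phantom
theta = tomopy.angles(180)         # projection angles in radians
proj = tomopy.project(obj, theta)  # simulate projections

recon = tomopy.recon(
    proj,
    theta,
    algorithm='sirt',   # or 'mlem'
    accelerated=True,   # opt in to the new rotation-based GPU/CPU code path
    num_iter=50,
)
```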
  • Multi-GPU support is available
    • Automatic detection of number of available devices
    • Multiple threads started at the Python level are automatically spread across the available GPUs
  • Secondary thread-pools created in C++ code to provide highly efficient communication with the GPU and additional parallelism on the CPU.
    • When running on the GPU, set the ncore parameter of tomopy.recon to the number of available GPUs (see the example at the end of these notes).
    • Each "Python" thread creates a unique secondary thread-pool with a default size of 2 * number-of-cpus. This is intentional and, in general, the larger the secondary thread-pool, the more efficiently the CPU-GPU communication latency is hidden. However, in general, more than 24 threads per thread-pool provides no benefit (all latency is essentially hidden at that point)