Release 0.5.0


PtyPy 0.5 release notes

We're excited to bring you a new release with new engines, GPU acceleration and many smaller improvements.

Engine Updates

A new abstraction layer for most engines, plus new engines:

  • generalised projectional engine with derived engines DM, RAAR
  • generalised stochastic engine with derived engines EPIE, SDR

Engines based on global projections now all derive from a generalised base engine that can express most common projection algorithms with four scalar parameters; DM and RAAR are two such derived classes (see the sketch below). Similarly, algorithms based on a stochastic sequence of local projections (SDR, EPIE) now inherit from a common base engine.
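As a rough illustration of what such a parameterisation can look like (this is a sketch, not PtyPy's internal API, and the engine's actual parameter names may differ), error reduction and DM fall out of a single update rule for particular choices of four scalars:

    import numpy as np

    def generalised_projection_update(x, P_F, P_O, a, b, c, d):
        # One iteration of a hypothetical four-scalar family:
        #   x' = a*x + b*P_O(x) + c*P_F(d*P_O(x) + (1 - d)*x)
        # P_F enforces the measured Fourier magnitudes, P_O enforces the
        # overlap (probe/object factorisation) constraint.
        x_o = P_O(x)
        return a * x + b * x_o + c * P_F(d * x_o + (1.0 - d) * x)

    # Error reduction: x' = P_F(P_O(x))                    -> a=0, b=0,  c=1, d=1
    # Difference map:  x' = x - P_O(x) + P_F(2*P_O(x) - x) -> a=1, b=-1, c=1, d=2

    # Toy demo: find the complex number with magnitude 2 ("Fourier" constraint)
    # that is also real and non-negative ("overlap" constraint), i.e. 2+0j.
    P_F = lambda x: 2.0 * x / np.abs(x)
    P_O = lambda x: np.maximum(x.real, 0) + 0j
    x = np.complex128(1 + 1j)
    for _ in range(20):
        x = generalised_projection_update(x, P_F, P_O, a=1, b=-1, c=1, d=2)
    print(P_O(x))  # close to (2+0j)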

GPU acceleration

  • GPU acceleration for all major engines: DM, ML, EPIE, SDR, RAAR
  • accelerated engines need to be imported explicitly with
    import ptypy
    ptypy.load_gpu_engines('cuda')

We accelerated three engines (projectional, stochastic and ML) using the PyCUDA and Reikna libraries and a whole collection of custom kernels.

All GPU engines leverage a "streaming" model, which means that the primary location of all objects is in host (CPU) memory. Diffraction data arrays and all other arrays that scale linearly with the number of shifts/positions are segmented into blocks (of frames). The idea is that these blocks are moved on and off the device (GPU) during engine iteration if the GPU does not have enough memory to store all blocks. The number of frames per block can be adjusted with the new top-level [frames_per_block](https://ptycho.github.io/ptypy/rst/parameters.html#ptycho.frames_per_block) parameter. This parameter has little influence for smaller problem sizes, but needs to be adjusted if your GPU has too little memory to fit even a single block.

Each engine iteration cycles through all blocks; DM even needs to cycle once per projection. We therefore recommend making the block size small enough that at least a couple of blocks fit on the GPU, to hide the latency of data transfers. For best performance, we employ a mirror scheme: each cycle reverses the block order, which reduces host-to-device copies (and vice versa) to the absolute minimum. A configuration sketch follows below.
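As a hedged sketch of how this looks in a reconstruction script (the engine name 'DM_pycuda' and the exact tree layout are assumptions here; check which engine names are registered in your install):

    import ptypy
    from ptypy import utils as u

    ptypy.load_gpu_engines('cuda')          # make the accelerated engines available

    p = u.Param()
    p.frames_per_block = 20000              # shrink until a few blocks fit into
                                            # your GPU's memory
    p.scans = u.Param()
    # ... scan and data definitions as usual ...
    p.engines = u.Param()
    p.engines.engine00 = u.Param()
    p.engines.engine00.name = 'DM_pycuda'   # GPU difference map (assumed name)
    p.engines.engine00.numiter = 100

    P = ptypy.core.Ptycho(p, level=5)       # run the full reconstruction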

GPU engines work in parallel when each MPI rank takes one GPU. For sending data between ranks, PtyPy will in most cases perform a host copy first, or it will use whatever the underlying MPI implementation does for CUDA-aware MPI (only tested with OpenMPI). Unfortunately, this mapping of one rank per GPU will leave CPU cores idle if there are more cores on the system than GPUs.

Within a node, PtyPy can use NCCL (requires a CuPy install and setting the environment variable PTYPY_USE_NCCL=1) for passing data between ranks/GPUs.
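For example (a sketch; the variable has to be visible to every rank, so exporting it in the launch environment, e.g. via OpenMPI's mpirun -x PTYPY_USE_NCCL=1, works just as well):

    import os
    # Opt in to NCCL for intra-node GPU-to-GPU transfers (requires CuPy).
    # Set this before ptypy establishes its GPU communication.
    os.environ['PTYPY_USE_NCCL'] = '1'

    import ptypy
    ptypy.load_gpu_engines('cuda')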

Breaking changes

PtyScan classes (experiment) need to be imported explicitly

Most derived PtyScan classes (all those in the /experiment folder) now need to be imported explicitly. We took this step to separate user space more clearly from the base package and to avoid dependency creep from user-introduced code. At the beginning of your script, you now need to import your module explicitly or use one of the helper functions:

import ptypy
ptypy.load_ptyscan_module(module='')
ptypy.load_all_ptyscan_modules()

Any PtyScan-derived class in these modules that is decorated with the ptypy.experiment.register() function will now be included in the parameter tree and selectable by name, as in the sketch below.
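A minimal sketch of the new pattern (the module and class names here are hypothetical):

    # my_beamline.py -- a hypothetical user module
    from ptypy.core.data import PtyScan
    from ptypy.experiment import register

    @register()
    class MyBeamlineScan(PtyScan):
        """Custom loader, registered under the name 'MyBeamlineScan'."""

        def load(self, indices):
            # Return dictionaries {index: array} of raw frames, positions
            # and weights, as for any PtyScan subclass.
            raw, positions, weights = {}, {}, {}
            # ... read your file format here ...
            return raw, positions, weights

After importing this module (or loading it via the helpers above), the class is selectable by name in the parameter tree, e.g. p.scans.scan00.data.name = 'MyBeamlineScan'.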

If you prefer the old way of importing ptypy "fully loaded", just use

import ptypy
ptypy.load_all()

which attempts to load all optional PtyScan classes and all engines.

Other updates

  1. Code for utils.parallel.bcast_dict and gather_dict has been simplified and should be backwards compatible.
  2. The fourier_power_bound, which was previously calculated internally from the fourier_relax_factor, can now be set explicitly, and we recommend doing so from now on. The recommended value for the fourier_power_bound is 0.25 for Poisson statistics (see the supplementary information of this paper); a snippet follows after this list.
  3. Position correction now supports an alternative search scheme, namely a search along a fixed grid (also shown after this list). This scheme is more accurate than a stochastic search, and the overhead incurred by this brute-force search is acceptable for GPU engines.
  4. We switched to a pip install within a conda environment as the main supported way of installation.
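Hedged snippets for items 2 and 3 (parameter names follow the online parameter reference; the 'GridSearch' value in particular is an assumption to verify against your version):

    from ptypy import utils as u

    p_engine = u.Param()
    p_engine.name = 'DM'
    # Item 2: set the power bound explicitly instead of relying on
    # fourier_relax_factor; 0.25 is the recommended value for Poisson statistics.
    p_engine.fourier_power_bound = 0.25

    # Item 3: position correction with a fixed-grid (brute-force) search
    # instead of the stochastic one.
    p_engine.position_refinement = u.Param()
    p_engine.position_refinement.method = 'GridSearch'
    p_engine.position_refinement.amplitude = 5e-7   # search radius in metres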

Roadmap

  • Automatic adjustment of the block sizes.
  • Improve scaling behavior across multiple nodes and high frame counts.
  • Better support for live processing (on a continuous detector data stream).
  • More tests.
  • Branch cleaning.

Contributors

Thanks to everyone at Diamond Light Source whose efforts made this update possible.

  • Aaron Parsons
  • Bjoern Enders
  • Benedikt Daurer
  • Joerg Lotze