Increase efficiency of CAG construction

@sminot sminot released this 05 Aug 21:32
· 44 commits to master since this release

This release refactors the CAG construction process to make it much more efficient. The changes made toward that goal are:

  • Saving all gene abundances in Zarr format to speed up the process of reading subsets in each shard
  • Grouping genes into CAGs before constructing each DataFrame to reduce the total memory burden
  • Increasing the number of shards used to initiate the CAG construction by 10X
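As a rough illustration of the sharding change, the sketch below splits a set of gene indices into contiguous, near-equal shards and compares a 10-shard layout with a 100-shard layout. The function name `make_shards` and the gene and shard counts are hypothetical and are not taken from the pipeline itself; the point is only that 10X more shards means each shard covers about one tenth as many genes, which is the kind of subset a chunked store such as Zarr can read without loading the full abundance table.

```python
def make_shards(n_genes, n_shards):
    """Split gene indices 0..n_genes-1 into n_shards contiguous ranges.

    Shard sizes differ by at most one gene. Both arguments are
    illustrative; the real pipeline may assign genes differently.
    """
    base, extra = divmod(n_genes, n_shards)
    shards = []
    start = 0
    for i in range(n_shards):
        # The first `extra` shards take one additional gene each.
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# Hypothetical numbers: 1,000 genes, before and after a 10X shard increase.
old_shards = make_shards(1000, 10)
new_shards = make_shards(1000, 100)

print(len(old_shards[0]), len(new_shards[0]))
```

Each shard then only needs the rows for its own range of genes, so per-shard memory shrinks as the shard count grows.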

This release was tested on a real-world dataset, where the CAGs it generated closely matched the results of the previous release. In addition, the initial step of CAG creation could be spread in parallel over ~10X more nodes, which reduced the total time to answer while keeping the total compute time constant. In other words, I expect the cost of analysis to be unchanged with this release, but the results should be available more quickly.
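The relationship between compute time and time to answer can be sketched with simple arithmetic. The numbers below are made up for illustration, and the calculation assumes the initial step parallelizes perfectly across nodes, which a real run will only approximate.

```python
def wall_clock_hours(total_node_hours, n_nodes):
    """Elapsed time if the work divides evenly across n_nodes (an idealization)."""
    return total_node_hours / n_nodes


# Hypothetical workload for the initial step: 100 node-hours in total.
total_node_hours = 100

before = wall_clock_hours(total_node_hours, 10)   # fewer, larger shards
after = wall_clock_hours(total_node_hours, 100)   # 10X more nodes

# Cost scales with node-hours, which is the same in both cases.
print(before, after, total_node_hours)
```

Under these assumptions the elapsed time for that step drops by 10X while the billed node-hours stay the same.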