The updates in this minor commit mostly focus on memory optimizations in the refineCAGs step.
The changes I found most helpful for reducing the memory burden of this step were:
- Updating the cag_membership table (linking genes and CAGs) with new CAG IDs is best done by initializing an entirely new DataFrame. Building a single dict is much better than concatenating a set of two-column DataFrames.
- The index can be dropped from the table with input CAG membership immediately after reading in each shard.
- This release contains only performance improvements over v0.8.2; no differences in the results are expected.
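
The two memory-related changes above can be illustrated with a minimal pandas sketch. This is not the actual refineCAGs code: the column names (`gene`, `CAG`), the in-memory `shards` list standing in for shards read from disk, and the `new_cag_id` mapping are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical input shards; in practice each would be read from a file.
# Column names "gene" and "CAG" are assumed for this example.
shards = [
    pd.DataFrame({"gene": ["g1", "g2"], "CAG": [0, 0]}, index=[10, 11]),
    pd.DataFrame({"gene": ["g3", "g4"], "CAG": [1, 2]}, index=[20, 21]),
]

# Hypothetical mapping from input CAG IDs to the new (refined) CAG IDs.
new_cag_id = {0: 5, 1: 5, 2: 6}

# Accumulate gene -> new CAG ID in one dict instead of collecting
# many two-column DataFrames and concatenating them at the end.
gene_to_cag = {}
for shard in shards:
    # Drop the index right after the shard is loaded; it is not needed.
    shard = shard.reset_index(drop=True)
    for gene, old_cag in zip(shard["gene"], shard["CAG"]):
        gene_to_cag[gene] = new_cag_id[old_cag]

# Initialize an entirely new membership table from the dict.
cag_membership = pd.DataFrame({
    "gene": list(gene_to_cag.keys()),
    "CAG": list(gene_to_cag.values()),
})
```

Only one dict and one final DataFrame are held at the end, rather than a list of intermediate DataFrames plus their concatenated copy.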