
Releases: Golob-Minot/geneshot

Increase memory efficiency of `refineCAGs`

21 Sep 16:33

The updates in this minor release focus primarily on memory optimizations in the refineCAGs step.

The changes I found most helpful for reducing the memory burden of this step were:

  • Updating the cag_membership table (which links genes to CAGs) with new CAG IDs by initializing an entirely new DataFrame: building a single dict is much more memory-efficient than concatenating a set of two-column DataFrames (see the sketch after this list)
  • Dropping the index from the input CAG membership table immediately after reading in each shard

This release contains only performance improvements over v0.8.2; no differences in the results are expected.
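
A minimal sketch of the dict-based update (all names and values here are illustrative, not the pipeline's actual variables):

```python
import pandas as pd

# Previous gene -> CAG assignments, and a remapping of old to refined CAG IDs
old_membership = pd.DataFrame({"gene": ["g1", "g2", "g3"], "CAG": [0, 0, 1]})
cag_remapping = {0: 5, 1: 7}

# Build a single dict of gene -> new CAG ID, then construct the new table
# in one step, rather than concatenating many small two-column DataFrames
new_cag_ids = {
    gene: cag_remapping[cag]
    for gene, cag in zip(old_membership["gene"], old_membership["CAG"])
}
new_membership = pd.DataFrame(
    {"gene": list(new_cag_ids.keys()), "CAG": list(new_cag_ids.values())}
)
```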

Release v0.8.2

15 Sep 18:38

This release contains commits which primarily fix a bug introduced in v0.8.1, and it should be read in the context of the enhancements added with that release.

The results generated with this release should be identical to those produced by v0.8.0, but they should be generated much more efficiently on large datasets.

It all started with zarr. This new file format reduced the memory required for steps like assembleAbundances and findCAGs by reading subsets of genes directly from disk without having to load the entire table at once. However, with further testing it became clear that this approach was introducing a large increase in time needed to run the findCAGs steps, since that zarr store had to be read by every single shard across every round.

The innovation implemented here reduces the time needed to run the findCAGs steps. This is accomplished by passing the CAG-level abundances from earlier rounds of CAG identification directly to the subsequent rounds. This approach eliminates the need to read from the zarr store after the initial round of CAG identification.
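
A minimal sketch of that hand-off (all names and values here are illustrative stand-ins for the pipeline's internals):

```python
import pandas as pd

# Illustrative gene-level abundances, indexed by gene ID
gene_abund = pd.DataFrame(
    {"sampleA": [1.0, 2.0, 3.0], "sampleB": [0.5, 0.5, 2.0]},
    index=["g1", "g2", "g3"],
)

def collapse(abund: pd.DataFrame, membership: dict) -> pd.DataFrame:
    # Sum the abundances of all members assigned to the same CAG
    return abund.groupby(membership).sum()

# Round 1 reads gene-level abundances (from the zarr store) once
cag_abund = collapse(gene_abund, {"g1": "cag-0", "g2": "cag-0", "g3": "cag-1"})

# Later rounds operate directly on the CAG-level table passed forward,
# so no shard needs to re-read the zarr store
cag_abund = collapse(cag_abund, {"cag-0": "cag-0", "cag-1": "cag-0"})
```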

Increase efficiency of CAG construction

05 Aug 21:32

In this release, the pipeline has been refactored to make the CAG construction process much more efficient. The changes implemented toward that goal are:

  • Saving all gene abundances in Zarr format to speed up the process of reading subsets in each shard (see the sketch after this list)
  • Grouping genes into CAGs before constructing each DataFrame to reduce the total memory burden
  • Increasing the number of shards used to initiate the CAG construction by 10X
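
A minimal sketch of that Zarr layout, assuming illustrative shapes, chunking, and file name (none of these are the pipeline's actual values):

```python
import numpy as np
import zarr

# Illustrative gene abundance matrix: genes (rows) x specimens (columns)
abund = np.random.rand(100_000, 10)

# Write the table chunked along the gene axis so that any subset of
# genes can be read back without touching the rest of the table
z = zarr.open(
    "gene_abundances.zarr",
    mode="w",
    shape=abund.shape,
    chunks=(10_000, abund.shape[1]),
    dtype=abund.dtype,
)
z[:] = abund

# Each shard then reads only its own slice of genes directly from disk
shard = zarr.open("gene_abundances.zarr", mode="r")[20_000:30_000, :]
```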

This release has been tested on a real-world dataset, and the set of CAGs generated by this updated version closely matched the results of the previous release. In addition, the initial step of CAG creation could be spread in parallel across ~10X more nodes, which reduced the total time-to-answer while keeping the total compute time constant. In other words, my expectation is that with this release the cost of analysis will be unchanged, but the results should be available more quickly.

Release v0.7.1

19 Jul 15:31
1954b12
Refactor assembleAbundances

After attempting to run the previous version of the pipeline on much larger datasets, it became clear that assembling a single table with all gene-level abundances is not computationally tractable.

The only impact of these changes on the output of the pipeline is that the /abund/gene/wide table is no longer present (nor is the matching feather file). In practice, the execution of the rest of the pipeline is unchanged.
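
Downstream code that previously read that table can guard against its absence. A minimal sketch, assuming a results store named geneshot.results.hdf5 (the key is the one named above; the file name is an assumption):

```python
import pandas as pd

with pd.HDFStore("geneshot.results.hdf5", "r") as store:
    if "/abund/gene/wide" in store:
        # Only present in outputs produced before v0.7.1
        wide = store["/abund/gene/wide"]
    else:
        print("No /abund/gene/wide table: results were built with v0.7.1+")
```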


Example output files for release v0.6.2

02 Jul 21:44

Although this release is tagged v0.6.3, it was created in order to capture the example output files generated by the previous release (v0.6.2).

Performance enhancements for corncob and betta

25 Jun 22:57
7a6d929

This minor release adds functionality without changing the content or format of the outputs (with the exception of adding the Wald statistic to the corncob and betta outputs).

  • Corncob is now run across multiple shards
  • Added the Wald statistic to the corncob and betta outputs
  • Set the maximum pickle protocol to 4 for backwards compatibility with Python < 3.8 (see the sketch after this list)
  • Added validation of the corncob output, including pickle protocol 4 and Python 3.7
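
A minimal sketch of the pickle pinning (the object being serialized here is a stand-in, not the pipeline's actual payload):

```python
import pickle

# Python 3.8 introduced pickle protocol 5; pinning the protocol to 4 keeps
# serialized results readable by workers running Python 3.7
results = {"CAG-1": {"estimate": 0.42, "wald": 1.7}}

with open("corncob.results.pickle", "wb") as handle:
    pickle.dump(results, handle, protocol=4)

# Reading back works identically on any Python >= 3.4
with open("corncob.results.pickle", "rb") as handle:
    assert pickle.load(handle) == results
```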

Run betta in parallel

22 Jun 22:13

A very minor change which runs betta in parallel across multiple processes and then combines the results for the final output. There is no change to parameters or outputs.
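
A minimal sketch of the fan-out / combine pattern (the worker here is a hypothetical stand-in; the actual statistics are computed by betta):

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List

import pandas as pd

def run_betta_on_group(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical worker: fit the meta-analysis model for one group.
    # This sketch shows only the parallelization pattern, not the model.
    ...
    return group

def run_betta_parallel(groups: List[pd.DataFrame]) -> pd.DataFrame:
    # Fan out one task per group across worker processes,
    # then combine the per-group results into the final output table
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_betta_on_group, groups))
    return pd.concat(results).reset_index(drop=True)
```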

Meta-analysis by gene annotation using betta

18 Jun 18:34
316ec36

After running corncob and annotating genes by taxon and/or eggNOG, the pipeline will now calculate the combined association of all CAGs which share a given annotation with each of the parameters of interest.

The resulting table is written to /stats/enrichment/betta and has the following columns:

  • annotation: species, genus, family, or eggNOG_desc are the current options
  • label: The value in the annotation column being analyzed in this row
  • parameter: The parameter from the corncob formula being analyzed in this row
  • estimate: The combined estimate across all CAGs in this group
  • std_error: The combined standard error of the estimate across all CAGs in this group
  • p_value: The p-value for the test of the null hypothesis that the true coefficient is zero
  • q_value: The FDR-adjusted p-value (using the method specified by --fdr_method, fdr_bh by default)

This new feature will run for any dataset for which --formula has been specified along with taxonomic and/or eggNOG annotation.
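
A minimal sketch of reading and filtering the new table, assuming a results store named geneshot.results.hdf5 and a hypothetical parameter name ("treatment"); the key and column names are those listed above:

```python
import pandas as pd

# Read the betta enrichment table from the results HDF5 store
betta = pd.read_hdf("geneshot.results.hdf5", "/stats/enrichment/betta")

# e.g. species-level labels significantly associated with one parameter
hits = betta.query(
    "annotation == 'species' and parameter == 'treatment' and q_value < 0.05"
).sort_values("estimate")
print(hits[["label", "estimate", "std_error", "q_value"]])
```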

Fix bug with small samples

11 Jun 21:58

In the edge case where a sample did not produce an assembly containing any protein-coding genes, the previous release would raise an error at the alignAlleles step. This minor release fixes that bug by skipping those samples for the purpose of annotating assemblies.
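
A minimal sketch of that guard, assuming each sample's predicted protein-coding genes are written to a FASTA file (all names here are hypothetical):

```python
from pathlib import Path
from typing import Dict

def samples_with_genes(fasta_paths: Dict[str, Path]) -> Dict[str, Path]:
    # Drop samples whose gene FASTA is empty, so the alignment step
    # never receives an assembly with zero protein-coding genes
    return {
        sample: path
        for sample, path in fasta_paths.items()
        if path.stat().st_size > 0
    }
```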