
Releases: Golob-Minot/geneshot

Increase memory efficiency of `refineCAGs`

21 Sep 16:33

The updates in this minor release focus primarily on memory optimizations in the refineCAGs step.

The changes I found most helpful for reducing the memory burden of this step were:

  • Updating the cag_membership table (which links genes to CAGs) with new CAG IDs by initializing an entirely new DataFrame: building a single dict is much more memory-efficient than concatenating a set of two-column DataFrames (see the sketch after this list)
  • Dropping the index from the input CAG membership table immediately after reading in each shard

This release contains only performance improvements over v0.8.2; no differences in the results are expected.
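
A minimal sketch of the dict-based update (all names and values here are illustrative, not the pipeline's actual variables):

```python
import pandas as pd

# Previous gene -> CAG assignments, and a remapping of old to refined CAG IDs
old_membership = pd.DataFrame({"gene": ["g1", "g2", "g3"], "CAG": [0, 0, 1]})
cag_remapping = {0: 5, 1: 7}

# Build a single dict of gene -> new CAG ID, then construct the new table
# in one step, rather than concatenating many small two-column DataFrames
new_cag_ids = {
    gene: cag_remapping[cag]
    for gene, cag in zip(old_membership["gene"], old_membership["CAG"])
}
new_membership = pd.DataFrame(
    {"gene": list(new_cag_ids.keys()), "CAG": list(new_cag_ids.values())}
)
```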

Release v0.8.2

15 Sep 18:38

This release contains commits which primarily fix a bug introduced in v0.8.1, and it should be read in the context of the enhancements added with that release.

The results generated with this release should be identical to those produced by v0.8.0, but they should be generated much more efficiently on large datasets.

It all started with zarr. This new file format reduced the memory required for steps like assembleAbundances and findCAGs by reading subsets of genes directly from disk without having to load the entire table at once. However, with further testing it became clear that this approach was introducing a large increase in time needed to run the findCAGs steps, since that zarr store had to be read by every single shard across every round.

The innovation implemented here reduces the time needed to run the findCAGs steps. This is accomplished by passing the CAG-level abundances from earlier rounds of CAG identification directly to the subsequent rounds. This approach eliminates the need to read from the zarr store after the initial round of CAG identification.
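
A minimal sketch of that hand-off (all names and values here are illustrative stand-ins for the pipeline's internals):

```python
import pandas as pd

# Illustrative gene-level abundances, indexed by gene ID
gene_abund = pd.DataFrame(
    {"sampleA": [1.0, 2.0, 3.0], "sampleB": [0.5, 0.5, 2.0]},
    index=["g1", "g2", "g3"],
)

def collapse(abund: pd.DataFrame, membership: dict) -> pd.DataFrame:
    # Sum the abundances of all members assigned to the same CAG
    return abund.groupby(membership).sum()

# Round 1 reads gene-level abundances (from the zarr store) once
cag_abund = collapse(gene_abund, {"g1": "cag-0", "g2": "cag-0", "g3": "cag-1"})

# Later rounds operate directly on the CAG-level table passed forward,
# so no shard needs to re-read the zarr store
cag_abund = collapse(cag_abund, {"cag-0": "cag-0", "cag-1": "cag-0"})
```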

Increase efficiency of CAG construction

05 Aug 21:32

In this release, the pipeline has been refactored to make the CAG construction process much more efficient. The changes implemented toward that goal are:

  • Saving all gene abundances in Zarr format to speed up the process of reading subsets in each shard (see the sketch after this list)
  • Grouping genes into CAGs before constructing each DataFrame to reduce the total memory burden
  • Increasing the number of shards used to initiate the CAG construction by 10X
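
A minimal sketch of that Zarr layout, assuming illustrative shapes, chunking, and file name (none of these are the pipeline's actual values):

```python
import numpy as np
import zarr

# Illustrative gene abundance matrix: genes (rows) x specimens (columns)
abund = np.random.rand(100_000, 10)

# Write the table chunked along the gene axis so that any subset of
# genes can be read back without touching the rest of the table
z = zarr.open(
    "gene_abundances.zarr",
    mode="w",
    shape=abund.shape,
    chunks=(10_000, abund.shape[1]),
    dtype=abund.dtype,
)
z[:] = abund

# Each shard then reads only its own slice of genes directly from disk
shard = zarr.open("gene_abundances.zarr", mode="r")[20_000:30_000, :]
```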

This release has been tested on a real-world dataset, and the set of CAGs generated by this updated version closely matched the results of the previous release. In addition, the initial step of CAG creation could be spread in parallel across ~10X more nodes, which reduced the total time-to-answer while keeping the total compute time constant. In other words, my expectation is that with this release the cost of analysis will be unchanged, but the results should be available more quickly.

Release v0.7.1

19 Jul 15:31
1954b12
Refactor assembleAbundances

After attempting to run the previous version of the pipeline on much larger datasets, it became clear that assembling a single table with all gene-level abundances is not computationally tractable.

The only impact of these changes on the output of the pipeline is that the /abund/gene/wide table is no longer present (nor is the matching feather file). In practice, the execution of the rest of the pipeline is unchanged.
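
Downstream code that previously read that table can guard against its absence. A minimal sketch, assuming a results store named geneshot.results.hdf5 (the key is the one named above; the file name is an assumption):

```python
import pandas as pd

with pd.HDFStore("geneshot.results.hdf5", "r") as store:
    if "/abund/gene/wide" in store:
        # Only present in outputs produced before v0.7.1
        wide = store["/abund/gene/wide"]
    else:
        print("No /abund/gene/wide table: results were built with v0.7.1+")
```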


Example output files for release v0.6.2

02 Jul 21:44

Although this release is tagged v0.6.3, it was created in order to capture the example output files generated by the previous release (v0.6.2).

Performance enhancements for corncob and betta

25 Jun 22:57
7a6d929

This minor release adds functionality without changing the content or format of the outputs (with the exception of adding the Wald statistic to the corncob and betta outputs).

  • Corncob is now run across multiple shards
  • Added the Wald statistic to the corncob and betta outputs
  • Set the maximum pickle protocol to 4 for backwards compatibility with Python < 3.8 (see the sketch after this list)
  • Added validation of the corncob output, including pickle protocol 4 and Python 3.7
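
A minimal sketch of the pickle pinning (the object being serialized here is a stand-in, not the pipeline's actual payload):

```python
import pickle

# Python 3.8 introduced pickle protocol 5; pinning the protocol to 4 keeps
# serialized results readable by workers running Python 3.7
results = {"CAG-1": {"estimate": 0.42, "wald": 1.7}}

with open("corncob.results.pickle", "wb") as handle:
    pickle.dump(results, handle, protocol=4)

# Reading back works identically on any Python >= 3.4
with open("corncob.results.pickle", "rb") as handle:
    assert pickle.load(handle) == results
```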

Run betta in parallel

22 Jun 22:13

A very minor change which runs betta in parallel across multiple processes and then combines the results for the final output. There is no change to parameters or outputs.
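
A minimal sketch of the fan-out / combine pattern (the worker here is a hypothetical stand-in; the actual statistics are computed by betta):

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List

import pandas as pd

def run_betta_on_group(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical worker: fit the meta-analysis model for one group.
    # This sketch shows only the parallelization pattern, not the model.
    ...
    return group

def run_betta_parallel(groups: List[pd.DataFrame]) -> pd.DataFrame:
    # Fan out one task per group across worker processes,
    # then combine the per-group results into the final output table
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_betta_on_group, groups))
    return pd.concat(results).reset_index(drop=True)
```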

Meta-analysis by gene annotation using betta

18 Jun 18:34
316ec36

After running corncob and annotating genes by taxon and/or eggNOG, the pipeline will now calculate the combined association of all CAGs which share a given annotation with each of the parameters of interest.

The resulting table is written to /stats/enrichment/betta and has the following columns:

  • annotation: species, genus, family, or eggNOG_desc are the current options
  • label: The value in the annotation column being analyzed in this row
  • parameter: The parameter from the corncob formula being analyzed in this row
  • estimate: The combined estimate across all CAGs in this group
  • std_error: The combined standard error of the estimate across all CAGs in this group
  • p_value: The p-value for the test of the null hypothesis that the true coefficient is zero
  • q_value: The FDR-adjusted p-value (using the method specified by --fdr_method, fdr_bh by default)

This new feature will run for any dataset for which --formula has been specified along with taxonomic and/or eggNOG annotation.
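
A minimal sketch of reading and filtering the new table, assuming a results store named geneshot.results.hdf5 and a hypothetical parameter name ("treatment"); the key and column names are those listed above:

```python
import pandas as pd

# Read the betta enrichment table from the results HDF5 store
betta = pd.read_hdf("geneshot.results.hdf5", "/stats/enrichment/betta")

# e.g. species-level labels significantly associated with one parameter
hits = betta.query(
    "annotation == 'species' and parameter == 'treatment' and q_value < 0.05"
).sort_values("estimate")
print(hits[["label", "estimate", "std_error", "q_value"]])
```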

Fix bug with small samples

11 Jun 21:58

In the edge case where a sample did not produce an assembly containing any protein-coding genes, the previous release would raise an error at the alignAlleles step. This minor release fixes that bug by skipping those samples for the purpose of annotating assemblies.
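
A minimal sketch of that guard, assuming each sample's predicted protein-coding genes are written to a FASTA file (all names here are hypothetical):

```python
from pathlib import Path
from typing import Dict

def samples_with_genes(fasta_paths: Dict[str, Path]) -> Dict[str, Path]:
    # Drop samples whose gene FASTA is empty, so the alignment step
    # never receives an assembly with zero protein-coding genes
    return {
        sample: path
        for sample, path in fasta_paths.items()
        if path.stat().st_size > 0
    }
```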