Master Issue - v0.9 #56

sminot · 2021-01-17T18:23:49Z

Now that there is a critical mass of minor improvements to be made, I am creating a master issue to organize all of the changes which will go into a new minor release. I am describing this as a minor release because we will not be breaking the syntax of the inputs. The changes to be made will certainly change the content of what is being output, but that should be restricted to either (a) improved quality of existing output objects or (b) the addition of new output objects which extend the functionality of the workflow.

The most substantive change to the workflow will be the implementation of a new CAG clustering approach which uses co-assembly information to boost the performance of co-abundance clustering.

I'd like to use this issue to discuss what features and fixes will be included in the release, as well as tracking which existing issues will be subsumed by this one.

cc @jgolob

sminot · 2021-01-17T18:32:53Z

Items:

Enhance the performance of co-abundance clustering using co-assembly information
Add extractCounts to the default workflow even when corncob is not run (#53)
Add rRNA prediction on contigs (#42)
Check for empty inputs (#28)
Add antiSMASH for synthetic pathway identification (#26)
Aggregate results in file objects which are amenable for interactive visualization
Add E-value threshold for taxonomic assignment
Add E-value threshold for eggNOG functional classification assignment
Update DSL-2 syntax for NXF_VER=20.10.0

sminot · 2021-01-17T18:44:53Z

On the topic of "Aggregate results in file objects which are amenable for interactive visualization," my first thought is to save a redis store. After reading this, I will first try RDB format.

The biggest question here is going to be the tradeoff between file format, file size, and file content. I'd like to include everything needed for one particular approach to visualization, but the file should be less than a gigabyte for a typical experiment, and it should be quick to read. This requirement is making me want to move away from HDF for the visualization portion, but the use of redis may need to be reassessed depending on how things go.

The smaller file size will also hopefully be supported by increasing the sensitivity of CAG clustering, which should reduce the overall number of CAGs.

The tables that I think would be needed for visualization are:

(per CAG) Size (number of genes)
(per CAG) Majority taxonomic assignment of genes in each CAG (for labeling purposes only) at each taxonomic rank
(per CAG) Estimated coefficients of association (and p-values) for each covariate
(per CAG) Number of genes assigned at each taxonomic rank
(per CAG) Number of genes assigned to each unique function (via eggNOG)
(per CAG) Relative proportion of gene copies which are assigned to this CAG across all specimens
(per CAG) Ordination layout for plotting on the basis of taxonomic classification spectra
(per dataset) User-provided metadata sheet (with R1/R2 removed, and all specimens deduplicated)
(per taxon) Set of CAGs which contain genes assigned to that taxon
(per taxon) Relative proportion of gene copies which are assigned to this taxon across all specimens
(per function) Set of CAGs which contain genes assigned to this eggNOG function
(per function) Relative proportion of gene copies which are assigned to this function across all specimens

sminot · 2021-01-25T23:18:20Z

The major update to co-abundance gene clustering (in 5eb1441) implements iterative agglomerative clustering of genes, using co-assembly information to prioritize the order in which genes are added. This approach also implements a very simple filtering mechanism to limit the set of clustered genes to those which are assembled into a contig with >=X depth which contains >=Y genes in total. These thresholds are set with --min_contig_depth (X) and --min_contig_size (Y).
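To make the filtering step concrete, here is a minimal sketch in Python. It is not the code from 5eb1441: the `Contig` record and the `filter_genes_by_contig` function are hypothetical names used for illustration, and only the two thresholds correspond to the parameters named above.

```python
from typing import Dict, List, NamedTuple, Set


class Contig(NamedTuple):
    """Hypothetical record for one assembled contig."""
    depth: float      # sequencing depth of the contig
    genes: List[str]  # identifiers of the genes found on the contig


def filter_genes_by_contig(
    contigs: Dict[str, Contig],
    min_contig_depth: float,
    min_contig_size: int,
) -> Set[str]:
    """Keep genes from contigs with >= min_contig_depth depth and >= min_contig_size genes."""
    keep: Set[str] = set()
    for contig in contigs.values():
        if contig.depth >= min_contig_depth and len(contig.genes) >= min_contig_size:
            keep.update(contig.genes)
    return keep


# Toy example: only contig_1 passes both thresholds.
example_contigs = {
    "contig_1": Contig(depth=12.5, genes=["g1", "g2", "g3"]),
    "contig_2": Contig(depth=2.0, genes=["g4", "g5", "g6"]),  # too shallow
    "contig_3": Contig(depth=30.0, genes=["g7"]),             # too few genes
}
kept = filter_genes_by_contig(example_contigs, min_contig_depth=5.0, min_contig_size=2)
print(sorted(kept))  # ['g1', 'g2', 'g3']
```

Only the genes returned by a filter like this would then be passed on to the clustering step.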

sminot · 2021-01-25T23:21:07Z

DSL-2 syntax implemented with f0b609e

sminot · 2021-01-25T23:43:24Z

I did some digging into antiSMASH for metabolite prediction, and it appears that this software suite may be more suitable for running in a separate workflow, and less suitable for running within geneshot.

On a practical level, the antiSMASH software is distributed as a Docker image which includes both the code and reference database, which is far too large for the default Docker partition on many batch computing systems (including AWS Batch) and may prove to be an extremely challenging reconfiguration for many users.

On a more theoretical level, the output of the antiSMASH software is highly oriented towards the human inspection of genomic loci (see their very nicely written documentation). This is extremely interesting and useful, but it is fundamentally distinct from the gene-oriented analysis performed by geneshot. In other words, the units of analysis for antiSMASH (being the operon or genomic region) are orthogonal to anything which geneshot provides at the moment.

On another theoretical note, the reliance of antiSMASH on long assembled contigs may be confounded by the highly fragmented assemblies which result from short-read metagenomics. Further advancements in either metabolomics analysis or in long-read sequencing may change this calculus in the future.

Planning for the future, I am more inclined to incorporate metabolomics analysis into geneshot when we can use tools which output (e.g.) the predicted abundance of metabolites for a single specimen.

Do you have thoughts on the addition of antiSMASH or other tools like it into geneshot, @jgolob?

sminot · 2021-02-01T16:10:45Z

One fairly major change that I'd like to implement relates to how we consider the association of organism abundances with the experimental design, with regard to taxonomic and functional annotations.

In the pre-v0.9 approach, the experimental design was used to estimate the association of CAG relative abundances with a user-provided formula. The connection with taxa or functions was then made with the betta approach, which considered the subset of CAGs which contained any genes with those annotations.

A more direct approach would be to analyze the taxonomic groups and the eggNOG functional annotation groups in exactly the same way as the CAGs, by running corncob on the readcounts summed over each group of genes which share the same annotation.
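The summing step can be sketched as follows. This is an illustration only, not the geneshot implementation: `sum_counts_by_group` and the dictionary layouts are assumptions, each gene is assumed to carry at most one label per grouping, and the corncob model itself (an R package) is not shown.

```python
from collections import defaultdict
from typing import Dict


def sum_counts_by_group(
    gene_counts: Dict[str, Dict[str, int]],
    gene_to_group: Dict[str, str],
) -> Dict[str, Dict[str, int]]:
    """Sum per-specimen readcounts over all genes that share a group label.

    gene_counts maps gene -> {specimen: readcount}.
    gene_to_group maps gene -> group label (a CAG, a taxon, or an eggNOG function).
    Genes without a label are skipped.
    """
    grouped: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for gene, per_specimen in gene_counts.items():
        group = gene_to_group.get(gene)
        if group is None:
            continue
        for specimen, count in per_specimen.items():
            grouped[group][specimen] += count
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {group: dict(counts) for group, counts in grouped.items()}


# Toy example with made-up counts and taxon labels; g4 has no annotation.
example_counts = {
    "g1": {"s1": 10, "s2": 0},
    "g2": {"s1": 5, "s2": 3},
    "g3": {"s1": 1, "s2": 7},
    "g4": {"s1": 2, "s2": 2},
}
example_taxa = {"g1": "Bacteroides", "g2": "Bacteroides", "g3": "Escherichia"}
taxon_counts = sum_counts_by_group(example_counts, example_taxa)
print(taxon_counts)
```

The same function can be applied with a gene-to-CAG, gene-to-taxon, or gene-to-function mapping, which is what allows all three kinds of group to be analyzed with the same model.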

sminot · 2021-03-02T16:01:24Z

The "interactive visualization" feature is now implemented as buildRedis

sminot · 2021-03-02T16:02:02Z

After spending time with this release, I think that the antiSMASH approach is not going to be easily integrated into the codebase, at least not for this release.

sminot self-assigned this Jan 17, 2021

sminot added the discussion and enhancement labels Jan 17, 2021

sminot pushed a commit that referenced this issue Jan 26, 2021

Extract rRNA alleles with barrnap

b966b47

Addresses #56

sminot pushed a commit that referenced this issue Jan 26, 2021

Always extract the readcounts per CAG

f55edaa

Not just if the user provides a formula. Addresses #56

sminot pushed a commit that referenced this issue Feb 2, 2021

Add --tax_evalue parameter

1deff83

Addresses #56

sminot pushed a commit that referenced this issue Feb 2, 2021

Run corncob over taxa and functions, as well as CAGs

0fb653b

Addresses #56
