Master Issue - v0.9 #56

Open
sminot opened this issue Jan 17, 2021 · 8 comments

sminot commented Jan 17, 2021

Now that there is a critical mass of minor improvements to be made, I am creating a master issue to organize all of the changes which will go into a new minor release. I am describing this as a minor release because we will not be breaking the syntax of the inputs. The changes to be made will certainly change the content of what is being output, but that should be restricted to either (a) improved quality of existing output objects or (b) the addition of new output objects which extend the functionality of the workflow.

The most substantive change to the workflow will be the implementation of a new CAG clustering approach which uses co-assembly information to boost the performance of co-abundance clustering.

I'd like to use this issue to discuss what features and fixes will be included in the release, as well as tracking which existing issues will be subsumed by this one.

cc @jgolob

sminot self-assigned this Jan 17, 2021
sminot added the discussion and enhancement labels Jan 17, 2021

sminot commented Jan 17, 2021

Items:


sminot commented Jan 17, 2021

On the topic of "Aggregate results in file objects which are amenable for interactive visualization," my first thought is to save a redis store. After reading this, I will first try RDB format.

The biggest question here is going to be the tradeoff between file format, file size, and file content. I'd like to include everything needed for one particular approach to visualization, but the file should be less than a gigabyte for a typical experiment, and it should be quick to read. This requirement is making me want to move away from HDF for the visualization portion, but the use of redis may need to be reassessed depending on how things go.

The smaller file size will also hopefully be supported by increasing the sensitivity of CAG clustering, which should reduce the overall number of CAGs.

The tables that I think would be needed for visualization are:

  • (per CAG) Size (number of genes)
  • (per CAG) Majority taxonomic assignment of genes (for labeling purposes only) at each taxonomic rank
  • (per CAG) Estimated coefficients of association (and p-values) for each covariate
  • (per CAG) Number of genes assigned at each taxonomic rank
  • (per CAG) Number of genes assigned to each unique function (via eggNOG)
  • (per CAG) Relative proportion of gene copies which are assigned to this CAG across all specimens
  • (per CAG) Ordination layout for plotting on the basis of taxonomic classification spectra
  • (per dataset) User-provided metadata sheet (with R1/R2 removed, and all specimens deduplicated)
  • (per taxon) Set of CAGs which contain genes assigned to that taxon
  • (per taxon) Relative proportion of gene copies which are assigned to this taxon across all specimens
  • (per function) Set of CAGs which contain genes assigned to this eggNOG function
  • (per function) Relative proportion of gene copies which are assigned to this function across all specimens
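As a rough illustration of the first and sixth tables in the list above, here is a minimal Python sketch. It is not the geneshot implementation: the function names and the input shapes (`gene_to_cag` as a gene-to-CAG dict, `gene_counts` as a per-specimen dict of gene copy counts) are assumptions made for the example, and the proportion is computed per specimen against all gene copies in that specimen, which is only one reading of "relative proportion".

```python
from collections import Counter, defaultdict


def cag_sizes(gene_to_cag):
    """Return the number of genes assigned to each CAG."""
    return Counter(gene_to_cag.values())


def cag_proportions(gene_counts, gene_to_cag):
    """Return, per specimen, the fraction of gene copies assigned to each CAG.

    gene_counts: {specimen: {gene: copies}} (hypothetical input shape)
    gene_to_cag: {gene: cag_id}             (hypothetical input shape)
    """
    proportions = {}
    for specimen, counts in gene_counts.items():
        totals = defaultdict(float)
        for gene, copies in counts.items():
            cag = gene_to_cag.get(gene)
            if cag is not None:  # genes without a CAG only count in the denominator
                totals[cag] += copies
        depth = sum(counts.values())
        proportions[specimen] = (
            {cag: total / depth for cag, total in totals.items()} if depth else {}
        )
    return proportions
```

Both results are small dictionaries keyed by CAG, which keeps them easy to serialize into whichever key-value or tabular format ends up being used for visualization.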


sminot commented Jan 25, 2021

The major update to co-abundance gene clustering (in 5eb1441) implements iterative agglomerative clustering of genes, using co-assembly information to prioritize the order in which genes are added. This approach also implements a very simple filtering mechanism to limit the set of clustered genes to those which are assembled into a contig with >=X depth which contains >=Y genes in total. These thresholds are set with the parameters --min_contig_depth and --min_contig_size, respectively.
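The contig-based filter described above could be sketched roughly as follows. This is an illustration rather than the code in 5eb1441: the function name and input dictionaries are hypothetical, each gene is assumed to map to a single contig, and the mapping of the depth threshold to `min_contig_depth` and the gene-count threshold to `min_contig_size` is inferred from the parameter names.

```python
from collections import Counter


def filter_genes_by_contig(gene_to_contig, contig_depth,
                           min_contig_depth, min_contig_size):
    """Return the set of genes whose contig passes both thresholds.

    gene_to_contig: {gene: contig_id}   (assumes one contig per gene)
    contig_depth:   {contig_id: depth}  (hypothetical input shape)
    """
    # Count how many genes were assembled into each contig
    genes_per_contig = Counter(gene_to_contig.values())
    return {
        gene
        for gene, contig in gene_to_contig.items()
        if contig_depth.get(contig, 0) >= min_contig_depth
        and genes_per_contig[contig] >= min_contig_size
    }
```

Only the genes that pass this filter would then be handed to the clustering step.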


sminot commented Jan 25, 2021

DSL-2 syntax implemented with f0b609e


sminot commented Jan 25, 2021

I did some digging into antiSMASH for metabolite prediction, and it appears that this software suite may be more suitable for running in a separate workflow, and less suitable for running within geneshot.

On a practical level, the antiSMASH software is distributed as a Docker image which includes both the code and reference database, which is far too large for the default Docker partition on many batch computing systems (including AWS Batch) and may prove to be an extremely challenging reconfiguration for many users.

On a more theoretical level, the output of the antiSMASH software is highly oriented towards the human inspection of genomic loci (see their very nicely written documentation). This is extremely interesting and useful, but it is fundamentally distinct from the gene-oriented analysis performed by geneshot. In other words, the units of analysis for antiSMASH (being the operon or genomic region) are orthogonal to anything which geneshot provides at the moment.

On another theoretical note, the reliance of antiSMASH on long assembled contigs may be confounded by the highly fragmented assemblies which result from short-read metagenomics. Further advancements in either metabolomics analysis or in long-read sequencing may change this calculus in the future.

Planning for the future, I am more inclined to incorporate metabolomics analysis into geneshot when we can use tools which output (e.g.) the predicted abundance of metabolites for a single specimen.

Do you have thoughts on the addition of antiSMASH or other tools like it into geneshot, @jgolob?

sminot pushed a commit that referenced this issue Jan 26, 2021
sminot pushed a commit that referenced this issue Jan 26, 2021
Not just if the user provides a formula. Addresses #56

sminot commented Feb 1, 2021

One fairly major change that I'd like to implement relates to how we consider the association of organism abundances with the experimental design, with regard to taxonomic and functional annotations.

In the pre-v0.9 approach, the experimental design was used to estimate the association of CAG relative abundances with a user-provided formula. The connection with taxa or functions was then made with the betta approach, which considered the subset of CAGs which contained any genes with those annotations.

Instead, what would be more direct would be to analyze the taxonomic groups and the eggNOG functional annotation groups in the exact same way as the CAGs, by running corncob on the readcounts summed over the group of genes which share the same grouping.
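The summing step could look roughly like the Python sketch below. The function name and input shapes are assumptions for illustration, not the workflow's code, and the call to corncob itself (an R package) is not shown. A gene is allowed to belong to more than one group, so its readcounts contribute to every taxon or eggNOG function it is annotated with.

```python
from collections import defaultdict


def sum_counts_by_group(gene_counts, gene_to_groups):
    """Sum readcounts over all genes that share an annotation group.

    gene_counts:    {specimen: {gene: readcount}}   (hypothetical input shape)
    gene_to_groups: {gene: iterable of group labels,
                     e.g. taxa or eggNOG functions} (hypothetical input shape)
    Returns {group: {specimen: summed readcount}}.
    """
    grouped = defaultdict(lambda: defaultdict(int))
    for specimen, counts in gene_counts.items():
        for gene, readcount in counts.items():
            # Unannotated genes are skipped; multi-annotated genes count in each group
            for group in gene_to_groups.get(gene, ()):
                grouped[group][specimen] += readcount
    return {group: dict(per_specimen) for group, per_specimen in grouped.items()}
```

The resulting group-by-specimen table has the same shape as the CAG-by-specimen readcount table, so it could in principle be passed through the same modeling step.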

sminot pushed a commit that referenced this issue Feb 2, 2021
sminot pushed a commit that referenced this issue Feb 2, 2021

sminot commented Mar 2, 2021

The "interactive visualization" feature is now implemented as buildRedis


sminot commented Mar 2, 2021

After spending time with this release, I think that the antiSMASH approach is not going to be easily integrated into the codebase, at least not for this release.
