Master Issue - v0.9 #56
Items:
On the topic of "Aggregate results in file objects which are amenable for interactive visualization," my first thought is to save a Redis store. After reading this, I will first try RDB format. The biggest question here is going to be the tradeoff between file format, file size, and file content. I'd like to include everything needed for one particular approach to visualization, but the file should be smaller than a gigabyte for a typical experiment and should be quick to read. This requirement is making me want to move away from HDF for the visualization portion, but the use of Redis may need to be reassessed depending on how things go. The smaller file size will hopefully also be supported by increasing the sensitivity of CAG clustering, which should reduce the overall number of CAGs. The tables that I think would be needed for visualization are:
The major update to co-abundance gene clustering (in 5eb1441) implements iterative agglomerative clustering of genes, using co-assembly information to prioritize the order in which genes are added. This approach also implements a very simple filtering mechanism to limit the set of clustered genes to those which are assembled into a contig with >=X depth which contains >=Y genes in total. These parameters are |
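The filtering step described in this comment can be sketched roughly as follows. This is only an illustration, not the code from 5eb1441: the input layout and the names `filter_genes_by_contig`, `min_depth` (standing in for X) and `min_genes` (standing in for Y) are assumptions made for the example.

```python
from collections import defaultdict


def filter_genes_by_contig(gene_contig, contig_depth, min_depth, min_genes):
    """Return the genes found on at least one contig with depth >= min_depth
    that carries >= min_genes genes in total.

    gene_contig: iterable of (gene_id, contig_id) pairs (hypothetical layout)
    contig_depth: dict mapping contig_id -> sequencing depth (hypothetical layout)
    """
    # Collect the set of genes assembled into each contig
    genes_per_contig = defaultdict(set)
    for gene_id, contig_id in gene_contig:
        genes_per_contig[contig_id].add(gene_id)

    kept = set()
    for contig_id, genes in genes_per_contig.items():
        # Contigs with no recorded depth are treated as depth 0
        deep_enough = contig_depth.get(contig_id, 0) >= min_depth
        if deep_enough and len(genes) >= min_genes:
            kept.update(genes)
    return kept
```

Only the genes returned by a filter like this would then be passed on to the clustering step; the iterative agglomerative clustering itself is not shown here.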
DSL-2 syntax implemented with f0b609e |
I did some digging into antiSMASH for metabolite prediction, and it appears that this software suite may be more suitable for running in a separate workflow, and less suitable for running within this workflow.

On a practical level, the antiSMASH software is distributed as a Docker image which includes both the code and the reference database. That image is far too large for the default Docker partition on many batch computing systems (including AWS Batch), and reconfiguring those systems may prove extremely challenging for many users.

On a more theoretical level, the output of the antiSMASH software is highly oriented towards the human inspection of genomic loci (see their very nicely written documentation). This is extremely interesting and useful, but it is fundamentally distinct from the gene-oriented analysis performed by this workflow.

On another theoretical note, the reliance of antiSMASH on long assembled contigs may be confounded by the highly fragmented assemblies which result from short-read metagenomics. Further advancements in either metabolomics analysis or in long-read sequencing may change this calculus in the future.

Planning for the future, I am more inclined to incorporate metabolomics analysis into geneshot when we can use tools which output (e.g.) the predicted abundance of metabolites for a single specimen.

Do you have thoughts on the addition of antiSMASH or other tools like it into this workflow?
Not just if the user provides a formula. Addresses #56
One fairly major change that I'd like to implement relates to how we consider the association of organism abundances with the experimental design, with regard to taxonomic and functional annotations. In the pre-v0.9 approach, the experimental design was used to estimate the association of CAG relative abundances with a user-provided formula. The connection with taxa or functions was then made with the

Instead, a more direct approach would be to analyze the taxonomic groups and the eggNOG functional annotation groups in exactly the same way as the CAGs, by running corncob on the read counts summed over the group of genes which share the same grouping.
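As a rough sketch of the summing step described above, the read counts for each gene can be added up per specimen within whatever grouping is being tested. The function name `sum_readcounts_by_group` and the dictionary layouts are hypothetical, and the example assumes each gene carries a single group label, which is a simplification. The corncob step itself is not shown.

```python
from collections import defaultdict


def sum_readcounts_by_group(gene_counts, gene_groups):
    """Sum per-specimen read counts over all genes sharing a group label.

    gene_counts: dict gene_id -> dict specimen -> read count (hypothetical layout)
    gene_groups: dict gene_id -> group label, e.g. a CAG, taxon, or eggNOG group
    """
    totals = defaultdict(lambda: defaultdict(int))
    for gene_id, per_specimen in gene_counts.items():
        group = gene_groups.get(gene_id)
        if group is None:
            continue  # genes without a label do not contribute to any group
        for specimen, count in per_specimen.items():
            totals[group][specimen] += count
    # Convert nested defaultdicts to plain dicts for easier downstream use
    return {group: dict(counts) for group, counts in totals.items()}
```

Because the same function applies to any gene-to-group mapping, CAGs, taxa, and functional groups would all produce tables of the same shape, which is what would let them be tested in the same way.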
The "interactive visualization" feature is now implemented as |
After spending time with this release, I think that the antiSMASH approach is not going to be easily integrated into the codebase, at least not for this release. |
Now that there is a critical mass of minor improvements to be made, I am creating a master issue to organize all of the changes which will go into a new minor release. I am describing this as a minor release because we will not be breaking the syntax of the inputs. The changes to be made will certainly change the content of what is being output, but that should be restricted to either (a) improved quality of existing output objects or (b) the addition of new output objects which extend the functionality of the workflow.
The most substantive change to the workflow will be the implementation of a new CAG clustering approach which uses co-assembly information to boost the performance of co-abundance clustering.
I'd like to use this issue to discuss what features and fixes will be included in the release, as well as tracking which existing issues will be subsumed by this one.
cc @jgolob