Assemblies with >1 Cov #212
Comments
We do have information about which contigs contain RdRp in bgc_statistics.txt. |
For the record, I vote for dropping those accessions from the initial preprint. Viral contig binning sounds like a can of worms. |
Disagree. If we have two global RdRp HMM alignments that are <97% identical to each other, we have two species. This seems pretty solid to me. This scenario is not unlikely in something like a virome or an environmental bat poop sample. |
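The 97% rule in the comment above could be sketched roughly as follows. This is only an illustration, assuming the RdRp sequences have already been aligned to the same HMM so that all strings have equal length with `-` or `.` as gap characters. The function names `pairwise_identity` and `count_species`, the greedy grouping, and the choice to ignore gapped columns are assumptions for this sketch, not how the pipeline actually does it.

```python
GAP_CHARS = {"-", "."}
SPECIES_IDENTITY = 0.97  # threshold quoted in the thread


def pairwise_identity(aln_a: str, aln_b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def count_species(alignments, threshold=SPECIES_IDENTITY):
    """Greedy count: a sequence starts a new group if it is below the
    threshold against every representative seen so far."""
    representatives = []
    for seq in alignments:
        if all(pairwise_identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)
```

With this, an SRA accession whose RdRp alignments give `count_species(...) >= 2` would be a candidate for holding more than one CoV, while near-identical strain variants collapse into one group.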
I'd say for immediacy we go with one, then we swing back and pick up the stragglers in a second pass. I would wager that 99% of the time this is going to be the same CoV, just a strain variant. Leave one of these issues open and we will return to it in a second pass. |
I would agree, except this is closely related to Frank's missing RdRp. Why not kill two pols with one stone and capture all the RdRps in the CoV contigs? |
I mean, we cluster by OTU later, so we can get rid of duplicates that way, so yes, it could work. We'll just need to decide whether we report the number of unique SRA accessions or the number of unique contigs for count data. |
So, there are a few issues here. The Frank problem is actually due to a subtle problem in the way the assemblies were run: the scaffolder got turned off. We might want to re-run some of the assemblies, including Frank, to obtain a single scaffold in the results. I believe I saw at least one other dataset that would benefit from it. Using all RdRps is certainly another story, and is useful for tree building, etc. |
I wanted to share a specific example of an assembly with multiple contigs, seemingly coming from different genomes. Here is the FTR table output of VADR for
The 4th column is the "model" column. It holds the NCBI Nucleotide accession of the closest-matching RefSeq model (a VADR model is an organized set of CMs). You'll notice that there are six contigs with ORF 1ab hits, each of which has a different closest-matching model. |
We appear to have an unsolved problem with assemblies that have multiple CoVs.
From @rchikhi in an earlier issue: "Among the 10,816 datasets of the master table, 272 (2.4%) of them have CoV contigs of total size longer than 50 kbp (arbitrary threshold at which there's likely >=2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20kbp. So if we decided to try to separate genome, maybe taking the contigs longer than 20kbp is a viable strategy."
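The heuristic quoted above could be sketched like this. It is a minimal illustration, assuming per-accession contig lengths are available as a mapping from contig name to length in bp. The function names and the example lengths are hypothetical, and the two thresholds are the arbitrary ones from the quote.

```python
TOTAL_BP_THRESHOLD = 50_000  # total CoV contig size suggesting >= 2 genomes
LONG_CONTIG_BP = 20_000      # contigs longer than this are kept as candidates


def likely_multi_genome(contig_lengths: dict) -> bool:
    """True if the summed CoV contig length exceeds the 50 kbp threshold."""
    return sum(contig_lengths.values()) > TOTAL_BP_THRESHOLD


def candidate_genomes(contig_lengths: dict) -> list:
    """Names of contigs longer than 20 kbp, sorted for stable output."""
    return sorted(
        name for name, length in contig_lengths.items() if length > LONG_CONTIG_BP
    )


# Hypothetical accession with two long contigs and one short fragment.
example = {"contig_1": 29_800, "contig_2": 27_500, "contig_3": 1_200}
if likely_multi_genome(example):
    print(candidate_genomes(example))
```

An accession would only be split when `candidate_genomes` returns two or more names; the remaining accessions above 50 kbp would still need another approach.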
If there are 272, then IMO we do need to split them into separate assemblies, because several downstream analyses assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly, darth is another. If there are two good viruses in one SRA, I will need PFAM alignments for each separately, or at a minimum the RdRps. Regardless of what else we do, I think it would be a good idea to check whether there are two good RdRp alignments, to ensure we don't lose good novel CoVs. Maybe the CS output can tell us if there are two RdRps.
@ababaian, I suggest you offer guidance here.