Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assemblies with >1 Cov #212

Open
rcedgar opened this issue Jul 19, 2020 · 8 comments
Open

Assemblies with >1 Cov #212

rcedgar opened this issue Jul 19, 2020 · 8 comments
Assignees

Comments

@rcedgar
Copy link
Collaborator

rcedgar commented Jul 19, 2020

We appear to have an unsolved problem with assemblies that have multiple Covs.

From @rchikhi in an earlier issue: "Among the 10,816 datasets of the master table, 272 (2.4%) of them have CoV contigs of total size longer than 50 kbp (arbitrary threshold at which there's likely >=2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20kbp. So if we decided to try to separate genome, maybe taking the contigs longer than 20kbp is a viable strategy."

If there are 272, then IMO we do need to split them into two assemblies because there are several downstream analyses that assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly then darth is another. I will need PFAM alignments separately if there are two good viruses in one SRA, at a minimum the RdRps if there are two. Regardless of what else we do, I think it would be a good idea to check if there are two good RdRp alignments to ensure we don't lose good novel Covs. Maybe the CS output can tell us if there are two RdRps.

@ababaian suggest you offer guidance here.

@asl
Copy link

asl commented Jul 19, 2020

We do have information about which contigs do contain RdRp in bgc_statistics.txt.

@rchikhi
Copy link
Collaborator

rchikhi commented Jul 19, 2020

For the record, I vote on dropping those accessions from the initial preprint. Viral contig binning sounds like a can a worms

@rcedgar
Copy link
Collaborator Author

rcedgar commented Jul 19, 2020

Disagree. If we have two global RdRp HMM alignments that are <97% identical to each other, we have two species. This seems pretty solid to me. This scenario is not unlikely in something a virome or environmental bat poop sample.

@ababaian
Copy link
Owner

I'd say for immediacy we go with one, then we swing back and pick up the stragglers in a second pass. I would wager 99% of the time this is going to be the same CoV just a strain variant. Leave one of these issues open and we will return to it in a second pass.

@rcedgar
Copy link
Collaborator Author

rcedgar commented Jul 20, 2020

I would agree, except this is a closely related issue to Frank's missing RdRp. Why not kill two pols with one stone and capture all the RdRps in the Cov contigs.

@ababaian
Copy link
Owner

I mean we cluster by OTU later so we can get rid of duplicates that way so it could work yes. We'll just need to decide if we report how many unique SRA we have or how many unique contigs for count data.

@asl
Copy link

asl commented Jul 20, 2020

So, there are few issues here. Actually the Frank problem is due to the subtle problem in the way how assemblies were run, so scaffolder got turned off. We might want to re-run of of the assemblies including Frank to obtain a single scaffold in the results. I believe I saw at least one other dataset that would benefit from it.

Certainly, using all RdRps is another story and is useful for tree building, etc.

@taltman
Copy link
Collaborator

taltman commented Jul 21, 2020

I wanted to share a specific example of an assembly with multiple contigs, seemingly coming from different genomes.

Here is the FTR table output of VADR for SRR8389793:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR8389793/SRR8389793$ awk -v OFS="\t" '{ print $1, $2, $3, $5, $6, $7 }' SRR8389793.vadr.ftr 
#       seq     seq     ftr     ftr     ftr
#idx    name    len     model   type    name
#---    ---------------------------------------------------     ----    ---------       -----------     -----------------------------------------------
1.1     NODE_9_length_984_cluster_9_candidate_1_domains_2       984     NC_035191       gene    gene.1
1.2     NODE_9_length_984_cluster_9_candidate_1_domains_2       984     NC_035191       CDS     ORF1ab_polyprotein
#
2.1     NODE_18_length_279_cluster_18_candidate_1_domains_1     279     NC_009019       gene    orf1ab
2.2     NODE_18_length_279_cluster_18_candidate_1_domains_1     279     NC_009019       CDS     orf1ab_polyprotein
#
3.1     NODE_19_length_275_cluster_19_candidate_1_domains_1     275     NC_016994       gene    orf1ab
3.2     NODE_19_length_275_cluster_19_candidate_1_domains_1     275     NC_016994       CDS     replicase_polyprotein
#
4.1     NODE_6_length_1483_cluster_6_candidate_1_domains_2      1483    NC_016996       gene    orf1ab
4.2     NODE_6_length_1483_cluster_6_candidate_1_domains_2      1483    NC_016996       CDS     replicase_polyprotein
#
5.1     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       gene    ORF1ab
5.2     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       CDS     ORF1ab_polyprotein
5.3     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       mat_peptide     endoRNAse
5.4     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       mat_peptide     2'-O-ribose_methyltransferase
#
7.1     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       gene    orf1ab
7.2     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       CDS     orf1ab_polyprotein
7.3     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       CDS     orf1a_polyprotein
7.4     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       mat_peptide     coronavirus_nsp3_(HD2):GBSEP:hydrophobic_domain

If you look at the 4th column, it is the "model" column. This is the NCBI Nucleotide accession for the closest-matching RefSeq model (a VADR model is an organized set of CMs). You'll notice that there are six contigs with ORF 1ab hits, all of which have different models as the closest-matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants