Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dicistro assemblies #239

Open
rchikhi opened this issue Dec 14, 2020 · 16 comments
Open

Dicistro assemblies #239

rchikhi opened this issue Dec 14, 2020 · 16 comments

Comments

@rchikhi
Copy link
Collaborator

rchikhi commented Dec 14, 2020

Coronaspades' gene_clusters file are here: s3://serratus-public/assemblies/dicistro/gene_clusters/
Other coronaspades files: s3://serratus-public/assemblies/dicistro/other/

4921 assemblies made out of 5442 accessions: s3://serratus-public/assemblies/dicistro/analysis/list_assembled_dicistro.txt

@rchikhi
Copy link
Collaborator Author

rchikhi commented Dec 14, 2020

All gene_clusters concatenated into a single file: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.gz

Diamond of all gene_clusters assemblies versus rdrp0_q_d: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6.gz

Diamond of all gene_clusters assemblies versus dicistro.protref.aa: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_dicistro_protref.fmt6.gz

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 15, 2020

My analysis of the above data is posted on S3 here:

s3://serratus-public/rce/quenya_analysis/
s3://serratus-public/rce/dicistro_analysis/

Selection criteria are: contig length >300nt, RdRp/ORF1 id in range 50..75%.

Many assemblies have hits to multiple contigs, but many of these look like assembly problems where the same virus is assembled multiple times. Therefore, I select only the first matching hit from each SRA. These results may therefore underestimate the correct number in cases where there are multiple viruses in one SRA.

Each directory contains the following files:

first_hit_id50to75.fa -- this is the aligned query segment from the diamond output.

first_hit_id50to75_clustered_id70.fa -- This is first_hit_id50to75.fa clustered at 70% nt identity.

first_hit_id50to75.tsv -- tabbed file with fields 1. SRA, 2. contig, 3. contig_length, 4. pctid, 5. database_label.

@rchikhi
Copy link
Collaborator Author

rchikhi commented Dec 21, 2020

Pathracer-seq-fs analysis :

all files are in s3://serratus-public/assemblies/dicistro/analysis/

(1) the a.a. translations of RdRp according to PR

(PR applied to all_gene_clusters.fa and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa

(2) those RdRp a.a. sequences clustered at 97%id

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.fa

(3) diamond search of the clustered sequences vs. dicistro.protref.aa

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6

(4) diamond search of the clustered sequences vs. rdrp0.

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6

cmdlines used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/cluster.sh

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 21, 2020

what did you mean by "I don't see any of these" in (4) above?

@rchikhi
Copy link
Collaborator Author

rchikhi commented Dec 21, 2020

Ah sorry :) was a copy-paste of your message that I forgot to erase!

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 21, 2020

My analysis uploaded to: s3://serratus-public/rce/dicistro_analysis/

The RdRp a.a. sequences provided by Rayan above were clustered at 90% id ("species") and classified according to %id to rdrp0 and disistro,protref.aa.

Files are:

known_species.fa FASTA containing sequences with >=90% match to any known RdRp.

novel_species.fa FASTA containing sequences with no match >=90% to any known RdRp.

otutable.tsv Tabbed file with one line per species giving top hit to rdrp0 and/or dicistro.protref with %identities.

type_counts.tsv Number of species in each category:

11100   NovelFamily # species is <50% to any known RdRp
5980    NovelGenus_Other # 50..75% to known RdRp from non-Dicistro family
2196    KnownSpecies # >=90% to a known RdRp
1751    NovelSpecies_Other # 75..90% to known RdRp from non-Dicistro family
586     NovelGenus_Dicistro # 50..75% to known RdRp from Dicistro
197     Novel_NoMatch # Self-explanatory. Is this really an RdRp?
98      NovelSpecies_Dicistro # 75..90% to known Dicistro RdRp

@rchikhi
Copy link
Collaborator Author

rchikhi commented Dec 22, 2020

Pathracer analysis versus all assembly graphs (not the gene_clusters this time):

all files are in s3://serratus-public/assemblies/dicistro/analysis/

(1) the a.a. translations of RdRp according to PR

(PR applied to each assembly graph and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)

all_assembly_graphs.RdRP_1234q.seqs.fa.gz

(2) those RdRp a.a. sequences clustered at 97%id

all_assembly_graphs.RdRP_1234q.seqs.centroids.fa.gz

(3) diamond search of the clustered sequences vs. dicistro.protref.aa

all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz̀

(4) diamond search of the clustered sequences vs. rdrp0.

all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz̀

scripts used:
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_run.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_organize.sh

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 23, 2020

Results uploaded to s3://serratus-public/rce/dicistro_analysis/pr_graphs/, same filenames as in my comment above. If this analysis is correct, there are 59k new species. Summary totals:

21886   NovelGenus_Other
20666   NovelFamily
10955   NovelSpecies_Other
6439    KnownSpecies
4310    Novel_NoMatch
1340    NovelGenus_Dicistro
99      NovelSpecies_Dicistro

@ababaian
Copy link
Owner

ababaian commented Dec 27, 2020

Monkey work checking rce/novel_species.fa file.

Error Type 1: Low complexity sequences

From: novel_species.fa

    285 >SRR6400004;Type=Novel_NoMatch;Rdrp0=none;Dicistro=none;
    286 KHTQTLKHTHTLKHTHKHTHKHTLKHSNTLKHTHTHTLKHTLKHTHTQTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTN
    287 THTHTHTHTHTHTNTHTHTHTHTQTHTQTHTHSNTHTHTLKHTLKHTHSNTHSNTHTHTHTHTHTHTHTQTHTHTHTHTH
    288 KHTLKHTHTQTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHKHTHTHTHTHTHTHKHTHTHTHTHTHTHSNTHSNTHTQTH
    289 TQTHTHTHTHTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTHTHTHTHTQTH
    290 TQTHTHSNTHTHTLKHTHTHTHTHTHTHTHTHTLKHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTLKHTLKHTHSNTHSN
    291 THTHTHTHTHTHTHTQTHTHTHKHTHTHTHTHTNTHTNTHTHKHTHTHTQTHTQTHTHTHTHTHTHTQTHTHTHTHTHTH
    292 TLKHTLKHTHSNTHSNTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTHTHTHTHSNTHSNTHTHTHTHSNT

Diamond finds no similarity of this in either rdrp0 or dicistro databases. This has no dicistro/rdrp0 match yet the HMM models hit. Traceback on the hit in pathracer, Anton working on this.

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 27, 2020

Yes, I found these a few days ago; there is a discussion with Anton on the slack . These are filtered out by the diamond E-value, but we should keep this issue in mind. PR does not report an E-value, so any long enough ORF will induce an alignment.

@ababaian
Copy link
Owner

Putative Error 2: Known polio, likely misassembly

From: rce/novel_species.fa

Line: 5573 >SRR1036663;Type=NovelGenus_Other;rdrp0=rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1(70.4%);Dicistro=none;
EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINTEQMCLEDAMYGTD
GLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYV
AMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK
VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFE
ALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSI
FNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLLAWH
NGEEEYNKFLAKIRSVPI

Taken from SRR1036663: WT poliovirus serially passaged through HelaS3 - sequenced using CirSeq - Acevedo et al. Nature 2013

Blastp hit: 70.93% from polio

Select seq gb|AFQ38633.1| 	RNA polymerase 3D [Human poliovirus 1] 	Human poliovirus 1 	781 	781 	100% 	0.0 	70.93% 	461 	AFQ38633.1

Query  1    EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  60
            EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL
Sbjct  26   EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  85

Query  61   MSLDINTEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLS  120
            MSLDINTEQMCLEDAMYGTDGLEALDLSTS                              
Sbjct  86   MSLDINTEQMCLEDAMYGTDGLEALDLSTS------------------------------  115

Query  121  TSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEM  180
              AGYPYVA+GKKKRDI+NKQTRDTKE                                M
Sbjct  116  --AGYPYVALGKKKRDIMNKQTRDTKE--------------------------------M  141

Query  181  QKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK  240
            Q+LLDTYGINLPLVTYVKDEL                                  RS+TK
Sbjct  142  QRLLDTYGINLPLVTYVKDEL----------------------------------RSRTK  167

Query  241  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEE  300
            VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGV+TGSAVGCDPDLFWSKIPVLMEE
Sbjct  168  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVVTGSAVGCDPDLFWSKIPVLMEE  227

Query  301  KLFAFDYTGYDASLSPAWFEALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  360
            KLFAFDYTGYDASLSPAWFEALKM LEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP
Sbjct  228  KLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  287

Query  361  SGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKG  420
            SGCSGTSIFNSM                                INNLIIRTLLLKTYKG
Sbjct  288  SGCSGTSIFNSM--------------------------------INNLIIRTLLLKTYKG  315

Query  421  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK  480
            IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT   
Sbjct  316  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT---  372

Query  481  RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRW  540
                                         FLKRFFRADEKYPFLIHPVMPMKEIHESIRW
Sbjct  373  -----------------------------FLKRFFRADEKYPFLIHPVMPMKEIHESIRW  403

Query  541  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  578
            TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI
Sbjct  404  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  441

That library contains a good clean hit to polio, and the hit above is from "another assembly" I am guessing.

From rc/all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa

>Score=79.0939|Bitscore=161.231|PartialBitscore=161.231|Seq=SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9|Position=4995|Frameshifts=0|Alignment=29M3D12M1D9M1D21M6D39M2D15M8D56M1I7M1D12M3D28M1D10M4D13M1D46M10D16M2D21M12D10M57D5M

From: .rc/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6

SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9  rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1      99.7    339     1       0       5029    6045    1       339     1.5e-195        680.2   6046    425     GTGAAGGAACCAGCAGTCCTCACTAAAAACGATCCCAGGCTTAAGACAGACTTTGAGGAGGCAATTTTCTCCAAGTACGTGGGTAACAAAATTACTGAAGTGGATGAGTACATGAAAGAGGCAGTAGACCACTATGCTGGCCAGCTCATGTCACTAGACATCAACACAGAACAAATGTGCTTGGAGGATGCCATGTATGGCACTGATGGTCTAGAAGCACTTGATTTGTCCACCAGTGCTGGCTACCCTTATGTAGCAATGGGAAAGAAGAAGAGAGACATCTTGAACAAACAAACCAGAGACACTAAGGAAATGCAAAAACTGCTCGACACATATGGAATCAACCTCCCACTGGTGACTTATGTAAAGGATGAACTTAGATCCAAAACAAAGGTTGAGCAGGGGAAATCCAGATTAATTGAAGCTTCTAGTTTGAATGACTCAGTGGCAATGAGAATGGCTTTTGGGAACCTATATGCTGCTTTTCACAAAAACCCAGGAGTGATAACAGGTTCAGCAGTGGGGTGCGATCCAGATTTGTTTTGGAGCAAAATTCCGGTATTGATGGAAGAGAAGCTGTTTGCTTTTGACTACACAGGGTATGATGCATCTCTCAGCCCTGCTTGGTTCGAGGCACTAAAGATGGTGCTTGAGAAAATCGGATTCGGAGACAGAGTTGACTACATCGACTACCTAAACCACTCACACCACCTGTACAAGAATAAAACATACTGTGTCAAGGGCGGTATGCCATCTGGCTGCTCAGGCACTTCAATTTTTAACTCAATGATTAACAACTTGATTATCAGGACACTCTTACTGAAAACCTACAAGGGCATAGATTTAGACCACCTAAAAATGATTGCCTATGGTGATGATGTAATTGCTTCCTACCCCCATGAAGTTGACGCTAGTCTCCTAGCCCAATCAGGAAAAGACTATGGACTAACTATGACTCCAGCTGACAAATCAGCTACATTTGAAACAGTCACATGGGAGAATGTAACATTCTTGAAG       VKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK

@asl
Copy link

asl commented Dec 27, 2020

I've uploaded PR hits to s3://serratus-public/assemblies/dicistro/analysis/ split into two groups:

  • all_assembly_graphs.RdRP_1234q.scaffold_seqs.fa.gz – all hits supported by scaffolds
  • all_assembly_graphs.RdRP_1234q.non_scaffold_seqs.fa.gz – all other hits

@asl
Copy link

asl commented Dec 27, 2020

Further uploaded:

all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz	
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz

@ababaian
Copy link
Owner

ababaian commented Dec 27, 2020

Here is the assembly graph from @asl for poliovirus:
poliovirus_exclusion

The issue is when within a single assembly-graph there are multiple paths which provide a RdRp match, each one is being returned, this is yielding indels as seen above in slippage-prone viruses like polio (or other RNA viruses).

Proposed Solution: 1. For each RdRp output graph, allow each rdrp-containing edge (red highlight above) to be used at most ONCE per reported sequences. The "top hit" will be selected by percent-identity / top-score to a known virus (thus assuming the hit is not novel). The risk is if there is a novel virus in the same library as a known virus and they share homology over an edge of say 50 amino acids, then the novel virus would be excluded as the known virus takes priority. The benefit is this will reduce intra-sample viral variants to the most conservative.

One more caveat, assume rdrp is the red region in the graph above and the viral genome is green. If a sub-graph (blue) has a higher identity match then the longest match (green), but does not contain an end-to-end RdRp domain, the longest match with variants should take priority.

@rcedgar
Copy link
Collaborator

rcedgar commented Dec 28, 2020

Analysis of high-trust scaffold sequences uploaded to:

s3://serratus-public/rce/dicistro_analysis/scaffolds/

15538 NovelFamily
9746 NovelGenus_Other
2252 NovelSpecies_Other
1713 KnownSpecies
877 Novel_NoMatch
481 NovelGenus_Dicistro
94 NovelSpecies_Dicistro

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jan 12, 2021

I extracted the assembled scaffolds that contain all of Anton's centroids (all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz from #239 (comment)).

Results are in: all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.scaffold_nucls.fa
in s3://serratus-public/assemblies/dicistro/analysis/

generated using: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_get_scaffolds.py
note: the script failed to get a scaffold for 2% (760) of the centroids, let me know if you need to recover them all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants