Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assemblies of 1k random SRAs from novel_id90.fa #246

Open
rchikhi opened this issue Feb 6, 2021 · 10 comments
Open

assemblies of 1k random SRAs from novel_id90.fa #246

rchikhi opened this issue Feb 6, 2021 · 10 comments

Comments

@rchikhi
Copy link
Collaborator

rchikhi commented Feb 6, 2021

List of the 1,000 random SRAs accessions:
s3://serratus-rayan/1krandom-assembly/1000_random_sra.txt
(obtained through: cat novel_id90.fa | grep ">" | cut -c 2- |sort |uniq | shuf -n 1000)

952 SPAdes log files were produced (meaning SPAdes at least started running the assembly). I could investigate why the remaining 48 failed if you'd like (but it doesn't seem so important). For 70 of the 952 SRAs, SPAdes didn't produce an output, likely because it ran out of memory (for the few cases I eyeballed). So, as a result:

882 SRAs were assembled into a scaffolds.fasta file:
s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds.txt

@rchikhi
Copy link
Collaborator Author

rchikhi commented Feb 7, 2021

All contigs having a motifator hit:
s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta

@rchikhi
Copy link
Collaborator Author

rchikhi commented Feb 8, 2021

hmmsearch results of those contigs versus Pfam-A:

s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.transeq.faa.*
(ran with hmmsearch -A [.sto] --tblout [.tbl] --domtblout [.domtbl] -o [.hmmsearch_stdout] Pfam-A.hmm [contigs.fa])

@rchikhi
Copy link
Collaborator Author

rchikhi commented Feb 8, 2021

CheckV on those contigs:

  • completeness.tsv : s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.checkv.completeness.tsv
  • quality_summary.tsv: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.checkv.quality_summary.tsv
    (other output files available on request: complete_genomes.tsv contamination.tsv proviruses.fna viruses.fna)

@rchikhi
Copy link
Collaborator Author

rchikhi commented Feb 9, 2021

Virsorter2 results on those contigs:

====> VirSorter run (provirus mode) finished.
            # of full    seqs (>=2 genes) as viral:     61575
            # of partial seqs (>=2 genes) as viral:     20
            # of short   seqs (< 2 genes) as viral:     99941
            Useful output files:
            final-viral-score.tsv       ==> score table
            final-viral-combined.fa     ==> all viral seqs
            final-viral-boundary.tsv    ==> table with boundary info
            Suffix is added to seq names in final-viral-combined.fa:
            full    seqs (>=2 genes) as viral:  ||full
            partial seqs (>=2 genes) as viral:  ||partial
            short   seqs (< 2 genes) as viral:  ||lt2gene
            NOTES:
            Users can further screen the results based on the following
                columns in final-viral-score.tsv:
                - contig length (length)
                - hallmark gene count (hallmark)
                - viral gene % (viral)
                - cellular gene % (cellular)
            The group field in final-viral-score.tsv should NOT be used
                as reliable taxonomy info
            <====
  • s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-score.tsv
  • s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-boundary.tsv

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jun 22, 2021

added the FASTA files of the viruses found by VirSorter2:

  • s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.fa

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jun 22, 2021

Palmscan on the above virsorter2 sequences + usearch_global (id=0.7) to palmdb using https://github.com/rcedgar/palmdb/blob/main/2021-03-14/uniques.fa.gz (as palmdb.fa in the scripts below).

results:

  • s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter_palmdb.b6

command lines:
script1 script2 script3

@rchikhi rchikhi closed this as completed Jun 22, 2021
@rchikhi rchikhi reopened this Jun 22, 2021
@rchikhi
Copy link
Collaborator Author

rchikhi commented Jun 26, 2021

Diamond of virsorter2 sequences to nr.

I took the "full" and "less than 2 genes" FASTA contigs produced by virsorter2. I discarded the "partial" ones as it's a tiny category (20 sequences out of 161536, wouldn't be significant to analyze, but let me know if you'd still like to have results for them).

diamond cmdline: diamond blastx -p 48 -q final-viral-combined.full.fa --db ~/diamonddb/nr.gz --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qseq sseq | tee out_diamond.full.fmt6

results for full: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.diamond.full.fmt6

results for lt2genes: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.diamond.lt2.fmt6

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jun 30, 2021

An analysis of the above raw results:

59,488 contigs annotated as full viruses:

  • 2,070 are highly similar to a known nr entry
  • 57,253 are novel

93,728 contigs annotated as having less than 2 genes (lt2genes):

  • 1,107 are highly similar to a known nr entry
  • 92,010 are novel

What I mean by 'nr entry' is an ENA accession, e.g. QJI52014.1 (didn't check if the accession was a virus or something else).

My criteria for 'similar to known entry' is:

  • aligns to some entry in nr with >= 90% identity and covers >= 90% of the length of the entry

Criteria for 'novel':

  • alignments to nr all have identity < 90% (irrespective of the alignment length)

Notebook of the analysis: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/1000_random/virsorter_fasta_analysis/notebook.ipynb

TSV results:

  • for each contig, if one or more alignments to nr entries cover >= 90% of the entry, report the entry with the highest identity

  • otherwise, no alignment covers >=90% of a genome, so report the alignment with the highest coverage to an nr entry

  • 'full' contigs: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.full.fmt6.summarized

Preview:

contig %idy fraction genome covered identifier of closest genome
DRR029069.NODE_8_length_7653_cov_87.338197::full 47.8 0.8600260416666666 QJI52014.1
DRR067252.NODE_22_length_2974_cov_87.067791::full 26.3 0.9118028534370947 QIJ70132.1
DRR067252.NODE_84_length_2351_cov_3.961331::full 36.8 0.9334257975034674 QGY72634.1
DRR067252.NODE_102_length_2260_cov_25.716362::full 36.8 0.9776119402985075 AVD97103.1
DRR067252.NODE_113_length_2231_cov_11.967205::full 22.8 0.7331975560081466 BAU79511.1
....
  • 'lt2genes' contigs: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.diamond.lt2.fmt6.summarized

Preview:

contig %idy fraction genome covered identifier of closest genome
DRR067252.NODE_15_length_3268_cov_87.993314::lt2gene 40.3 0.8797364085667215 AGK29950.1
DRR067252.NODE_42_length_2680_cov_47.739522::lt2gene 32.6 1.0401234567901234 ANR02704.1
DRR067252.NODE_43_length_2678_cov_10.159153::lt2gene 26.5 0.1342999762300927 YP_009553622.1
DRR067252.NODE_77_length_2401_cov_3.035180::lt2gene 38.5 0.5705967976710334 APG79084.1
....
  • a reminder that the contigs are available here: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.fa

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jun 30, 2021

And finally, the aggregation of the two previous palmdb+diamond analyses:
(#246 (comment) and #246 (comment))
(Corresponding slack discussion: https://hackseq-rna.slack.com/archives/G016GDKBDEF/p1624652669090600?thread_ts=1624570400.080400&cid=G016GDKBDEF)

Aggregated table for both full and lt2genes contigs: s3://serratus-rayan/1krandom-assembly/1000_random_sra.scaffolds_motifator.whole_contigs_hits.fasta.virsorter.final-viral-combined.nr_palm.aggregated.tsv

Preview:

contig closest nr genome identity to closest nr genome when palmprint found, identifier in palmdb identity to palmdb
DRR029069.NODE_8_length_7653_cov_87.338197::full QJI52014.1 47.8 u183088 100.0
DRR067252.NODE_22_length_2974_cov_87.067791::full QIJ70132.1 26.3 NA NA
DRR067252.NODE_84_length_2351_cov_3.961331::full QGY72634.1 36.8 NA NA
... - - - -

@rchikhi
Copy link
Collaborator Author

rchikhi commented Jul 4, 2021

Comparison between the above TSV file and the one @rcedgar produced in https://github.com/ababaian/serratus/wiki/Viral-contigs-containing-RdRP for a different set of contigs:

Set statistics:

  • 473,000 macro contigs in rdrp_contigs.tsv are not in 1000_random_sra
  • 45,913 contigs in 1000_random_sra are not in rdrp_contigs.tsv
  • Intersection: 103,350 contigs
    image

Curiously, 5,307 full contigs in 1000_random_sra are not in rdrp_contigs.tsv:

  • SRR9077587 NODE_1630_length_1873_cov_6..
  • SRR10854864 NODE_3006_length_3825_cov_10..
  • SRR11565434 NODE_240_length_3836_cov_4..
  • SRR5215308 NODE_1371_length_2221_cov_8..
  • ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant