rVert assemblies #241

Open
rchikhi opened this issue Jan 16, 2021 · 5 comments

@rchikhi
Collaborator

rchikhi commented Jan 16, 2021

Results are here: s3://serratus-rayan/rVert-assembly/

Folder structure:

Data:

pro/                          -- Original .pro data
fasta/                        -- Converted to .fasta
all_pe.fa                     -- Merged into single files
all_se.fa

Results:

individual/                       -- RNAViralSPAdes assembly of each SRA individually
rnaviralspades_coassembly_k*/     -- RNAViralSPAdes co-assembly of all_pe.fa+all_se.fa with given k values

Scripts:

assemble_individually.sh  -- command line of RNAViralSPAdes assembly of each SRA individually
coassemble.sh             -- command line of RNAViralSPAdes co-assembly of all_pe.fa+all_se.fa
pro_to_fasta.py           -- Conversion of .pro.gz files to .fasta
pro_to_fasta.sh           -- wrapper
upload.sh                 -- S3 upload
@rchikhi rchikhi changed the title rVert assemblies: rVert assemblies Jan 16, 2021
@rchikhi
Collaborator Author

rchikhi commented Jan 16, 2021

  • Added 3 co-assemblies, with various k-mer sizes suggested by Anton. The largest by total assembly length is k29,43, but k33,55,77 is also worth a look: it is smaller but may have different content.

  • Added motifator results for the two 'best' co-assemblies (those whose highest k value is <= 77) using cmdline (1).
    Results are in
    s3://serratus-rayan/rVert-assembly/motifator-results/rnaviralspades_coassembly_k29,43/
    and
    s3://serratus-rayan/rVert-assembly/motifator-results/rnaviralspades_coassembly_k33,55,77/

  • Added motifator results for individual SRA accessions (kept only those with non-empty LHF) using cmdline (1).
    Results are in: s3://serratus-rayan/rVert-assembly/motifator-results/individual/

cmdline (1):

  transeq -frame 6 $input $input.aa
  base=$(basename $input)
  ./motifator   -search_rdrp $input.aa -model rdrp_model.txt  \
                -tsvout results/$base.tsv \
                -report results/$base.txt \
                -fevout results/$base.fev \
                -medhionly \
                -trim_fastaout   results/$base.trim.LHF.fa \
                -motifs_fastaout results/$base.motifs.fa
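
For running cmdline (1) over many assemblies, a minimal sketch of a wrapper is given below. It only reuses the tools and flags quoted above; the function name `run_motifator` and the `DRY_RUN` switch are assumptions added for illustration, and the block is not the script that was actually used.

```shell
#!/usr/bin/env bash
# Sketch: wrap cmdline (1) in a function so it can be applied to many files.
# Hypothetical helper; DRY_RUN=1 prints the commands instead of running them.
run_motifator() {
    local input=$1 base run
    base=$(basename "$input")
    # Expands to "echo" when DRY_RUN is set and non-empty, to nothing otherwise.
    run=${DRY_RUN:+echo}

    $run mkdir -p results
    # Six-frame translation of the nucleotide contigs (same call as cmdline (1)).
    $run transeq -frame 6 "$input" "$input.aa"
    # RdRp search on the translated sequences, same flags as cmdline (1).
    $run ./motifator -search_rdrp "$input.aa" -model rdrp_model.txt \
        -tsvout "results/$base.tsv" \
        -report "results/$base.txt" \
        -fevout "results/$base.fev" \
        -medhionly \
        -trim_fastaout "results/$base.trim.LHF.fa" \
        -motifs_fastaout "results/$base.motifs.fa"
}
```

A typical call would be `for f in individual/*/*.contigs.fasta; do run_motifator "$f"; done`, where the glob is an assumption about the directory layout. Running with `DRY_RUN=1` first shows the exact commands before anything is executed.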

@asl

asl commented Jan 16, 2021

To be 100% explicit: no HMMs here were involved :)

@rchikhi
Collaborator Author

rchikhi commented Jan 21, 2021

Update: re-uploaded s3://serratus-rayan/rVert-assembly/motifator-results/individual/ which, up until this comment, contained the wrong files (I had mistakenly run motifator on the .pro reads rather than on the contigs).

motifator has now been run on the unitigs (before_rr.fasta), contigs and scaffolds of each individual SRA.

@rchikhi
Collaborator Author

rchikhi commented Jan 21, 2021

@rcedgar asked "how many contigs give motifator hits?" To attempt to answer this, I ran:

$ grep "high-conf" *.tsv |cut -d"." -f1 |sort|uniq|wc -l
3388
$ grep "medium-conf" *.tsv |cut -d"." -f1 |sort|uniq|wc -l
112
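
Because `cut -d"." -f1` keeps only the text before the first dot of each file name, the counts above are numbers of distinct file-name prefixes (presumably SRA accessions) with at least one hit, not numbers of contigs. A hedged sketch of the same count as a reusable function follows; the function name and the `<accession>.<type>.tsv` naming are assumptions.

```shell
#!/usr/bin/env bash
# Count distinct accessions with at least one hit of a given confidence class.
# Assumes each TSV is named <accession>.<something>.tsv, so the text before
# the first "." identifies the accession (hypothetical naming).
count_accessions_with_hits() {
    local label=$1 dir=${2:-.}
    # grep -l prints each matching file once, however many hits it contains.
    find "$dir" -maxdepth 1 -name '*.tsv' -exec grep -l -- "$label" {} + \
        | sed 's#.*/##' \
        | cut -d. -f1 \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

Calling `count_accessions_with_hits high-conf .` in the results directory should reproduce the first count, provided the naming assumption holds.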

How many SRA accessions were in rVert:

$ ls ../../fasta/ |cut -d"." -f1 |sort|uniq|wc -l
70070

(It turns out that many .pro files are empty, at least 20k.)

UPDATE: Due to a bug, I didn't create fasta files for .pro files containing a single read. I will re-run, but it shouldn't change the results much.

How many SRAs were assembled into empty unitigs:

$ find ../individual/ -name "*.before_rr.fasta" -empty|wc -l
51888

How many non-empty contigs:

$ find ../individual/ -name "*.contigs.fasta" |wc -l
18182

Thus 3388/18182 = 18.6% of SRAs with non-empty contigs have a high-confidence RdRp hit.
If you count all 70070 SRAs, including those with empty assemblies, that number drops to 4.8%.
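
The arithmetic can be re-checked with a short script. This is only a sketch using the counts reported in this comment; the variable names and the `pct` helper are introduced here for illustration.

```shell
#!/usr/bin/env bash
# Counts copied from the commands above.
total_sras=70070        # fasta files, one per SRA accession
empty_unitigs=51888     # empty *.before_rr.fasta files
nonempty_contigs=18182  # *.contigs.fasta files
high_conf=3388          # accessions with a high-confidence hit

# Percentage of n out of d, rounded to one decimal place.
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f", 100 * n / d }'; }

# The two file counts are consistent: 70070 - 51888 = 18182.
echo "total - empty unitigs: $(( total_sras - empty_unitigs ))"
echo "high-conf / non-empty: $(pct "$high_conf" "$nonempty_contigs")%"
echo "high-conf / all SRAs:  $(pct "$high_conf" "$total_sras")%"
```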

@rchikhi
Collaborator Author

rchikhi commented Jan 22, 2021

Out of the 18k non-empty individual rVert assemblies, 501 have different file sizes between before_rr.fasta and contigs.fasta (cc @asl). The difference is typically small (~100 bp).

An extreme example: SRR3999033 (12.2kbp vs 8.8kbp).
Other smaller ones: DRR032780, SRR3289253, SRR5085421. Assemblies are in s3://serratus-rayan/rVert-assembly/individual/
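
A comparison like this can be sketched as follows. The function name is hypothetical, and the block assumes that each accession's `<acc>.before_rr.fasta` sits next to its `<acc>.contigs.fasta`; it is not the command that produced the numbers above.

```shell
#!/usr/bin/env bash
# Print accession, unitig file size and contig file size (in bytes, tab-separated)
# for every assembly where the two files differ in size.
list_size_differences() {
    local dir=$1 unitigs contigs a b
    find "$dir" -name '*.before_rr.fasta' | sort | while IFS= read -r unitigs; do
        contigs=${unitigs%.before_rr.fasta}.contigs.fasta
        # Skip accessions without a contigs file.
        [ -f "$contigs" ] || continue
        a=$(wc -c < "$unitigs" | tr -d ' ')
        b=$(wc -c < "$contigs" | tr -d ' ')
        if [ "$a" -ne "$b" ]; then
            printf '%s\t%s\t%s\n' "$(basename "${unitigs%.before_rr.fasta}")" "$a" "$b"
        fi
    done
}
```

`list_size_differences ../individual/ | wc -l` would then give the number of differing assemblies. Note that file sizes include FASTA header lines, so a byte difference is only a proxy for a difference in sequence length.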
