all .pro assemblies #242

Open
rchikhi opened this issue Jan 23, 2021 · 10 comments

Comments

@rchikhi
Collaborator

rchikhi commented Jan 23, 2021

This thread will be for updates of the .pro assemblies.

number of .pro.gz files analyzed (all of s3://serratus-public/out/21* except *r1p*):

5,726,283

number of .fasta.gz obtained after converting .pro to FASTA and discarding empty files:

3,379,127

@rchikhi rchikhi changed the title all .pro assemblies all .pro assemblies Jan 23, 2021
@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

Assemblies done (measured by before_rr.fasta existing):

3,378,813

(no idea why in ~300 cases, no before_rr.fasta was created)

Number of empty assemblies:

2,890,521

Thus, non-empty assemblies (i.e. both before_rr.fasta and contigs.fasta exist and are non-empty):

488,292 (14.4%)

For reference, 19% of the rVert assemblies were non-empty.
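
The counts reported so far can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the numbers are copied from the comments above, and the variable names are illustrative, not taken from any pipeline script.

```python
# Counts copied from this thread; variable names are illustrative.
fasta_inputs = 3_379_127      # non-empty .fasta.gz files after .pro conversion
assemblies_done = 3_378_813   # runs where before_rr.fasta exists
empty_assemblies = 2_890_521  # runs reported as empty assemblies

# Runs that produced no before_rr.fasta at all (the "~300 cases").
missing = fasta_inputs - assemblies_done

# Non-empty assemblies, as a count and as a share of completed assemblies.
non_empty = assemblies_done - empty_assemblies
non_empty_pct = 100 * non_empty / assemblies_done

print(f"missing before_rr.fasta: {missing}")
print(f"non-empty assemblies: {non_empty} ({non_empty_pct:.2f}%)")
```

The difference between inputs and completed assemblies comes out to 314, which matches the "~300 cases" mentioned above.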

@asl

asl commented Jan 24, 2021

@rchikhi

(no idea why in ~300 cases, no before_rr.fasta was created)

Likely the assembly failed. Can you collect a few logs from those cases?

@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

Can do, let me just finish with the bulk of the results first.

Number of non-empty trim.LHF.fa motifator files:

168,460

@rcedgar
Collaborator

rcedgar commented Jan 24, 2021

Hi @rchikhi Minor feature request/suggestion for future runs: can you combine all micro-assemblies into one FASTA file? This file should not be too big, only around 1 Gb or so. This would be easier to process on Linux than millions of small FASTAs or millions of directories, each with a small/empty FASTA. This would require embedding the SRA identifier in the sequence label a.k.a. FASTA defline, e.g. as a prefix >SRA1234567|NODE_1..., something like that.

@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

Data availability

Individual assemblies (excluding empty files):

s3://serratus-rayan/pro-assembly/individual/

Individual motifator analyses of the above assemblies:

s3://serratus-rayan/pro-assembly/individual_motifator/

For download convenience, the above two folders (assemblies and motifator analyses) are packaged into a tar.gz file each:

s3://serratus-rayan/pro-assembly/individual_assemblies.tar.gz
s3://serratus-rayan/pro-assembly/individual_motifator.tar.gz

All these folders are relatively small (~10GB) but contain on the order of millions of files.
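
Since the archives hold millions of small files, it may be easier to stream a downloaded tar.gz than to unpack it. A minimal sketch using Python's standard `tarfile` module, assuming the archive has already been copied locally; the function name and the local path in the usage note are hypothetical.

```python
import tarfile

def count_nonempty_members(tar_path):
    """Count regular files, and non-empty regular files, in a .tar.gz
    without extracting anything to disk."""
    total = 0
    non_empty = 0
    # "r|gz" reads the gzip-compressed archive sequentially as a stream.
    with tarfile.open(tar_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            total += 1
            if member.size > 0:
                non_empty += 1
    return total, non_empty
```

Usage would look like `count_nonempty_members("individual_assemblies.tar.gz")` on a local copy of the archive.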

@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

In addition, for @rcedgar, here are all the motifator outputs (just the LHF files) concatenated into a single file:

s3://serratus-rayan/pro-assembly/all.before_rr.LHF.fasta
s3://serratus-rayan/pro-assembly/all.contigs.LHF.fasta

SRR id is added as follows: >[SRR id][a single space][contig name] e.g. >SRR0123123 NODE_1_xxx.
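
The header convention above can be sketched as follows. This is not the script that produced the files; it is a minimal illustration of the `>[SRR id][single space][contig name]` format, with hypothetical function names.

```python
def relabel_fasta(lines, srr_id):
    """Yield FASTA lines with srr_id and a single space prepended to each header."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">NODE_1_xxx" becomes ">SRR0123123 NODE_1_xxx"
            yield f">{srr_id} {line[1:]}"
        elif line:
            yield line  # sequence line, passed through unchanged

def split_header(header):
    """Split '>SRR0123123 NODE_1_xxx' into ('SRR0123123', 'NODE_1_xxx')."""
    srr_id, _, contig_name = header[1:].partition(" ")
    return srr_id, contig_name
```

Splitting on the first space only keeps the contig name intact even if it contains further spaces.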

@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

And concatenated unitigs/contigs:

s3://serratus-rayan/pro-assembly/all.before_rr.fasta
s3://serratus-rayan/pro-assembly/all.contigs.fasta
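
A concatenated file like these can be summarized per run in a single pass. The sketch below assumes these files use the same `>[SRR id] [contig name]` header convention as the LHF files, which this comment does not state explicitly; the function name is hypothetical.

```python
from collections import defaultdict

def contig_stats_by_run(lines):
    """Return {srr_id: [n_contigs, total_bases]} for a concatenated FASTA.

    Assumes each header starts with the SRR id followed by a single space.
    """
    stats = defaultdict(lambda: [0, 0])
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The run accession is everything before the first space.
            current = line[1:].split(" ", 1)[0]
            stats[current][0] += 1
        elif current is not None:
            stats[current][1] += len(line)
    return dict(stats)
```

For a local copy, `with open("all.contigs.fasta") as fh: stats = contig_stats_by_run(fh)` would give the contig count and total length per accession.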

@rchikhi
Collaborator Author

rchikhi commented Jan 24, 2021

@rchikhi
Collaborator Author

rchikhi commented Jan 25, 2021

here's an exhaustive list of "reads" that are above 600 bp among the single-end libraries:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/all_se.above_600bp.txt

from that list I extracted the set of 719 accessions that are deemed not to be Illumina short reads:

https://serratus-rayan.s3.amazonaws.com/rdrp-pan-assembly/prelim/nonILMN.txt

@rchikhi
Copy link
Collaborator Author

rchikhi commented Feb 4, 2021

Coverage analysis of the motifator hits within the .pro assemblies

s3://serratus-rayan/pro-assembly/depth_summary.csv

schema:
sra, header, contig_type, p_cvg1, p_cvg2, p_cvg3-4, p_cvg5-8, p_cvg9plus

where p_cvgX is the percentage of bases of the region where coverage is >= X
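
To make the p_cvgX definition concrete, here is a small sketch that computes the percentage of positions with coverage >= X from a list of per-base depths. It is independent of the scripts linked in this comment; the function name and the example depths are made up. How the range columns (p_cvg3-4, p_cvg5-8) are binned is not spelled out here, so the sketch covers only the ">= X" definition.

```python
def pct_covered(depths, min_depth):
    """Percentage of positions in `depths` whose coverage is >= min_depth."""
    if not depths:
        return 0.0  # empty region: report 0 rather than divide by zero
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

# Made-up per-base depths for an 8-base region.
depths = [0, 1, 1, 2, 5, 9, 12, 3]
p_cvg1 = pct_covered(depths, 1)
p_cvg2 = pct_covered(depths, 2)
p_cvg9plus = pct_covered(depths, 9)
```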

code used to generate those results
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/bed_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_analysis.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-rdrp-analysis/-/blob/master/depth_summary.py


3 participants