Download fasta file #257

Open
MjelleLab opened this issue Feb 14, 2022 · 10 comments

@MjelleLab

Hi,
I wonder where I can access the FASTA files of the genomes for the RNA viruses within Serratus.

@asl

asl commented Feb 14, 2022

@MjelleLab Serratus did not assemble complete genomes of RNA viruses. Only RdRPs which are part of PalmDB (@ababaian please correct me if I'm wrong). However, there are some assemblies, in particular for all CoVs. See more information at https://github.com/ababaian/serratus/wiki/Assembly-Data

@rcedgar
Collaborator

rcedgar commented Feb 14, 2022

Actually we (meaning @rchikhi IIRC) did complete assemblies of hundreds(?) of metagenome libraries generating many contigs with full and partial RNA virus genomes from many different phyla and families. Hopefully the SQL API gives a way to identify the RdRP+ contigs within those assemblies, this would be @ababaian's department. Looks like a gap in the database and/or wiki documentation that we don't explain how to find the RNA virus contigs.

@rchikhi
Collaborator

rchikhi commented Feb 14, 2022

For the assemblies we generated, we only specifically looked at extracting CoVs at the time. For other viral families, a reasonable strategy would be to browse the Serratus database to identify which SRA accessions have RdRP+ contigs, intersect that list of accessions with the list of assemblies we generated, and then run a generic viral identification tool (e.g. Virsorter) on those assemblies.
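A minimal sketch of the intersection step described above, assuming you already have two plain-text lists with one SRA accession per line: one for runs with RdRP+ contigs and one for runs with an available assembly. The function names and the accessions in the demo are hypothetical and not part of Serratus.

```python
def read_accessions(lines):
    """Collect non-empty, whitespace-stripped accession IDs from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def assemblies_with_rdrp(rdrp_runs, assembled_runs):
    """Return, sorted, the accessions that appear in both collections."""
    return sorted(set(rdrp_runs) & set(assembled_runs))


if __name__ == "__main__":
    # Made-up accessions, for illustration only.
    rdrp_runs = read_accessions(["SRR0000001\n", "SRR0000002\n", "\n"])
    assembled_runs = read_accessions(["SRR0000002\n", "SRR0000003\n"])
    print(assemblies_with_rdrp(rdrp_runs, assembled_runs))
```

`read_accessions` also accepts an open file handle, so the same code works on real lists saved to disk. The resulting accessions are the assemblies worth passing to a viral identification tool.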

@rcedgar
Collaborator

rcedgar commented Feb 14, 2022

@rchikhi I don't think that's correct. I'm pretty sure we did a macro-micro comparison where we made a large batch of macro-assemblies (complete SRAs) to validate micro-assemblies (diamond hits only) as part of the QC for our protein search methodology. If you don't remember this I can try to dig up backups with my notes; unfortunately they're on a drive that recently got corrupted.

@rcedgar
Collaborator

rcedgar commented Feb 14, 2022

We used a couple of methods including Virsorter to classify palmprint+ contigs as viral / other as part of the same exercise. See Ext Data Fig 2(h) in the published paper: "(h) Kingdom predicted by Virsorter2 for RdRP+ contigs (by Palmscan) obtained by full assembly of 880 randomly chosen RdRP+ runs". These 880 runs were the successful assemblies from a list of 1k attempted.

@rcedgar
Collaborator

rcedgar commented Feb 14, 2022

We should post+document the RdRP+ contigs from those 880 complete assemblies if this is not already done; for sure something should be added to the Wiki page mentioned earlier in this issue thread.

@ababaian
Owner

ababaian commented Feb 14, 2022

Short answer @MjelleLab: I'd lean towards Rayan's strategy. We provide an index of RdRP sequences/barcodes to identify where in the SRA a particular RNA virus (or related viruses) can be found. If this index is sufficient for you, I would suggest either trying palmID with an input RdRP sequence to find which SRA libraries contain potential matches, or searching through the micro-assembly data directly (explained here).

Long answer: as others have said, we have something like 56K assemblies, with about 50K of those being from Coronavirus libraries. You can list the SRA libraries with available assemblies with `aws s3 ls s3://lovelywater/assembly/contigs/` and check for a file named like `DRR029953.coronaspades.contigs.fa.mfc` (note the MFC compression).
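As a rough sketch, the listing from that command could be turned into a set of accessions like this. It assumes the usual `aws s3 ls` line layout (date, time, size, object name) and that the accession is the part of the file name before the first dot, as in the example above. The function name, the default suffix, and the sample listing are illustrative assumptions; files from other assemblers may be named differently.

```python
def accessions_from_listing(listing_text, suffix=".contigs.fa.mfc"):
    """Extract run accessions from `aws s3 ls` output saved as text.

    Only object names ending in `suffix` are kept; the accession is taken
    to be everything before the first dot in the object name.
    """
    accessions = set()
    for line in listing_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[-1]  # the object name is the last column
        if key.endswith(suffix):
            accessions.add(key.split(".", 1)[0])
    return accessions


if __name__ == "__main__":
    # Made-up listing: dates and sizes are placeholders, not real bucket contents.
    sample = (
        "2020-01-01 00:00:00     123456 DRR029953.coronaspades.contigs.fa.mfc\n"
        "2020-01-01 00:00:00         42 README.txt\n"
    )
    print(accessions_from_listing(sample))
```

The listing can be captured by redirecting the command's output to a file and reading that file into a string before calling the function.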

@rcedgar, I think a good SQL interface would be great :) We should slap it on the TODO list and integrate it into the web-UI.

@ababaian
Owner

Maybe an addendum @MjelleLab, could you tell us what YOU would find most useful? We have organized the data internally within the project, but if we better understand use-cases from users we can offer better solutions in how we serve the available data.

@rchikhi
Collaborator

rchikhi commented Feb 15, 2022

Regarding #257 (comment): @rcedgar you're right, I had overlooked that experiment! It's "only" a subset of 880 assemblies, but indeed there are some potentially novel viral contigs in there.

@rcedgar
Collaborator

rcedgar commented Feb 15, 2022

Some of these had hundreds of viruses, I believe we found something like 10-20% of novel species in the 880 assemblies. Novel RdRPs are strongly concentrated in large metagenomes/viromes, and these were preferentially chosen by the random selection of the 1k subset.
