all serratus assemblies (60k accessions) #260

Open
rchikhi opened this issue Mar 7, 2022 · 13 comments

@rchikhi
Collaborator

rchikhi commented Mar 7, 2022

This describes the gathering of all the assemblies generated by Serratus into a single file.

TLDR: It's available at https://lovelywater.s3.amazonaws.com/assembly/rdva/rdva_v0.2.fa.lz4
(updated 2022-05-14)

How this file was constructed: it's a concatenation of:

  • All scaffolds.fasta (or contigs.fa) files present on s3://lovelywater/assembly/contigs/ prior to Feb 2022.
  • All assemblies from the following runs on s3://serratus-public/assemblies/: epsys_120_july21, infernal_59_feb22, palmfold_5k_feb22, phage_april21, other (quenya, dicistro, 1krandom and a subset of other were already added to lovelywater)
  • Added "ribozyviria+" datasets from drz0
  • Removed SRA-retracted datasets

Minor caveats:

  • Due to earlier nomenclature, some files are named contigs.fa but they're actually scaffolds.
  • In a subset (around 3300) of our earlier assemblies, scaffolds were lost and gene_clusters.fa was the only FASTA result kept. Those gene_clusters.fa files aren't included in the big lz4 file as they're not complete assemblies of an accession. In a subset (~950) of those accessions, the assembly graph was kept, so I was able to recover a complete assembly using the script https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/assembly_graph_to_scaffolds.py, upload it as contigs.fasta, and include it in the big lz4 file.
  • In some cases we had assembled an accession both using Coronaspades and Rnaviralspades. My decision to keep one or the other was based on whether the detected virus in the master table was a coronavirus (if so, coronaspades was kept, otherwise rnaviralspades). https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/all_we_assembled/scripts/choose_cs_or_rs.py
  • In 3 cases we had both metaspades and rnaviralspades assemblies; I kept metaspades.
  • In our initial runs uploaded to lovelywater, some accessions aren't assembled, and only the log script of the failed assembly attempt was uploaded.
  • All assembly output files that weren't already on lovelywater have been staged on s3://serratus-rayan/lovelywater/contigs/ for @ababaian to upload to lovelywater (5448 new accessions!).
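
The selection rules in the caveats above can be summarised in a few lines. This is only a minimal sketch and not the code in choose_cs_or_rs.py: the function name, the lowercase assembler labels and the `is_coronavirus` flag are assumptions made for illustration, and combinations the post does not describe are rejected rather than guessed.

```python
def choose_assembly(available, is_coronavirus):
    """Return the assembler whose output is kept for one accession.

    available: names of the assemblers that produced an assembly (hypothetical labels)
    is_coronavirus: True if the virus detected for this accession is a coronavirus
    """
    available = set(available)
    if not available:
        raise ValueError("no assembly available for this accession")
    if len(available) == 1:
        # Nothing to choose between.
        return next(iter(available))
    if available == {"coronaspades", "rnaviralspades"}:
        # Rule from the post: coronaspades for coronaviruses, rnaviralspades otherwise.
        return "coronaspades" if is_coronavirus else "rnaviralspades"
    if available == {"metaspades", "rnaviralspades"}:
        # Rule from the post: metaspades was kept in these 3 cases.
        return "metaspades"
    raise ValueError(f"no rule described for combination: {sorted(available)}")
```

For example, `choose_assembly({"coronaspades", "rnaviralspades"}, is_coronavirus=False)` returns `"rnaviralspades"`.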
@rcedgar
Collaborator

rcedgar commented Mar 7, 2022

Hi @rchikhi wow this is great! @ababaian flagging that availability & locations of assemblies should be prominent in the documentation & from web home page once the migration to lovelywater is completed -- this is a valuable resource a la TSA but Serratus users are generally not aware of this.

@rchikhi
Collaborator Author

rchikhi commented Mar 8, 2022

Some stats:

# number of lines
$ lz4cat all_serratus_assemblies_05032022.fa.lz4 | wc -l
115901154997

# number of sequences
$ lz4cat all_serratus_assemblies_05032022.fa.lz4 | grep "^>" | wc -l
13455281410

@rcedgar
Collaborator

rcedgar commented Mar 8, 2022

115,901,154,997 = 116 x 10^9
13,455,281,410 = 13 x 10^9

but should be bases not lines, what is a line anyway ;)

@rchikhi
Collaborator Author

rchikhi commented Mar 9, 2022

hehe agreed that the lines metric is a bit arbitrary (only useful if you're tracking progress of parsing this file line by line).
Here's a more useful one, number of bases: 5.9 trillion

$ \time lz4cat all_serratus_assemblies_05032022.fa.lz4 | seqkit stats
8552.58user 2539.75system 7:38:42elapsed 40%CPU (0avgtext+0avgdata 7204maxresident)k
4963841248inputs+0outputs (0major+1548minor)pagefaults 0swaps
file  format  type        num_seqs            sum_len  min_len  avg_len    max_len
-     FASTA   DNA   13,455,281,410  5,950,865,353,293       12    442.3  2,480,648
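
To make the difference between the three metrics in this thread concrete (lines, sequences, bases), here is a minimal sketch of a streaming FASTA counter. It is not how the numbers above were produced (those come from `wc`, `grep` and `seqkit stats`); the function name and the returned dictionary keys are assumptions for illustration.

```python
def fasta_stats(lines):
    """Stream FASTA lines once and count lines, sequences and bases."""
    num_lines = 0

    def record_lengths():
        # Yield the length of each record; a record starts at a '>' header line.
        nonlocal num_lines
        length = None  # None until the first header is seen
        for line in lines:
            num_lines += 1
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
        if length is not None:
            yield length

    num_seqs = sum_len = 0
    min_len = max_len = None
    for n in record_lengths():
        num_seqs += 1
        sum_len += n
        min_len = n if min_len is None else min(min_len, n)
        max_len = n if max_len is None else max(max_len, n)
    return {"num_lines": num_lines, "num_seqs": num_seqs,
            "sum_len": sum_len, "min_len": min_len, "max_len": max_len}
```

Passing `sys.stdin` as `lines` would let it sit at the end of an `lz4cat ... |` pipe, though on a file of this size a pure Python loop is likely to be much slower than seqkit.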

@rchikhi
Collaborator Author

rchikhi commented Mar 9, 2022

The newly uploaded file s3://serratus-public/assemblies/all_serratus_assemblies_05032022.k_values.txt contains the k values of each assembly. Specifically, the last k value used by SPAdes.
(For only 6 of the accessions, the k value couldn't be retrieved and is reported as "-1".)

@rchikhi
Collaborator Author

rchikhi commented Mar 10, 2022

and this file contains all the circular contigs within the assemblies: s3://serratus-public/assemblies/all_serratus_assemblies_05032022.only_circles.fasta

@Anto007

Anto007 commented Nov 9, 2022

Looks like a super-cool resource and many thanks for your pains in generating this wonderful piece of data. Before I go ahead & get the resources for the local storage of this gigantic assembly, I was wondering if a BLAST/Diamond search for my gene of interest is going to be feasible if I use this resource as my BLAST database. I haven't worked with such a big BLAST database before and so I was wondering if you would have any insights/suggestions in this regard. I have access to high-end scientific workstations as well as a high performance computing cluster at my end. Thank you again!

@rchikhi
Collaborator Author

rchikhi commented Nov 9, 2022

You could run minimap2 with that database as query and your gene of interest as reference (important not to switch ref<>query here).
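
A minimal sketch of what that argument order looks like as a command line, assuming minimap2 reads the query from standard input when given `-` and that no preset is needed for the gene in question. The helper name and the file names `gene.fa` and `hits.paf` are hypothetical. minimap2 indexes its first positional argument (the target), so putting the small gene there keeps the index tiny while the large database streams through as the query.

```python
import shlex


def minimap2_command(gene_fasta, database_lz4, out_paf):
    """Build a shell pipeline with the gene as target and the database as query."""
    return "lz4cat {db} | minimap2 {ref} - > {out}".format(
        db=shlex.quote(database_lz4),   # decompressed on the fly, never stored
        ref=shlex.quote(gene_fasta),    # first positional argument: the reference
        out=shlex.quote(out_paf),       # minimap2 writes PAF by default
    )
```

`minimap2_command("gene.fa", "rdva_v0.2.fa.lz4", "hits.paf")` gives `lz4cat rdva_v0.2.fa.lz4 | minimap2 gene.fa - > hits.paf`.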

@ababaian
Owner

ababaian commented Nov 9, 2022

Everything is possible with enough motivation :D
But seriously, it could take quite a bit of time to generate the BLAST or diamond database, but it is feasible. An alternative would be to run hmmscan with your sequence as input, and that could work well too.

@Anto007

Anto007 commented Nov 10, 2022

Many thanks for your useful suggestions @rchikhi and @ababaian. I think I'll give it a try with minimap2, nhmmer and also mmseqs2. I noticed that you have listed all the accession numbers of the datasets used in generating this assembly, and I was curious to know whether you have the sample source biome information (e.g., soil, ocean, human microbiome) for these listed somewhere as well. Alternatively, I suppose I can just esearch & efetch the info myself. I would also appreciate it very much if you could let me know how best to cite this wonderful resource of yours in case I go ahead and use it in my work. Thank you again.
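
As a rough sketch of the esearch step, the snippet below only builds the request URL for NCBI E-utilities and does not contact the server. The helper name and the accession `SRR0000000` are placeholders, and where the biome or sample source ends up in the returned records is not covered here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(accession, db="sra"):
    """Return an esearch URL that looks up one accession in the given database."""
    params = {"db": db, "term": accession, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Placeholder accession, for illustration only.
print(esearch_url("SRR0000000"))
```

The ids returned by esearch would then be passed to efetch for the same database to retrieve the full records.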

@Anto007

Anto007 commented Nov 16, 2022

Sorry, I'm a bit confused about one vital piece of info: @rchikhi, could you confirm whether rdva_v0.2.fa.lz4 represents not only metatranscriptomes but also metagenomes? From a quick look, it definitely seems so. In your Twitter thread, you seemed to suggest that the assembly consists of metatranscriptomes, or perhaps I misinterpreted your tweet?

@ababaian
Owner

These will contain metatranscriptomes, RNAseq, metagenomes and possibly a few more exotic sequence datasets (ChIP-seq). Selection was automated based on sequences which were detected within the library (e.g. Coronavirus reads), as human annotation of data is faulty. Buyer beware.

@permia

permia commented Dec 6, 2023

Wow, have you annotated all the sequences and found all the virus/dark sequences?
