all serratus assemblies (60k accessions) #260
Comments
Some stats:
|
but should be bases not lines, what is a line anyway ;) |
Hehe, agreed that the lines metric is a bit arbitrary (it's only useful if you're tracking progress while parsing this file line by line).
|
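For anyone who wants the stats in bases rather than lines, here is a minimal Python sketch that counts sequences and bases from FASTA text. It assumes the decompressed file is fed in line by line (for example from `lz4cat` on stdin); the function name `fasta_stats` is made up for illustration and is not part of Serratus.

```python
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of sequences, total bases) for an iterable of FASTA lines."""
    n_seqs = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_seqs += 1  # every header line starts a new record
        else:
            n_bases += len(line)  # a sequence may wrap over several lines
    return n_seqs, n_bases


# Hypothetical usage, with the file streamed on stdin:
#   lz4cat rdva_v0.2.fa.lz4 | python -c "import sys, stats; print(stats.fasta_stats(sys.stdin))"
```

Counting this way is independent of how the sequences are wrapped, which is what makes bases a more meaningful metric than lines.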
The newly uploaded file |
and this file contains all the circular contigs within the assemblies: |
Looks like a super-cool resource and many thanks for your pains in generating this wonderful piece of data. Before I go ahead & get the resources for the local storage of this gigantic assembly, I was wondering if a BLAST/Diamond search for my gene of interest is going to be feasible if I use this resource as my BLAST database. I haven't worked with such a big BLAST database before and so I was wondering if you would have any insights/suggestions in this regard. I have access to high-end scientific workstations as well as a high performance computing cluster at my end. Thank you again! |
You could run minimap2 with that database as query and your gene of interest as reference (important not to switch ref<>query here). |
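A minimal sketch of that setup in Python, assuming `lz4cat` and `minimap2` are on the PATH and that `gene.fa` is a hypothetical FASTA file holding the gene of interest. The gene goes in the target (reference) position and the decompressed assemblies are streamed in as the query on stdin (`-`), so the big file never has to be indexed or written out uncompressed. No preset is chosen here; which `-x` preset suits a given gene is left open.

```python
import subprocess
from typing import List, Tuple


def build_commands(db_lz4: str, gene_fasta: str, threads: int = 4) -> Tuple[List[str], List[str]]:
    """Build the two halves of the pipeline: lz4cat DB | minimap2 GENE -"""
    decompress = ["lz4cat", db_lz4]
    # minimap2 takes <target> then <query>: the gene comes first, "-" (stdin) second.
    align = ["minimap2", "-t", str(threads), gene_fasta, "-"]
    return decompress, align


def run_search(db_lz4: str, gene_fasta: str, out_paf: str, threads: int = 4) -> None:
    """Stream the assemblies through minimap2 and write PAF hits to out_paf."""
    decompress, align = build_commands(db_lz4, gene_fasta, threads)
    with open(out_paf, "w") as out:
        p1 = subprocess.Popen(decompress, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(align, stdin=p1.stdout, stdout=out)
        p1.stdout.close()  # so lz4cat stops if minimap2 exits early
        p2.wait()
        p1.wait()
    if p2.returncode != 0:
        raise subprocess.CalledProcessError(p2.returncode, align)


# Hypothetical usage:
#   run_search("rdva_v0.2.fa.lz4", "gene.fa", "hits.paf", threads=16)
```

Keeping the command construction in its own function makes the ref/query order easy to check before launching a long run.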
Everything is possible with enough motivation :D |
Many thanks for your useful suggestions @rchikhi and @ababaian. I think I'll give it a try with minimap2, nhmmer and also mmseqs2. I noticed that you have listed all the accession numbers of the datasets that were used in generating this assembly, and I was curious to know if you've got the sample source biome information (e.g. soil, ocean, human microbiome) for these listed somewhere as well. Alternatively, I suppose I can just use esearch & efetch to fetch the info. I would also appreciate it very much if you could let me know how best to cite this wonderful resource of yours in case I go ahead and use it in my work. Thank you again. |
Sorry, I'm a bit confused about one vital piece of info- @rchikhi could you confirm if this |
These will contain metatranscriptomes, RNAseq, metagenomes and possibly a few more exotic sequence datasets (ChIP-seq). Selection was automated based on sequences which were detected within the library (i.e. Coronavirus reads), as human annotation of data is faulty. Buyer beware. |
Wow, have you annotated all the sequences and found all the virus/dark sequences? |
This describes the gathering of all the assemblies generated by Serratus into a single file.
TLDR: It's available at https://lovelywater.s3.amazonaws.com/assembly/rdva/rdva_v0.2.fa.lz4 (updated 2022-05-14)
- Decompress with `lz4cat [file]`
- Header format: `>[SRA identifier] [scaffold identifier]`, e.g. `>DRR001151 NODE_1_length_15617_cov_7148.125289`
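To illustrate the header layout, here is a small Python sketch that splits a header into its SRA identifier and scaffold identifier. The length and coverage fields are pulled out of the scaffold name on the assumption that it follows the `NODE_<n>_length_<len>_cov_<cov>` pattern seen in the example; scaffolds named differently get `None` for those fields. `parse_header` and `Header` are illustrative names only.

```python
import re
from typing import NamedTuple, Optional

# Assumed naming pattern, taken from the single example header in this issue.
_NODE_RE = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


class Header(NamedTuple):
    accession: str
    scaffold: str
    length: Optional[int]
    coverage: Optional[float]


def parse_header(line: str) -> Header:
    """Split '>[SRA identifier] [scaffold identifier]' into its parts."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The first space separates the accession from the scaffold identifier.
    accession, _, scaffold = line[1:].strip().partition(" ")
    match = _NODE_RE.match(scaffold)
    if match is None:
        return Header(accession, scaffold, None, None)
    return Header(accession, scaffold, int(match.group(2)), float(match.group(3)))
```

Grouping records by `accession` is then enough to recover the per-dataset assemblies from the concatenated file.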
How this file was constructed: it's a concatenation of:
- `scaffolds.fasta` (or `contigs.fa`) files present on `s3://lovelywater/assembly/contigs/` prior to Feb 2022.
- `s3://serratus-public/assemblies/`: `epsys_120_july21`, `infernal_59_feb22`, `palmfold_5k_feb22`, `phage_april21`, `other` (`quenya`, `dicistro`, `1krandom` and a subset of `other` were already added to lovelywater)

Minor caveats:
- Some files are named `contigs.fa` but they're actually scaffolds.
- For some accessions, `gene_clusters.fa` was the only FASTA result kept. Those `gene_clusters.fa` files aren't included in the big lz4 file as they're not complete assemblies of an accession. In a subset (~950) of those accessions, the assembly graph was kept and I was able to recover a complete assembly using the script https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/assembly_graph_to_scaffolds.py and uploaded it as a `contigs.fasta` as well as included it in the big lz4 file.
- The new files are in `s3://serratus-rayan/lovelywater/contigs/` for @ababaian to upload to lovelywater (5448 new accessions!).
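The concatenation step could be sketched roughly as below. This is not the script actually used to build the file, only an illustration of prefixing each scaffold header with its accession so that records end up in the `>[SRA identifier] [scaffold identifier]` layout. `prefix_headers` and `concatenate` are hypothetical names, and the inputs are assumed to be one already-opened text stream per accession.

```python
from typing import Iterable, Iterator, Mapping


def prefix_headers(accession: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with the accession prepended to every header."""
    for line in lines:
        if line.startswith(">"):
            yield f">{accession} {line[1:].rstrip()}\n"
        else:
            # Make sure every sequence line ends with exactly one newline,
            # so that files lacking a trailing newline concatenate cleanly.
            yield line.rstrip("\n") + "\n"


def concatenate(assemblies: Mapping[str, Iterable[str]]) -> Iterator[str]:
    """Chain the per-accession assemblies into one FASTA stream."""
    for accession, lines in assemblies.items():
        yield from prefix_headers(accession, lines)
```

Because both functions are generators, the combined stream can be written straight to a compressor without holding any assembly in memory.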