Migrate assembly data to lovelywater #237
Comments
it's all staged in
|
TODO for me next:
|
The README.md in the top-level of |
Most recent version is always on the Data Access Page |
That page is also inconsistent. In Naming Conventions, it uses as an example, |
The assembly data has not been migrated to it yet; once that's done, this issue can be closed. edit: updated the access page to reflect the situation on the ground |
Satellites assemblies have been migrated, to |
c'est la vie. Is this the complete collection of assemblies then? |
nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over |
done! dicistro, quenya, satellites assemblies are copied. total number of accessions assembled in Darth annotations of checkv-filtered gene_clusters (gzip-compressed): 2.0 TB
|
Also there is the 1k subset of accession assemblies found by the |
yes |
1ksubset: migration done |
after some Slack discussions:
so I think we're done |
hold on, i'll also move checkV analysis from |
done! Here's the final content of
as staged in
|
I'll begin data migration shortly! |
Take a look at Also updated the If that looks good then close this baby! |
What's the status on this? Should I be pulling data from |
either is fine, they are identical. Migration is now complete. I think we're good to close this @rchikhi |
Same number of files and size as my folder, looks good
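
A check like "same number of files and size" can be scripted. Below is a minimal sketch in Python, assuming each location has already been listed into a mapping of relative file name to size in bytes (for example by parsing the output of `aws s3 ls --recursive`, which is not shown here). The function name `compare_listings` and the example file names are hypothetical, not part of the project.

```python
def compare_listings(src: dict[str, int], dst: dict[str, int]) -> dict:
    """Compare two {relative_file_name: size_in_bytes} listings.

    Reports file counts, total sizes, files missing from the destination,
    files whose sizes differ, and zero-byte files in the destination.
    """
    shared = src.keys() & dst.keys()
    return {
        "src_count": len(src),
        "dst_count": len(dst),
        "src_bytes": sum(src.values()),
        "dst_bytes": sum(dst.values()),
        "missing_in_dst": sorted(src.keys() - dst.keys()),
        "size_mismatch": sorted(k for k in shared if src[k] != dst[k]),
        "empty_in_dst": sorted(k for k, size in dst.items() if size == 0),
    }


# Hypothetical listings, for illustration only.
old = {"SRR0001.cov.fa": 120, "SRR0002.cov.fa": 300}
new = {"SRR0001.cov.fa": 120, "SRR0002.cov.fa": 0}
report = compare_listings(old, new)
```

Equal counts and totals are a quick sanity check; the per-file lists catch the case where totals happen to match but individual files do not.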
|
so, this issue is closed yet I noticed today that we never deleted anything off the original location
I'd go for 1) |
One consideration is |
We need to migrate all the assembly and annotation data generated as part of Serratus to our data-lake in a structured way so as to allow for programmatic access. This is a proposed folder hierarchy to discuss, where we have $SRA as the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, where every file name carries a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

assembly/cov/$SRA.cov.fa : Contigs identified to be CoV (i.e. what the 12K paper is based on)
s3://serratus-public/assemblies/contigs/
0B or empty files
contigs/ : The coronaSPAdes output files such as $SRA.inputdata.txt, $SRA.coronaspdes.txt, $SRA.coronaspdes.gene_clusters.fa ... $SRA.coronaspdes.assembly_graph_with_scaffolds.gfa.gz
s3://serratus-public/assemblies/other/$SRA.coronaspades/$SRA...
$SRA.coronaspades/ intermediate folder
annotation/
s3://serratus-public/assemblies/annotations/
gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing the cov/, contig/ and annotation/ data, but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
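
To make the idea of a per-accession download helper concrete, here is a rough sketch in Python of what it might do under the flat layout. It only builds AWS CLI command strings and does not run them. The S3 prefixes are the ones named in this proposal; the function name `grab_commands`, the local destination layout, and the use of `--no-sign-request` (which assumes the bucket allows anonymous reads) are assumptions for illustration, and this is not the actual grabSRA.sh.

```python
import shlex

BASE = "s3://serratus-public/assemblies"  # prefix taken from the proposal
# Flat folders where every file name is expected to start with "<SRA>."
FLAT_CATEGORIES = ("contigs", "annotations")


def grab_commands(sra: str, dest: str = ".") -> list[str]:
    """Return AWS CLI commands (as strings, not executed) that would fetch
    the files belonging to one accession."""
    if not sra or "/" in sra or "*" in sra:
        raise ValueError(f"not a plausible accession: {sra!r}")
    commands = []
    for category in FLAT_CATEGORIES:
        commands.append(shlex.join([
            "aws", "s3", "cp",
            f"{BASE}/{category}/", f"{dest}/{category}/",
            "--recursive",
            "--exclude", "*",         # drop everything ...
            "--include", f"{sra}.*",  # ... then keep only this accession
            "--no-sign-request",      # assumption: anonymous reads allowed
        ]))
    # The remaining coronaSPAdes outputs sit in a per-accession folder.
    commands.append(shlex.join([
        "aws", "s3", "cp",
        f"{BASE}/other/{sra}.coronaspades/",
        f"{dest}/other/{sra}.coronaspades/",
        "--recursive",
        "--no-sign-request",
    ]))
    return commands


# "SRR0000000" is a placeholder accession, not a real dataset.
for command in grab_commands("SRR0000000"):
    print(command)
```

Because every file name starts with the accession, a single `--include` pattern per folder is enough; a nested contig/$SRA/ layout would instead need one folder copy per category.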