Migrate assembly data to lovelywater #237

ababaian · 2020-12-09T20:49:21Z

We need to migrate all the assembly and annotation data generated as part of Serratus to our data lake in a structured way that allows programmatic access. This is a proposed folder hierarchy for discussion, where $SRA is the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, where every file name carries a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments
├ cov_index.tsv       # Index file of CoV+ libraries
└ assembly_index.tsv  # Index file of assembled SRA libraries

assembly/cov/$SRA.cov.fa : Contigs identified to be CoV (i.e. what the 12K paper is based on)

Currently in : s3://serratus-public/assemblies/contigs/
Do not include 0B or empty files

contigs/ : The coronaSPAdes output files such as $SRA.inputdata.txt, $SRA.coronaspades.txt, $SRA.coronaspades.gene_clusters.fa ... $SRA.coronaspades.assembly_graph_with_scaffolds.gfa.gz

Currently as s3://serratus-public/assemblies/other/$SRA.coronaspades/$SRA...
Remove $SRA.coronaspades/ intermediate folder
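A minimal sketch of the flattening rule for contigs/, assuming the old keys look like assemblies/other/$SRA.coronaspades/$SRA... as described above. The function name and constants are hypothetical, and applying the "no 0B files" rule to contigs/ (it is stated above for cov/) is an assumption.

```python
from typing import Optional

OLD_PREFIX = "assemblies/other/"   # current layout in serratus-public
NEW_PREFIX = "assembly/contigs/"   # proposed flat layout in lovelywater


def flatten_contig_key(old_key: str, size_bytes: int) -> Optional[str]:
    """Return the flat destination key, or None if the object is skipped."""
    # Skip empty objects and anything outside the expected prefix.
    if size_bytes == 0 or not old_key.startswith(OLD_PREFIX):
        return None
    parts = old_key[len(OLD_PREFIX):].split("/")
    if len(parts) != 2:
        return None  # not of the form <folder>/<file>
    folder, filename = parts
    accession = folder.split(".", 1)[0]
    # Flat folders only work if every file name carries the $SRA prefix.
    if not filename.startswith(accession + "."):
        return None
    return NEW_PREFIX + filename
```

Files that return None would need a manual look (renaming or dropping) before the copy.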

annotation/

Currently as s3://serratus-public/assemblies/annotations/

gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing cov/ contig/ and annotation/ data but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
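A rough sketch of what the grabSRA helper could do, written in Python rather than shell. It only builds the AWS CLI commands (one per flat folder) and does not run them. The function name, the dest layout, and the `$SRA.*` include pattern are assumptions for illustration, not an existing script.

```python
from typing import List

# The three flat folders under assembly/ in the proposed hierarchy.
FOLDERS = ("cov", "contigs", "annotation")


def grab_sra_commands(sra: str, dest: str = ".",
                      bucket: str = "lovelywater") -> List[List[str]]:
    """Build one `aws s3 cp` command per folder for a single accession."""
    commands = []
    for folder in FOLDERS:
        commands.append([
            "aws", "s3", "cp",
            f"s3://{bucket}/assembly/{folder}/",
            f"{dest}/{folder}/",
            "--recursive",
            # Exclude everything, then re-include only this accession's files.
            "--exclude", "*",
            "--include", f"{sra}.*",
        ])
    return commands
```

Each command can then be passed to `subprocess.run(cmd, check=True)`. The trailing dot in the include pattern keeps SRR123 from also matching SRR1234.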


rchikhi · 2021-01-20T14:07:29Z

it's all staged in s3://serratus-rayan/lovelywater/assembly, please have a look before transferring to lovelywater.

Name	Size
annotation/	73.8 GB
cov/	169.2 MB
contigs/	4.0 TB

rchikhi · 2021-01-20T22:59:33Z

TODO for me next:

quenya, dicistro, satellites CS assemblies into contigs/
update access data release page

taltman · 2021-01-22T21:54:11Z

The README.md in the top-level of lovelywater is out-of-sync with the bucket directory structure.

ababaian · 2021-01-22T21:57:08Z

Most recent version is always on the Data Access Page

taltman · 2021-01-22T22:07:12Z

That page is also inconsistent. In Naming Conventions, it uses as an example, s3://lovelywater/contig/SRA123456.fa. In the Folder Organization section, there is no such folder contig, and there is no such directory in the bucket (as far as I can see).

ababaian · 2021-01-22T22:19:20Z

The assembly data has not been migrated to it yet; once that's done, this issue can be closed.

edit: updated the access page to reflect situation on the ground

rchikhi · 2021-02-27T21:10:59Z

Satellite assemblies have been migrated to s3://serratus-rayan/lovelywater/assembly/contigs, i.e. the same location as the other CoV assembly data.
For some reason, I can't find satellites' scaffolds.fasta files, only the gene_clusters.fasta are present. I tend to think I might have never copied scaffolds.fasta to S3 (likely due to a past bug that has recently been fixed) and it's likely that we were only interested in gene_clusters.fasta during the satellite analysis.

ababaian · 2021-02-27T21:35:20Z

c'est la vie. Is this the complete collection of assemblies then?

rchikhi · 2021-02-27T21:37:16Z

nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over

rchikhi · 2021-02-28T10:53:36Z

done! dicistro, quenya, satellites assemblies are copied.

total number of accessions assembled in s3://serratus-rayan/lovelywater/assembly/contigs: 56,071
total size of s3://serratus-rayan/lovelywater/: 4.9 TB
scaffolds from CoV assemblies (MFC-compressed): 0.9 TB
scaffolds from other assemblies (gzip-compressed): 0.2 TB
assembly graphs (gzip-compressed): 1.6 TB
(These could be deleted, but keeping them would make it possible to quickly regenerate assemblies, e.g. after a coronaSPAdes update, or to recover the missing scaffolds.fasta files)

Darth annotations of checkv-filtered gene_clusters (gzip-compressed): 2.0 TB
Some of those somehow made their way to the contigs/ folder. Among these, some contain a huge BAM file of reads aligned to contigs, hence the space usage. This was needed for quality control. They could be deleted, as for each of those there is another gzip file without the BAM file. Two options:

delete the large BAM-containing Darth archives and move the small ones into the annotation/ folder
keep everything and move all darth stuff to the annotation/ folder
any preference?

rchikhi · 2021-02-28T11:00:31Z

Also there is the 1k subset of accession assemblies found by the .pro analysis, wanna include it?

ababaian · 2021-02-28T18:07:15Z

yes

rchikhi · 2021-02-28T19:53:08Z

1ksubset: migration done

rchikhi · 2021-03-01T08:35:58Z

after some Slack discussions:

darth data inside contigs/ has been deleted as it's mainly redundant with the data already in annotation/, except for huge BAM files.
serratax/serraplace stuff inside contigs/ has been moved to annotation/

so I think we're done

rchikhi · 2021-03-01T08:47:15Z

hold on, i'll also move checkV analysis from contigs/ to annotation/

rchikhi · 2021-03-02T20:00:55Z

done! Here's the final content of

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments

as staged in s3://serratus-rayan/lovelywater/assembly/.

`assembly/cov`:

These are the 11,120 coronavirus assemblies made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus' manuscript for more details.

`assembly/contigs`:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes.
Depending on the assembler, a subset of these files will be present for each accession.
Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
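For programmatic access, the naming scheme above can be split into its parts. This is a minimal sketch, assuming file names have the form accession.assembler.rest and that the assembler name appears as coronaspades or rnaviralspades in any letter case (the exact casing in the bucket is not stated here). All names in the code are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ContigFile(NamedTuple):
    accession: str   # e.g. SRRXXXXXX
    assembler: str   # lower-cased assembler name
    kind: str        # remainder of the name, e.g. "scaffolds.fasta.gz"


_PATTERN = re.compile(
    r"^(?P<accession>[^.]+)\."
    r"(?P<assembler>coronaspades|rnaviralspades)\."
    r"(?P<kind>.+)$",
    re.IGNORECASE,
)


def parse_contig_filename(name: str) -> Optional[ContigFile]:
    """Split a contigs/ file name into accession, assembler and file kind."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ContigFile(match["accession"],
                      match["assembler"].lower(),
                      match["kind"])
```

Note that a kind of contigs.fa.mfc still refers to MFCompress-compressed scaffolds, as pointed out above.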

`assembly/annotation`:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz

serraplace (taxonomic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
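Since annotation/ mixes outputs of several programs in one flat folder, a small helper can tell them apart by file name. A sketch, assuming the tool name always appears as its own dot-separated token as in the listings above; the function name is hypothetical.

```python
from typing import Optional

# Tool names as they appear in the file listings above.
_TOOLS = ("darth", "serraplace", "serratax", "checkv")


def annotation_tool(filename: str) -> Optional[str]:
    """Return the program that produced an annotation/ file, if recognisable."""
    tokens = filename.lower().split(".")
    for tool in _TOOLS:
        # Whole-token match: "checkv_filtered" names the input, not the tool.
        if tool in tokens:
            return tool
    return None
```

Matching whole tokens is what keeps gene_clusters.checkv_filtered.fa.serraplace.tar.gz classified as serraplace output rather than CheckV output.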

ababaian · 2021-03-02T20:02:23Z

I'll begin data migration shortly!

ababaian · 2021-03-04T22:16:53Z

Take a look at s3://lovelywater/assembly/ and let me know if that looks alright.

Also updated the

If that looks good then close this baby!

taltman · 2021-03-10T01:57:56Z

What's the status on this? Should I be pulling data from s3://serratus-rayan/lovelywater/assembly/cov/ or s3://lovelywater/assembly/cov/?

ababaian · 2021-03-10T02:17:40Z

either is fine, they are identical. Migration is now complete. I think we're good to close this @rchikhi

rchikhi · 2021-03-10T11:21:26Z

Same number of files and size as my folder, looks good

Total Objects: 671859
   Total Size: 3.2 TiB
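The same check can be scripted. A sketch that parses two summaries in the format shown above (as printed by `aws s3 ls --recursive --summarize --human-readable`) and compares them. Function names are hypothetical, and comparing human-readable sizes is only a coarse check; byte counts would be stricter.

```python
import re
from typing import Tuple


def parse_summary(text: str) -> Tuple[int, str]:
    """Extract (object count, size string) from a listing summary."""
    objects = re.search(r"Total Objects:\s*(\d+)", text)
    size = re.search(r"Total Size:\s*(.+)", text)
    if objects is None or size is None:
        raise ValueError("summary lines not found")
    return int(objects.group(1)), size.group(1).strip()


def same_summary(staged: str, migrated: str) -> bool:
    """True if both listings report the same object count and total size."""
    return parse_summary(staged) == parse_summary(migrated)
```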

rchikhi · 2021-06-25T21:00:20Z

so, this issue is closed yet I noticed today that we never deleted anything off the original location s3://serratus-public/assemblies (though the staged location s3://serratus-rayan/lovelywater got correctly cleared). The original location still contains all the migrated data + some other less useful and non-migrated accessions, like those with partially failed assemblies, a few minia assemblies that coronaspades didn't assemble, etc. I see 48268 coronaspades assemblies on lovelywater and 51756 coronaspades folders on serratus-public (possibly empty in some cases).
@ababaian, a few options:

1. delete from s3://serratus-public/assemblies only the migrated stuff
2. delete everything from s3://serratus-public/assemblies
3. keep s3://serratus-public/assemblies for some reason

I'd go for 1)

ababaian · 2021-06-29T12:44:44Z

One consideration is that serratus-public currently has version control, so you have to do a 2-pass deletion (delete the file, then delete its history) to remove data. We do need to do this, but I've been delaying until the paper is "done" so we don't whoopsy and lose some data we need. My take: let's go with (2) once the paper is done. I'll reopen the issue.
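A sketch of the "delete history" pass, under the assumption that the bucket listing comes from S3's ListObjectVersions call (a dict with "Versions" and "DeleteMarkers" lists, each entry carrying "Key" and "VersionId"). The helper below is hypothetical and only prepares batches; it does not talk to S3.

```python
from typing import Dict, Iterator, List

Entry = Dict[str, str]


def delete_batches(page: dict, batch_size: int = 1000) -> Iterator[List[Entry]]:
    """Yield lists of {Key, VersionId} covering every version and delete marker."""
    entries = [
        {"Key": item["Key"], "VersionId": item["VersionId"]}
        for field in ("Versions", "DeleteMarkers")
        for item in page.get(field, [])
    ]
    # A single S3 multi-object delete request accepts at most 1000 keys.
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]
```

With boto3, each batch could be passed as `Delete={"Objects": batch}` to `delete_objects`. Deleting by explicit VersionId is what removes the history rather than just adding a delete marker, so this is worth dry-running on a listing first.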

ababaian assigned rchikhi Dec 9, 2020

rchikhi closed this as completed Mar 10, 2021

ababaian reopened this Jun 29, 2021
