Migrate assembly data to lovelywater #237

Open
ababaian opened this issue Dec 9, 2020 · 23 comments
@ababaian
Owner

ababaian commented Dec 9, 2020

We need to migrate all the assembly and annotation data generated as part of Serratus to our data lake in a structured way that allows programmatic access. This is a proposed folder hierarchy for discussion, where $SRA is the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, where every file carries a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments
├ cov_index.tsv       # Index file of CoV+ libraries
└ assembly_index.tsv  # Index file of assembled SRA libraries

assembly/cov/$SRA.cov.fa : Contigs identified as CoV (i.e. the set the 12K paper is based on)

  • Currently in : s3://serratus-public/assemblies/contigs/
  • Do not include 0B or empty files

contigs/ : The coronaSPAdes output files such as $SRA.inputdata.txt, $SRA.coronaspades.txt, $SRA.coronaspades.gene_clusters.fa ... $SRA.coronaspades.assembly_graph_with_scaffolds.gfa.gz

  • Currently as s3://serratus-public/assemblies/other/$SRA.coronaspades/$SRA...
  • Remove $SRA.coronaspades/ intermediate folder
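
As a rough illustration of dropping the intermediate folder, the sketch below rewrites an old-style key into the proposed flat layout. The function name `flatten_key` and the example file name are hypothetical; only the `assemblies/other/$SRA.coronaspades/` prefix and the `contigs/` target folder come from the proposal above.

```python
import posixpath

OLD_PREFIX = "assemblies/other/"   # layout in s3://serratus-public (from this issue)
NEW_PREFIX = "assembly/contigs/"   # proposed layout in s3://lovelywater


def flatten_key(old_key: str) -> str:
    """Drop the per-accession folder and keep only the $SRA-prefixed file name."""
    if not old_key.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected key: {old_key}")
    folder, filename = posixpath.split(old_key[len(OLD_PREFIX):])
    accession = folder.split(".", 1)[0]
    # A flat folder only works if every file already carries its accession.
    if not filename.startswith(accession + "."):
        raise ValueError(f"file name lacks the {accession} prefix: {old_key}")
    return NEW_PREFIX + filename
```

Rejecting files without the accession prefix, rather than renaming them silently, keeps cases like contig/$SRA/data.tsv visible during the migration.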

annotation/

  • Currently as s3://serratus-public/assemblies/annotations/

gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing cov/ contig/ and annotation/ data but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
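
A minimal sketch of what such a helper could do, written here in Python rather than as the proposed grabSRA.sh. It only builds `aws s3 cp` command lines and does not run them; the function name `grab_commands` and the local destination layout are assumptions, while the bucket and folder names follow the hierarchy proposed above.

```python
import shlex
from typing import List

BUCKET = "s3://lovelywater/assembly"
FOLDERS = ("cov", "contigs", "annotation")  # folder names from the proposed hierarchy


def grab_commands(sra: str, dest: str = ".") -> List[str]:
    """Build one `aws s3 cp` command per folder, limited to files named `<sra>.*`."""
    commands = []
    for folder in FOLDERS:
        args = [
            "aws", "s3", "cp", f"{BUCKET}/{folder}/", f"{dest}/{folder}/",
            "--recursive",
            "--exclude", "*",         # start from nothing...
            "--include", f"{sra}.*",  # ...then add back this accession only
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    for command in grab_commands("SRR123456"):
        print(command)
```

The trailing dot in the include pattern keeps SRR12345 from also matching SRR123456. As far as I know the filters are applied after listing the prefix, so this may be slow on very large flat folders.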

@rchikhi
Collaborator

rchikhi commented Jan 20, 2021

it's all staged in s3://serratus-rayan/lovelywater/assembly, please have a look before transferring to lovelywater.

Name          Size
annotation/   73.8 GB
cov/          169.2 MB
contigs/      4.0 TB

@rchikhi
Collaborator

rchikhi commented Jan 20, 2021

TODO for me next:

  • quenya, dicistro, satellites CS assemblies into contigs/
  • update access data release page

@taltman
Collaborator

taltman commented Jan 22, 2021

The README.md in the top-level of lovelywater is out-of-sync with the bucket directory structure.

@ababaian
Owner Author

Most recent version is always on the Data Access Page

@taltman
Collaborator

taltman commented Jan 22, 2021

That page is also inconsistent. In Naming Conventions, it uses as an example, s3://lovelywater/contig/SRA123456.fa. In the Folder Organization section, there is no such folder contig, and there is no such directory in the bucket (as far as I can see).

@ababaian
Owner Author

ababaian commented Jan 22, 2021

The assembly data has not been migrated yet; once that's done, it closes this issue.

edit: updated the access page to reflect situation on the ground

@rchikhi
Collaborator

rchikhi commented Feb 27, 2021

Satellites assemblies have been migrated to s3://serratus-rayan/lovelywater/assembly/contigs, i.e. the same location as the other CoV assembly data.
For some reason, I can't find the satellites' scaffolds.fasta files; only the gene_clusters.fasta files are present. I suspect I never copied scaffolds.fasta to S3 (likely due to a past bug that has recently been fixed), and it's likely that we were only interested in gene_clusters.fasta during the satellite analysis.

@ababaian
Owner Author

c'est la vie. Is this the complete collection of assemblies then?

@rchikhi
Collaborator

rchikhi commented Feb 27, 2021

nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over

@rchikhi
Collaborator

rchikhi commented Feb 28, 2021

done! dicistro, quenya, satellites assemblies are copied.

total number of accessions assembled in s3://serratus-rayan/lovelywater/assembly/contigs: 56,071
total size of s3://serratus-rayan/lovelywater/: 4.9 TB
scaffolds from CoV assemblies (MFC-compressed): 0.9 TB
scaffolds from other assemblies (gzip-compressed): 0.2 TB
assembly graphs (gzip-compressed): 1.6 TB
(These could be deleted, but keeping them would make it possible to quickly regenerate assemblies, e.g. after a coronaSPAdes update, or to recover the missing scaffolds.fasta files)

Darth annotations of checkv-filtered gene_clusters (gzip-compressed): 2.0 TB
Some of those somehow made their way to the contigs/ folder. Among these, some contain a huge BAM file of reads aligned to contigs, hence the space usage. This was needed for quality control. They could be deleted, as for each of those there is another gzip file without the BAM file. Two options:

  1. delete the large BAM-containing Darth archives and move the small ones into the annotation/ folder
  2. keep everything and move all Darth stuff to the annotation/ folder

Any preference?

@rchikhi
Collaborator

rchikhi commented Feb 28, 2021

Also there is the 1k subset of accession assemblies found by the .pro analysis, wanna include it?

@ababaian
Owner Author

yes

@rchikhi
Collaborator

rchikhi commented Feb 28, 2021

1ksubset: migration done

@rchikhi
Collaborator

rchikhi commented Mar 1, 2021

after some Slack discussions:

  • darth data inside contigs/ has been deleted, as it's mainly redundant with the data already in annotation/, except for huge BAM files.
  • serratax/serraplace stuff inside contigs/ has been moved to annotation/

so I think we're done

@rchikhi
Collaborator

rchikhi commented Mar 1, 2021

hold on, i'll also move checkV analysis from contigs/ to annotation/

@rchikhi
Collaborator

rchikhi commented Mar 2, 2021

done! Here's the final content of

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments

as staged in s3://serratus-rayan/lovelywater/assembly/.

assembly/cov:

These are the 11,120 coronavirus assemblies made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus' manuscript for more details.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes.
Depending on the assembler, a subset of these files will be present for each accession.
Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
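
For programmatic access, the naming scheme above can be split mechanically. This is a minimal sketch: `AssemblyFile`, `parse_contigs_name` and `holds_scaffolds` are hypothetical names, and it makes no assumption about how the [assembler] token is spelled in the actual file names.

```python
from typing import NamedTuple


class AssemblyFile(NamedTuple):
    accession: str  # e.g. "SRR123456"
    assembler: str  # whatever token sits in the [assembler] slot
    kind: str       # the remainder, e.g. "scaffolds.fasta.gz"


def parse_contigs_name(filename: str) -> AssemblyFile:
    """Split `<accession>.<assembler>.<kind>` on the first two dots."""
    parts = filename.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not an <accession>.<assembler>.<kind> name: {filename}")
    return AssemblyFile(*parts)


def holds_scaffolds(f: AssemblyFile) -> bool:
    """Per the note above, contigs.fa.mfc also carries scaffold sequences."""
    return f.kind in ("scaffolds.fasta.gz", "contigs.fa.mfc")
```

Treating contigs.fa.mfc as a scaffolds file in one place avoids repeating the naming caveat in every downstream script.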

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz

serraplace (taxonomic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz
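
Since annotation/ mixes the outputs of several programs, a small classifier can help when iterating over it. The sketch below guesses the producing tool from the file names listed above; the function names are hypothetical, and the `.fa.` test for cov/ annotations is inferred from those listings rather than from any documented rule.

```python
TOOLS = ("checkv", "serraplace", "serratax", "darth")


def annotation_tool(filename: str) -> str:
    """Guess which program produced an annotation file from its name."""
    for tool in TOOLS:
        # Dots on both sides keep "checkv_filtered" from counting as CheckV output.
        if f".{tool}." in filename:
            return tool
    return "unknown"


def annotates_cov(filename: str) -> bool:
    """Names of the form `<accession>.fa.<tool>...` annotate the assemblies in cov/."""
    return filename.split(".", 2)[1:2] == ["fa"]
```

For example, `annotation_tool("SRR123456.fa.darth.tar.gz")` gives `"darth"`, and `annotates_cov` returns `True` for the same name.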

@ababaian
Owner Author

ababaian commented Mar 2, 2021

I'll begin data migration shortly!

@ababaian
Owner Author

ababaian commented Mar 4, 2021

Take a look at s3://lovelywater/assembly/ and let me know if that looks alright.

Also updated the

If that looks good then close this baby!

@taltman
Collaborator

taltman commented Mar 10, 2021

What's the status on this? Should I be pulling data from s3://serratus-rayan/lovelywater/assembly/cov/ or s3://lovelywater/assembly/cov/?

@ababaian
Owner Author

Either is fine; they are identical. Migration is now complete. I think we're good to close this @rchikhi

@rchikhi
Collaborator

rchikhi commented Mar 10, 2021

Same number of files and size as my folder, looks good

Total Objects: 671859
   Total Size: 3.2 TiB
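
Matching object counts and total size is a reasonable sanity check; a per-key comparison is stricter. The sketch below assumes two hypothetical listings (key relative to the folder, mapped to size in bytes), such as could be parsed from `aws s3 ls --recursive` output. It does not compare checksums.

```python
from typing import Dict, List, Tuple

Listing = Dict[str, int]  # object key (relative to the folder) -> size in bytes


def summarize(listing: Listing) -> Tuple[int, int]:
    """Return (total objects, total bytes), the two figures quoted above."""
    return len(listing), sum(listing.values())


def compare(staged: Listing, migrated: Listing) -> Dict[str, List[str]]:
    """List keys that are missing, unexpected, or differ in size after migration."""
    shared = set(staged) & set(migrated)
    return {
        "missing": sorted(set(staged) - set(migrated)),
        "extra": sorted(set(migrated) - set(staged)),
        "size_mismatch": sorted(k for k in shared if staged[k] != migrated[k]),
    }
```

Two folders can agree on count and total size while still differing on individual keys, which is the case `compare` is meant to catch.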

@rchikhi rchikhi closed this as completed Mar 10, 2021
@rchikhi
Collaborator

rchikhi commented Jun 25, 2021

so, this issue is closed, yet I noticed today that we never deleted anything from the original location s3://serratus-public/assemblies (though the staged location s3://serratus-rayan/lovelywater was correctly cleared). The original location still contains all the migrated data plus some other less useful, non-migrated accessions, such as those with partially failed assemblies, a few minia assemblies that coronaspades didn't assemble, etc. I see 48268 coronaspades assemblies on lovelywater and 51756 coronaspades folders on serratus-public (possibly empty in some cases).
@ababaian, a few options:

  1. delete from s3://serratus-public/assemblies only the migrated stuff
  2. delete everything from s3://serratus-public/assemblies
  3. keep s3://serratus-public/assemblies for some reason

I'd go for 1)
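
Option 1 amounts to a set intersection between what sits in the original location and what was migrated. A minimal sketch, assuming the accession lists have already been extracted from both buckets (`plan_cleanup` is a hypothetical name):

```python
from typing import Dict, Iterable, List


def plan_cleanup(source: Iterable[str], migrated: Iterable[str]) -> Dict[str, List[str]]:
    """Split source accessions into those safe to delete and those needing review."""
    source_set, migrated_set = set(source), set(migrated)
    return {
        "delete": sorted(source_set & migrated_set),  # present in both: migrated
        "review": sorted(source_set - migrated_set),  # never migrated: decide by hand
    }
```

With the counts quoted above, and if every migrated assembly has a folder in the original location, 51756 − 48268 = 3488 folders would land in the review group.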

@ababaian
Owner Author

One consideration is that serratus-public currently has version control, so you have to do a two-pass deletion (delete the file, then delete its history) to remove data. We do need to do this, but I've been delaying until the paper is "done" so we don't whoopsy and lose some data we need. My take: let's go with (2) once the paper is done. I'll reopen the issue.
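
In a versioned bucket, a plain delete only adds a delete marker; permanently removing data means deleting every version and marker by its version ID. The sketch below is one possible way to prepare that second pass. It assumes the response shape of boto3's `list_object_versions` (the `Versions` and `DeleteMarkers` lists) and builds payloads for `delete_objects`, which accepts up to 1000 keys per request. It makes no network calls itself, and `delete_batches` is a hypothetical helper name.

```python
from typing import Dict, Iterator


def delete_batches(page: Dict, batch_size: int = 1000) -> Iterator[Dict]:
    """Turn one list_object_versions page into Delete payloads for delete_objects."""
    entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
    objects = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
    for start in range(0, len(objects), batch_size):
        # Each payload removes specific versions, so nothing is left in the history.
        yield {"Objects": objects[start:start + batch_size], "Quiet": True}
```

With boto3 this could be driven by the `list_object_versions` paginator restricted to the assemblies/ prefix, passing each batch as the `Delete` argument of `delete_objects`. Printing the batches as a dry run first seems prudent given the concern about losing data.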

@ababaian ababaian reopened this Jun 29, 2021