Features/requirements for Be The Match collaboration #9

Open · 10 tasks

heuermh opened this issue May 8, 2017 · 3 comments
@heuermh (Member) commented May 8, 2017

For lack of a better place for this, our collaboration with Be The Match will require:

  • Download BAM files from s3, transform to ADAM Avro+Parquet, and upload to s3 (transform_alignments)
  • Download ADAM Avro+Parquet alignments for multiple samples from s3, update record groups to prevent collision, merge into a single multi-sample ADAM Avro+Parquet alignments data set, and upload to s3 (merge_alignments)
  • Report BAM file sizes, single sample ADAM Avro+Parquet alignments file sizes, and merged ADAM Avro+Parquet alignments file size
  • Download VCF files from s3, transform to ADAM Avro+Parquet variants and genotypes, and upload to s3 (transform_variants, transform_genotypes)
  • Download ADAM Avro+Parquet variants for multiple samples, merge into a single sites-only ADAM Avro+Parquet variants data set, and upload to s3 (merge_variants)
  • Download ADAM Avro+Parquet genotypes for multiple samples, merge into a single multi-sample ADAM Avro+Parquet genotypes data set, and upload to s3 (merge_genotypes)
  • Report VCF file sizes, single sample ADAM Avro+Parquet variants and genotypes file sizes, and merged ADAM Avro+Parquet variants and genotypes file sizes
  • Notebook with queries to compare native file via s3 vs. transformed via s3 access performance
  • Documentation on how to run this stuff
  • Short manuscript on transformation process, storage requirements, and access performance

There hasn't been an ask for realigning reads, recalling variants, annotating variants with SnpEff, or joint genotyping yet, but there could be in the near future.
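The reporting tasks above (comparing BAM and VCF sizes against their transformed ADAM Avro+Parquet counterparts) could start from a small stdlib sketch like the following. The function names and paths are illustrative, not part of any existing script; the directory walk reflects the fact that ADAM writes Avro+Parquet output as a directory of part files.

```python
import os


def total_size(path):
    """Total size in bytes of a file, or of all files under a directory
    (ADAM Avro+Parquet output is a directory of part files)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )


def report(label, source, transformed):
    """Print source size, transformed size, and their ratio for one sample."""
    src = total_size(source)
    dst = total_size(transformed)
    ratio = f"{dst / src:.2f}" if src else "-"
    print(f"{label}\t{src}\t{dst}\t{ratio}")
```

Run against the local staging copies after download, e.g. `report("sample1", "sample1.bam", "sample1.alignments.adam")`.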

@fnothaft (Member) commented May 9, 2017

This SGTM. WRT merge_genotypes, do they want to "square off" with no-calls?

@heuermh (Member, Author) commented May 9, 2017

> want to "square off" with no-calls?

Not sure, will ask when I get to that step.

> Notebook with queries ...

In a meeting this afternoon, they decided to use Apache Zeppelin on Amazon EMR for this use case.

With some clicking around we got ADAM installed in Zeppelin via its Maven Central coordinates. We still need to do a bit more digging to figure out where to set the Kryo Spark configuration parameters, and to create a separate EMR step for Conductor (we used s3-dist-cp).
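For reference, the Spark settings ADAM documents are the Kryo serializer and its registrator; on EMR these would presumably go in `spark-defaults.conf` or the Zeppelin Spark interpreter settings (where exactly is the open question here):

```properties
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  org.bdgenomics.adam.serialization.ADAMKryoRegistrator
```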

@heuermh (Member, Author) commented Sep 1, 2017

As an update:

All the transformations to ADAM Avro+Parquet have been run on EMR clusters, downloading from s3 to HDFS using conductor and uploading from HDFS to s3 using s3-dist-cp, driven by the bash scripts at https://github.com/heuermh/hook.
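Sketched as a per-sample command sequence (bucket names and paths are placeholders; the conductor invocation and flags are assumptions based on the description above, and the ADAM subcommand name has varied across releases, so check the hook scripts for the exact forms used):

```
# 1. stage the BAM from s3 into HDFS with conductor (invocation is illustrative)
conductor --src s3://bucket/sample.bam --dst hdfs:///staging/sample.bam

# 2. transform to ADAM Avro+Parquet alignments on the EMR cluster
adam-submit transformAlignments \
  hdfs:///staging/sample.bam \
  hdfs:///staging/sample.alignments.adam

# 3. push the Parquet output directory back to s3 with s3-dist-cp
s3-dist-cp --src hdfs:///staging/sample.alignments.adam \
  --dest s3://bucket/sample.alignments.adam
```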

Notebooks have been implemented in Zeppelin and RStudio on EMR.

The conversation about merging samples into larger data sets has not happened yet.

fnothaft pushed a commit to fnothaft/workflows that referenced this issue Sep 7, 2017