If anything below doesn't make sense, ask us questions: .
Redoing GTEx Rail-RNA runs on Amazon Elastic MapReduce
1. Install Rail-RNA v0.2.1, which is available for download here.
2. Follow the instructions here to set up an Amazon Web Services (AWS) account with an Identity and Access Management (IAM) user configured to analyze dbGaP data securely. Name the CloudFormation stack dbgap-1 rather than dbgap, the name those instructions recommend. The secure bucket name created with the CloudFormation template is referenced as s3://gtex-bucket here.
3. To facilitate submitting job flows in multiple availability zones of the US Standard region (i.e., us-east-1), create three more CloudFormation stacks as described here, except with this CloudFormation template, which requires specification of an availability zone for the public subnet into which the Rail-RNA Elastic MapReduce cluster will be launched. Choose from among us-east-1a, us-east-1b, us-east-1c, us-east-1d, and us-east-1e, and name the stacks dbgap-2, dbgap-3, and dbgap-4.
4. Download the dbGaP repository key granting access to GTEx data. It should have the extension .ngc and is referenced as /path/to/dbgap/key.ngc here.
5. Run
python gen.py --s3-bucket s3://gtex-bucket --prep-stack-names <one or more of the dbgap-* stack names above separated by spaces> --align-stack-names <one or more of the dbgap-* stack names above separated by spaces> --dbgap-key /path/to/dbgap/key.ngc
to generate scripts for preprocessing and aligning GTEx data. The GTEx RNA-seq data is partitioned into 30 batches, and 60 scripts are generated: 30 for preprocessing and 30 for aligning.
6. Run the scripts generated in the previous step to submit job flows to Elastic MapReduce. Each prep_gtex_batch_k.sh file for k between 0 and 29 inclusive should be run and its job flow completed before the corresponding align_gtex_batch_k.sh is run to align data preprocessed and uploaded to S3. It is recommended that only three preprocessing job flows be submitted at a time. If Elastic MapReduce complains that there aren't enough IPs in the subnet of a VPC in a given availability zone to launch more job flows, tweak the shell scripts to change the argument of --stack-name in the rail-rna command as necessary.
7. Use this script to download all results from S3 to local storage. Command-line parameters are described in its comments.
8. Compute the total number of reads across samples with the script total.sh. Its only command-line parameter is the local GTEx output directory specified in the previous step, which is where all analysis results should have been dumped. The figure we obtained was 896,466,227,499. Here, "read" refers to a mate for paired-end samples.
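
The ordering constraints in step 6 can be sketched in Python. This is only an illustration, not part of the repository: the function names submission_plan and retarget_stack are hypothetical, and the sketch assumes the generated scripts pass the stack as `--stack-name <name>` or `--stack-name=<name>`. It groups the 30 batches into waves of three and lists each wave's prep scripts before its align scripts, which is one conservative ordering that satisfies "prep before align"; it does not submit anything or wait for job flows to finish.

```python
import re
from typing import List, Tuple

NUM_BATCHES = 30  # batches k = 0..29, as stated in step 6
WAVE_SIZE = 3     # recommended maximum number of concurrent preprocessing job flows


def submission_plan(num_batches: int = NUM_BATCHES,
                    wave_size: int = WAVE_SIZE) -> List[Tuple[str, List[str]]]:
    """Return (stage, scripts) pairs; each wave's prep scripts precede its align scripts."""
    plan = []
    for start in range(0, num_batches, wave_size):
        wave = range(start, min(start + wave_size, num_batches))
        plan.append(("prep", [f"prep_gtex_batch_{k}.sh" for k in wave]))
        plan.append(("align", [f"align_gtex_batch_{k}.sh" for k in wave]))
    return plan


def retarget_stack(script_text: str, new_stack: str) -> str:
    """Replace the value following --stack-name (assumed argument format) with new_stack."""
    return re.sub(r"(--stack-name[ =])\S+",
                  lambda m: m.group(1) + new_stack,
                  script_text)
```

Printing each entry of submission_plan() gives a checklist to work through by hand. Call retarget_stack on the text of a generated script only after confirming how that script actually spells its --stack-name argument.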
This figure was generated using Keynote; see security_figure.key.
Run the Mathematica 10 notebook rail_dbgap_plots.nb. It uses costs.csv, which contains costs downloaded from the AWS Cost Explorer, as well as activity.tsv, which has start and end times of all GTEx preprocess and align job flows. The file activity.tsv was generated with reconstruct_activity.py from the saved Elastic MapReduce web interface HTML files in logs/. If you don't have Mathematica, check rail_dbgap_plots.pdf for its output.
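
For readers without Mathematica who want a rough check on the job-flow timings, here is a minimal Python sketch that sums job-flow durations from a tab-separated file. The layout of activity.tsv is not described here, so everything about the format is an assumption: a header row, hypothetical column names start and end, and ISO 8601 timestamps. Adjust these to the actual file before use.

```python
import csv
from datetime import datetime
from typing import Iterable


def total_job_flow_hours(lines: Iterable[str],
                         start_col: str = "start",  # hypothetical column name
                         end_col: str = "end") -> float:  # hypothetical column name
    """Sum (end - start) over all rows of a TSV with a header, in hours."""
    reader = csv.DictReader(lines, delimiter="\t")
    total_seconds = 0.0
    for row in reader:
        # Assumes ISO 8601 timestamps such as 2016-01-01T00:00:00.
        start = datetime.fromisoformat(row[start_col])
        end = datetime.fromisoformat(row[end_col])
        total_seconds += (end - start).total_seconds()
    return total_seconds / 3600.0
```

The function accepts any iterable of lines, so an open file handle can be passed directly.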