We securely reanalyzed TCGA in the cloud with Amazon Elastic MapReduce. Reproducing our TCGA runs requires dbGaP authorized access, a CGC account, and an AWS account set up for runs of Rail-RNA on dbGaP-protected data. If you already have an AWS account, see our documentation for instructions on preparing the account for analysis of dbGaP-protected data.
1. Clone the repo and change to this directory at the command line. Then run `python tcga_file_list.py >tcga_file_list.tsv` to obtain a list of file paths from a SPARQL query of CGC. See the docstring of `tcga_file_list.py` for its requirements. The user's list may be different from the list we obtained when we performed the query (9/29/2016). Our list is `tcga_file_list.tsv`, and the user may skip this step and simply use our file, assuming all file paths on the CGC are the same.
2. Download and install Rail-RNA v0.2.4a. Set Rail-RNA up for analyzing dbGaP-protected data by following the instructions at http://docs.rail.bio/dbgap/.
3. Use the Python script `gen.py` to regenerate all the Rail-RNA manifest files (`*.manifest`) in this directory as well as scripts that run Rail-RNA to preprocess (`prep_tcga_batch_*.sh`) and align (`align_tcga_batch_*.sh`) TCGA data on Amazon Elastic MapReduce. Refer to `gen.py`'s docstring for the precise command to execute; be sure to change the output bucket on S3. The script divides TCGA into 30 batches, each with about 380 randomly selected samples. Each batch has its own Rail-RNA manifest file, preprocess script, and alignment script. Note that `gen.py` requires an authorization token provided by CGC. To obtain one, sign up for a CGC account, confirm dbGaP authorized access to TCGA, generate a token using the CGC web interface, and store it in a text file locally.
4. For each batch `b` (a number between 0 and 29 inclusive), run `sh prep_tcga_batch_b.sh`, wait for the Rail-RNA preprocess job on Elastic MapReduce to finish successfully, and then run `sh align_tcga_batch_b.sh`.
5. Download all results from the output bucket on S3 you chose in step 3 to a dbGaP-compliant local cluster using either the AWS CLI or the console.
6. Run `sh tcga_query.sh` to ultimately obtain `all_cgc_metadata.tsv.gz`.
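
The per-batch ordering described above (preprocess, wait for the Elastic MapReduce job to succeed, then align) can be sketched as a small Python driver. This is only an illustration and not part of the repo: `batch_scripts`, `run_script`, `confirm_job_finished`, and `run_batch` are hypothetical names. The sketch assumes the generated scripts return once the job has been launched, so the wait is a manual confirmation rather than an automated status check.

```python
import subprocess
from typing import Callable, List, Tuple

NUM_BATCHES = 30  # batches are numbered 0 through 29 inclusive


def batch_scripts(batch: int) -> Tuple[str, str]:
    """Return the (preprocess, align) script names for one batch."""
    if not 0 <= batch < NUM_BATCHES:
        raise ValueError(f"batch must be in 0..{NUM_BATCHES - 1}, got {batch}")
    return f"prep_tcga_batch_{batch}.sh", f"align_tcga_batch_{batch}.sh"


def run_script(script: str) -> None:
    """Run one generated script with sh; raise if it exits with a nonzero status."""
    subprocess.run(["sh", script], check=True)


def confirm_job_finished(script: str) -> None:
    """Hypothetical wait step: pause until the user confirms the job succeeded."""
    input(f"Press Enter once the job launched by {script} has finished successfully: ")


def run_batch(
    batch: int,
    run: Callable[[str], None] = run_script,
    wait: Callable[[str], None] = confirm_job_finished,
) -> List[str]:
    """Run preprocessing, wait for it to finish, then run alignment for one batch."""
    prep, align = batch_scripts(batch)
    run(prep)
    wait(prep)  # alignment must not start before preprocessing has succeeded
    run(align)
    return [prep, align]
```

For example, `run_batch(0)` runs `sh prep_tcga_batch_0.sh`, pauses until you confirm that the preprocess job succeeded, and then runs `sh align_tcga_batch_0.sh`. The `run` and `wait` parameters exist so the confirmation prompt can be swapped for whatever job-status check suits your setup.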