GDC Case Discover

GDC Case Discover is an update to the CPTAC3-specific implementation, CPTAC3 Case Discover.

Uses python-based JSON parser
Revised aliquot annotation parsing
Generates Catalog v3 files defined here

Query GDC to discover sequence and methylation data and write it to a catalog file

Quick start

Obtain a token from GDC and save it to a file, e.g. gdc-user-token.txt
- Make it available as an environment variable with: export GDC_TOKEN=gdc-user-token.txt
git clone --recurse-submodules https://github.com/ding-lab/CPTAC3.case.discover PROJECT_NAME
Edit 1_process_all.sh
Run bash 1_process_all.sh

Discuss v3.0 Catalog file

User's Manual

Installation

Other packages which need to be installed:

python and json library; these typically come installed in a developer environment.
- TODO provide explicit instructions
jq : see here for installation instructions.

Note that bashids is also used, but this is installed during git clone as a submodule.

Usage

Project configuration

All CPTAC3.case.discover code can be obtained with,

git clone https://github.com/ding-lab/CPTAC3.case.discover PROJECT_NAME

Edit README.project.md to provide project-specific descriptions. This file is typically not committed to git.

The file 1_process_all.sh defines several locale-specific environment variables and paths, and must be edited as appropriate.

Create file dat/cases.dat listing all cases to be processed.

Obtaining GDC token

All queries require a GDC authorization token, as described here.

Log in to GDC Data Submission Portal
Download token, and save it to some filename, e.g. gdc-user-token.txt.
Update GDC_TOKEN in 1_process_all.sh accordingly
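
Before launching a run, it can help to confirm that GDC_TOKEN points at a readable, non-empty file. The following is a minimal Python sketch of such a check. The helper name read_gdc_token is hypothetical and not part of this repository, and the sketch assumes GDC_TOKEN holds the path to the token file, as described above.

```python
import os
from pathlib import Path


def read_gdc_token(env=None):
    """Return the token string stored in the file named by GDC_TOKEN.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    Illustrative helper only; it is not part of the repository.
    """
    env = os.environ if env is None else env
    token_path = env.get("GDC_TOKEN")
    if not token_path:
        raise RuntimeError("GDC_TOKEN is not set")
    path = Path(token_path)
    if not path.is_file():
        raise RuntimeError(f"Token file not found: {path}")
    token = path.read_text().strip()
    if not token:
        raise RuntimeError(f"Token file is empty: {path}")
    return token
```

Calling read_gdc_token() with no arguments reads the current environment and raises RuntimeError with a short message if the token cannot be found.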

File output format

Catalog 3 file format

Catalog file

Catalog file columns:

dataset_name - ad hoc name for this file, generated for convenience and consistency, formerly called sample_name
case - Unique ID of participant
disease - Disease code
experimental_strategy - WGS, WXS, RNA-Seq, miRNA-Seq, Methylation Array, Targeted Sequencing
specimen_type - short name for sample_type: blood_normal, tissue_normal, tumor, buccal_normal, tumor_bone_marrow, tumor_peripheral_blood
specimen_name - Label of biospecimen which was sequenced
filename - This does not include a path
filesize - Size in bytes as shown by ls -l
data_format - BAM, FASTQ, IDAT
data_variety - additional data descriptors
- "chimeric", "genomic", "transcriptome" for RNA-Seq BAMs,
- "Red" or "Green" for Methylation Array
- "NA" otherwise
alignment - Description of reference or alignment status
project - General administrative category of dataset, e.g., "HTAN"
uuid - Unique identifier of this dataset. Mandatory, must be unique
md5 - Fingerprint of data for validation purposes, as provided by data generator (sequencing lab)
metadata - JSON string providing additional details about this dataset
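
As an illustration of how these columns might be consumed, here is a minimal Python sketch that reads a catalog and decodes the metadata column. It assumes the catalog is tab-separated with a header row naming the columns above; that layout is an assumption and should be checked against a real catalog file. The function name read_catalog and the example row are made up for illustration.

```python
import csv
import io
import json


def read_catalog(handle):
    """Yield one dict per catalog row, with `metadata` decoded from JSON.

    Assumes a tab-separated file whose first line is a header naming the
    columns (an assumption; verify against a real catalog file).
    """
    # QUOTE_NONE keeps the double quotes inside the JSON metadata intact.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["filesize"] = int(row["filesize"])
        row["metadata"] = json.loads(row["metadata"]) if row["metadata"] else {}
        yield row


# Made-up row using only a subset of the columns, for illustration.
example = io.StringIO(
    "dataset_name\tcase\tfilesize\tuuid\tmetadata\n"
    'demo.WGS.T\tCASE-1\t1024\tuuid-0001\t{"note": "illustrative"}\n'
)
rows = list(read_catalog(example))
```

Since uuid is mandatory and must be unique, a natural follow-up check is to compare len(rows) against the number of distinct uuid values.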

Dataset names

Dataset names are ad hoc names we generate for convenience. They indicate the case, experimental strategy, sample type, whether data are harmonized (hg38) and any aliquot annotation codes. Examples include,

See Catalog v3 file

See Heterogeneity Studies below for information about labels like HET_qZq3G.

Sample types

The sample_type column lists GDC sample types. We abbreviate these names in the sample name and short_sample_type column respectively as,

Blood Derived Normal: N, blood_normal
Buccal Cell Normal: Nbc, buccal_normal
Tumor, Primary Tumor, 'Additional - New Primary': T, tumor
Primary Blood Derived Cancer - Bone Marrow: Tbm, tumor_bone_marrow
Primary Blood Derived Cancer - Peripheral Blood: Tpb, tumor_peripheral_blood
Solid Tissue Normal: A, tissue_normal
Recurrent Tumor: R, recurrent_tumor
Metastatic: M, metastatic
"FFPE Scrolls" and "FFPE Recurrent": F, ffpe
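
The mapping above can be expressed as a lookup table. The Python sketch below is illustrative only and is not taken from the repository; the names SAMPLE_TYPE_ABBREV and abbreviate_sample_type are hypothetical. For the FFPE entries it assumes F is the sample-name code and ffpe the short sample type, matching the order used for the other entries.

```python
# GDC sample type -> (code used in the sample name, short sample type)
SAMPLE_TYPE_ABBREV = {
    "Blood Derived Normal": ("N", "blood_normal"),
    "Buccal Cell Normal": ("Nbc", "buccal_normal"),
    "Tumor": ("T", "tumor"),
    "Primary Tumor": ("T", "tumor"),
    "Additional - New Primary": ("T", "tumor"),
    "Primary Blood Derived Cancer - Bone Marrow": ("Tbm", "tumor_bone_marrow"),
    "Primary Blood Derived Cancer - Peripheral Blood": ("Tpb", "tumor_peripheral_blood"),
    "Solid Tissue Normal": ("A", "tissue_normal"),
    "Recurrent Tumor": ("R", "recurrent_tumor"),
    "Metastatic": ("M", "metastatic"),
    # Assumed order for FFPE: "F" is the code, "ffpe" the short name.
    "FFPE Scrolls": ("F", "ffpe"),
    "FFPE Recurrent": ("F", "ffpe"),
}


def abbreviate_sample_type(sample_type):
    """Return (code, short_name) for a GDC sample type, or raise ValueError."""
    try:
        return SAMPLE_TYPE_ABBREV[sample_type]
    except KeyError:
        raise ValueError(f"Unknown sample type: {sample_type!r}") from None
```

Raising on unknown sample types, rather than returning a default, makes it obvious when GDC introduces a sample type that the table does not yet cover.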

Heterogeneity Studies and duplicates

GDC provides annotations associated with aliquots which contain additional context regarding cases with multiple tumor samples. This information is stored in the field aliquot_annotation and is used to generate a convenient label used in the sample metadata and sample name fields.

If aliquot_annotation is defined for a given data file, we generate a sample label consisting of a label prefix followed by an ID code. For CPTAC3, an example sample label may be HET_qZq3G, where the prefix HET indicates heterogeneity and the ID code is qZq3G. This code is a hash ID generated with bashids, where the input numerical string is obtained from the aliquot name (CPT0000650008) with "CPT" and any leading 0's removed. The sample label is used in the sample_name and sample_metadata fields.
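
The numeric input to the hash and the final label format can be sketched in Python as follows. This is an illustration, not the repository's code: the function names are hypothetical, and the hashing step itself (done with bashids in this project) is deliberately not reproduced, so the ID code is passed in as an argument.

```python
def aliquot_numeric_id(aliquot_name):
    """Strip the "CPT" prefix and leading zeros, e.g. CPT0000650008 -> 650008."""
    if not aliquot_name.startswith("CPT"):
        raise ValueError(f"Expected aliquot name starting with CPT: {aliquot_name!r}")
    digits = aliquot_name[len("CPT"):].lstrip("0")
    if not digits.isdigit():
        raise ValueError(f"No numeric part found in aliquot name: {aliquot_name!r}")
    return int(digits)


def sample_label(prefix, id_code):
    """Join a label prefix and an ID code, e.g. ("HET", "qZq3G") -> "HET_qZq3G"."""
    return f"{prefix}_{id_code}"
```

For the example in the text, aliquot_numeric_id("CPT0000650008") gives 650008, which is the number passed to the hash, and sample_label("HET", "qZq3G") gives HET_qZq3G.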

The table below lists all known GDC aliquot annotations and the prefix used to generate the sample label.

TODO: Update this to Catalog3

alq_code.loc[dup] = "DUP"   # "duplicate item"
alq_code.loc[add] = "ADD"   # "additional"
alq_code.loc[rep] = "REP"   # "replacement"

Aliquot annotation	Label prefix
Additional DNA Distribution - Additional aliquot	ADD
BioTEXT_RNA	BIOTEXT
Duplicate item: Additional DNA for PDA Deep Sequencing	DEEP
Duplicate item: Additional DNA requested	ADNA
Duplicate item: Additional RNA requested	ARNA
Duplicate item: CCRCC Tumor heterogeneity study	HET
Duplicate Item: CHOP GBM Duplicate Primary Tumor DNA Aliquot	ADNA
Duplicate Item: CHOP GBM Duplicate Primary Tumor RNA Aliquot	ADNA
Duplicate Item: CHOP GBM Duplicate Recurrent Tumor DNA Aliquot	ADNA
Duplicate Item: CHOP GBM Duplicate Recurrent Tumor RNA Aliquot	ADNA
Duplicate item: No new shipment/material. DNA aliquot resubmission for Broad post-harmonization sequencing and sample type mismatch correction.	RDNA
Duplicate item: PDA BIOTEXT DNA	BIOTEXT
Duplicate item: PDA Pilot - bulk-derived DNA	BULK
Duplicate item: PDA Pilot - core-derived DNA	CORE
Duplicate item: Replacement DNA Distribution - original aliquot failed	RDNA
Duplicate item: Replacement RNA Aliquot	RRNA
Duplicate item: Replacement RNA Distribution - original aliquot failed	RRNA
Duplicate item: UCEC BioTEXT Pilot	BIOTEXT
Duplicate item: UCEC LMD Heterogeneity Pilot	LMD
Original DNA Aliquot	ODNA
Replacement DNA Aliquot	RDNA
This entity was not yet authorized to be released by the submitters	UNAV
unknown	UNK

All this is outdated

Demographics

The following clinical information is recorded in the file dat/PROJECT.Demographics.dat for each case:

* case
* disease
* ethnicity
* gender
* race
* days to birth

Catalog Summary Files

Catalog summary files provide a one-line representation of data available for a given case on GDC. Following case and disease, each column represents a particular data type, and one-letter codes T, N, A indicate availability of tumor, blood normal, and tissue adjacent normal samples, respectively. Repeated codes indicate repeated data files.

Example

C3L-00001   LUAD        WGS.hg19 T N A      WXS.hg19 T N A      RNA.fq TT  AA       miRNA.fq T  A       WGS.hg38 T N A      WXS.hg38 T N A      RNA.hg38 TTT  AAA       miRNA.hg38 T  A     MethArray TT  AA

This line indicates that LUAD case C3L-00001 has tumor, blood normal, and adjacent normal samples for WGS and WXS data as submitted (hg19); tumor and adjacent normal RNA-Seq data (TT, AA because FASTQ data comes in pairs); and tumor and adjacent normal miRNA data in FASTQ format. All of these are available as harmonized hg38 WGS and WXS, and harmonized hg38 RNA-Seq chimeric, genomic, and transcriptome BAMs are available for tumor and adjacent normal. Methylation array data for tumor and adjacent normal tissue are also available (Green and Red channel for each).
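
To make the layout concrete, here is a minimal Python sketch that splits a summary line into case, disease, and per-data-type codes. It is not part of the repository, and the function name parse_summary_line is hypothetical. It assumes whitespace-separated fields and that availability codes consist only of the letters T, N, and A described above; any other token is treated as a data type label.

```python
import re

# A code token is one or more of the letters T, N, A (e.g. "T", "AA", "TTT").
CODE_RE = re.compile(r"^[TNA]+$")


def parse_summary_line(line):
    """Return (case, disease, {data_type: [codes, ...]}) for one summary line."""
    tokens = line.split()
    case, disease = tokens[0], tokens[1]
    data = {}
    current = None
    for tok in tokens[2:]:
        if current is not None and CODE_RE.match(tok):
            data[current].append(tok)  # code belonging to the current data type
        else:
            current = tok              # start of a new data type column
            data[current] = []
    return case, disease, data


line = (
    "C3L-00001   LUAD        WGS.hg19 T N A      WXS.hg19 T N A      "
    "RNA.fq TT  AA       miRNA.fq T  A       WGS.hg38 T N A      "
    "WXS.hg38 T N A      RNA.hg38 TTT  AAA       miRNA.hg38 T  A     "
    "MethArray TT  AA"
)
case, disease, data = parse_summary_line(line)
```

Because the sketch splits on whitespace, it cannot tell which sample type is absent from its column position alone; the letters in each code carry that information instead.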

Exon target capture info

The intermediate files cases/*/read_groups.dat capture the target_capture_kit_target_region field of each read group, which is used for exome analysis. Currently the only value observed (apart from null and "Not Applicable") is,

http://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexterarapidcapture/nexterarapidcapture_exome_targetedregions_v1.2.bed

Processing details

Workflow

The processing workflow and call hierarchy proceed as follows:

1_process_all.sh
- All project-specific definitions take place here
- Calls src/process_multi_cases.sh, which
  - Iterates over cases file
  - Calls src/process_case.sh for each case
  - src/process_case.sh calls the following:
    - src/get_aliquots.sh
    - src/get_read_groups.sh
    - src/get_harmonized_reads.sh
    - src/get_methylation_array.sh
    - src/make_catalog.sh
    - src/get_demographics.sh
  - Collects catalog files to write project catalog file
  - Collects demographics files to write project demographics file

queryGDC documentation includes additional information about GDC queries and other useful links.

Support

Please contact Matt Wyczalkowski m.wyczalkowski@wustl.edu with questions and bug reports.
