NCI-GDC/gdc-workflow-overview

Overview of GDC Harmonization Workflows

Description

  • This repository gives an overview of the major GDC harmonization workflows.
  • The GDC workflow repositories have been tested on GDC data and in the particular environment in which the GDC runs.
  • Current GDC production workflows are running in the GDC Pipeline Automation System (GPAS).
  • Most GDC workflows are developed using Common Workflow Language (CWL) with Dockerized tools.
  • Please check GDC Documentation for more details.

For external users

The GDC workflow repositories have been tested on GDC data and in the particular environment in which the GDC runs.

  • We have created an external CWL entrypoint in some workflows. For the others, you are expected to modify the workflow before using it in your own system.
  • Some of the reference data required to run the workflows are hosted in GDC reference files. We cannot share some other reference files due to licensing issues.
    • In particular, we cannot share target or bait BED files or interval lists for any specific Target Capture Kit. You are encouraged to contact the vendors or project owners to obtain these files.
  • We are not able to share Docker images with external users due to licensing issues. You are welcome to build your own images using the Dockerfiles provided.
  • For any questions related to GDC data, please contact the GDC Help Desk at support@nci-gdc.datacommons.io.

Production Workflows Include

DNA-Seq Alignment

  • BWA-based alignment workflow for Whole Exome Sequencing (WXS), Whole Genome Sequencing (WGS), Targeted Sequencing, and some other DNA-Seq experimental strategies. The workflow takes either BAM or FASTQ files as input, performs read mapping and the optional steps of Base Quality Score Recalibration (BQSR), Indel Realignment, and MarkDuplicates, and outputs a sorted BAM file, a BAM index file, and various QC metrics.
  • Main CWL: https://github.com/NCI-GDC/gdc-dnaseq-cwl

RNA-Seq Alignment

  • STAR-based RNA-Seq alignment workflow that takes either BAM or FASTQ files as input, and generates 3 BAMs (Genome Aligned BAM, Transcriptome Aligned BAM, Chimeric BAM), STAR Counts, and Splice Junction Quantifications. Of the 3 BAMs generated, the Transcriptome Aligned BAM is sorted by read name instead of by coordinate, so it is not accompanied by a BAM index file. In addition, the STAR Counts file contains quantification in 3 different ways: an unstranded mode and two different stranded modes.
  • Main CWL: https://github.com/NCI-GDC/gdc-rnaseq-cwl
  • Utility scripts: https://github.com/NCI-GDC/gdc-rnaseq-tool
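
To make the three quantification modes concrete, here is a minimal sketch of reading such a counts table and comparing the three columns. The column names (`gene_id`, `unstranded`, `stranded_first`, `stranded_second`), the example rows, and the convention of skipping summary rows whose ID starts with `N_` are assumptions made for illustration. Please check the header of your own STAR Counts file before relying on them. Picking the stranded column with the larger total is only a rough heuristic for guessing library strandedness, not a GDC procedure.

```python
import csv
import io

# Assumed (hypothetical) column names for the three quantification modes.
COUNT_COLUMNS = ("unstranded", "stranded_first", "stranded_second")


def column_totals(tsv_text):
    """Sum each count column over gene rows, skipping N_* summary rows."""
    totals = {name: 0 for name in COUNT_COLUMNS}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Summary rows such as N_unmapped are not genes; leave them out.
        if row["gene_id"].startswith("N_"):
            continue
        for name in COUNT_COLUMNS:
            totals[name] += int(row[name])
    return totals


def likely_stranded_column(totals):
    """Return the stranded column with the larger total (rough heuristic)."""
    return max(("stranded_first", "stranded_second"), key=totals.get)


# Tiny made-up table, for illustration only.
example = (
    "gene_id\tunstranded\tstranded_first\tstranded_second\n"
    "N_unmapped\t100\t100\t100\n"
    "GENE_A\t50\t2\t48\n"
    "GENE_B\t30\t1\t29\n"
)
totals = column_totals(example)
chosen = likely_stranded_column(totals)
```

If the two stranded totals are close to each other, the library is probably unstranded and the unstranded column is the safer choice.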

miRNA Alignment and Profiling

RNA-Seq HTSeq Quantification

WXS Variant Calling

WXS Variant Filtering

WGS Variant Calling

Tumor-only Variant Calling

Tumor-only Variant Filtering

VEP Variant Annotation

Mutation Annotation File (MAF)

One-off Workflows Include

SNP6 Segmentation

  • The workflow applies Circular Binary Segmentation to existing BirdSeed probe-level copy numbers, and generates copy number segmentation files and gene-level copy number TSVs.
  • Utility scripts: https://github.com/NCI-GDC/dnacopy-tool
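
For readers unfamiliar with segmentation, the sketch below illustrates the underlying idea on a toy series of copy number values. It is a deliberately simplified stand-in: it finds a single best change point using a two-sample mean-difference statistic, whereas Circular Binary Segmentation tests arcs of a circularized series, assesses significance by permutation, and recurses on the resulting segments. The function name and the input values are hypothetical and are not taken from the workflow.

```python
import math


def best_single_split(values):
    """Return (index, statistic) of the split that best separates two means.

    The statistic is |mean(left) - mean(right)| * sqrt(i * (n - i) / n),
    i.e. a two-sample t-like statistic with the noise scale left out.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to split")
    total = sum(values)
    best_index, best_stat = None, -1.0
    left_sum = 0.0
    for i in range(1, n):
        # After this update, left_sum covers values[0:i].
        left_sum += values[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        stat = abs(left_mean - right_mean) * math.sqrt(i * (n - i) / n)
        if stat > best_stat:
            best_index, best_stat = i, stat
    return best_index, best_stat


# Toy probe-level values: four probes near 0, then four probes near 1.
log2_ratios = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
split, stat = best_single_split(log2_ratios)
```

Here `split` marks the first probe of the second segment. A real analysis should use the DNAcopy-based tooling linked above rather than this sketch.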

Other Essential Data

Target Capture Kit

  • target_capture_kit is an enumerated property on the ReadGroup node; a value other than Unknown or Not Applicable is required for read groups associated with the WXS and Targeted Sequencing strategies.
  • Unfortunately, we cannot share the Target Capture Kit BED files publicly because of policy restrictions from some kit vendors. You can find the size (in bp) of each kit in this file (https://github.com/NCI-GDC/gdc-workflow-overview/blob/master/gdc_target_capture_kit_size.tsv) for Tumor Mutation Burden (TMB) analysis. Please note that some of these files arrive at the GDC in the hg19 reference build and some in the GRCh38 (hg38) reference build, so we labeled them in separate columns in the file.
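
As an illustration of how a kit size can feed a TMB estimate, the sketch below divides a mutation count by the kit size expressed in megabases. The function name and the example numbers are hypothetical. Which mutations to count (for example, only non-synonymous somatic variants) is an analysis decision that this sketch does not make for you, and the kit size should be taken from the column matching the reference build of your data.

```python
def tumor_mutation_burden(mutation_count, kit_size_bp):
    """Return mutations per megabase for a given capture kit size in bp."""
    if kit_size_bp <= 0:
        raise ValueError("kit_size_bp must be a positive number of base pairs")
    return mutation_count / (kit_size_bp / 1_000_000)


# Made-up example: 150 counted mutations over a 30 Mb capture region.
tmb = tumor_mutation_burden(150, 30_000_000)
```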
