Skip to content
This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Guide to Importing and Analyzing External Data

Obi Griffith edited this page Mar 23, 2015 · 24 revisions

To demonstrate that the GMS can be run on external data we will import some completely new data from a public repo and run through the GMS pipelines. To proceed you must have an installed version of the GMS (see Installation Types for options) and have already run the ./setup/prime-system.pl command as documented in the installation instructions or use the Pre-configured Virtual Machine. We completed the following tutorial on an AWS instance of the i2.2xlarge type.

NOTE: We are currently hosting a live demonstration of the GMS, installed on Amazon AWS. The Beginner's Guide to the Demonstration Analysis and Guide to Importing and Analyzing External Data have been completed on that instance. The web viewer for that demonstration instance can be browsed at: http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com/. Please visit the tutorial on Sharing GMS results using an Amazon AWS Instance for more details on how it was set up and how to connect to it.

###Downloading an external data set from GEO/SRA

To illustrate we will process data from an additional breast cancer cell line that was reported in: Daemen A*, Griffith OL*, 2013. Modeling precision treatment of breast cancer. Genome Biology. 14:R110. A good pair of candidate cell lines (with matched tumor/normal) are HCC1143/HCC1143BL. Exome-seq data are available for both tumor/normal and RNAseq data for tumor. They were sequenced each with a single lane of Illumina GAIIx 2x100bp reads for exome and 2x76bp reads for RNA-seq.

Data were deposited at GEO/SRA and are accessible through the GEO dataset superseries: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48216

DNA subseries + SRA link:

RNA subseries + SRA link:

A BioStar tutorial exists on how to download fastq/bam files for individual cell lines from this dataset here: https://www.biostars.org/p/111040/

These datasets are substantially smaller (GAIIx vs Hiseq2000 lanes) but otherwise conceptually similar to the HCC1395/HCC1395BL cell line pair that we used previously for demonstration purposes. Except of course they lack WGS.

NCBI fetch utilities are pre-installed in the sGMS. By default, downloading with prefetch is set to /home/ubuntu/ncbi/. There is insufficient space there. A quick fix for this is to simply create a new folder in /opt/ and symlink to it as follows:

mkdir /opt/ncbi/
ln -s /opt/ncbi/ /home/ubuntu/

Individual run record pages can be found here:

The run IDs for the three HCC1143/HCC1143BL datasets are SRR925765, SRR925766, and SRR925702 for tumor exome, normal exome and tumor RNAseq respectively. With these run IDs you can download data using the ncbi prefetch command.

prefetch -v SRR925765
prefetch -v SRR925766
prefetch -v SRR925702

The downloaded sra files can then be converted to fastq format

mkdir /opt/fastq
fastq-dump --outdir /opt/fastq/ --split-files /home/ubuntu/ncbi/public/sra/SRR925765.sra
fastq-dump --outdir /opt/fastq/ --split-files /home/ubuntu/ncbi/public/sra/SRR925766.sra
fastq-dump --outdir /opt/fastq/ --split-files /home/ubuntu/ncbi/public/sra/SRR925702.sra

Fastq files can be converted to unaligned bam format as expected by the GMS. This can be done using the Genome Model Tool for Picard as follows:

mkdir /opt/bam/
gmt picard fastq-to-sam --fastq=/opt/fastq/SRR925765_1.fastq --fastq2=/opt/fastq/SRR925765_2.fastq --output=/opt/bam/SRR925765_1.bam --quality-format=Standard --sample-name=HCC1143_DNA
gmt picard fastq-to-sam --fastq=/opt/fastq/SRR925766_1.fastq --fastq2=/opt/fastq/SRR925766_2.fastq --output=/opt/bam/SRR925766_1.bam --quality-format=Standard --sample-name=HCC1143BL_DNA
gmt picard fastq-to-sam --fastq=/opt/fastq/SRR925702_1.fastq --fastq2=/opt/fastq/SRR925702_2.fastq --output=/opt/bam/SRR925702_1.bam --quality-format=Standard --sample-name=HCC1143_RNA

Once all bam files have been created, you may wish to recover some disk space by cleaning up fastq and sra files:

rm -r /opt/fastq/
rm -r /home/ubuntu/ncbi/public/sra/

###Importing the external bam files and meta-data into the GMS

You can now follow the data import tutorial to add new data. This will require modifying some commands and meta-data. One strategy would be to simply copy the sample commands from the Beginner's Guide to Data Import and modify appropriately for this new data set.

cp /opt/src/gms/downsampled-demo-data/example-import-data.sh /opt/bam/import-hcc1143-data.sh

Alternatively, you can review our modified version of the import script for HCC1143 data or simply run it from /opt/src/gms/demo-data/import-hcc1143-data.sh. The following sections walk you through the changes we made to the import commands. A helper script to generate these commands automatically from a spreadsheet-format file is under development.

####First, set the instrument data directory to where bam files were downloaded

INSTRUMENT_DATA_DIRECTORY='/opt/bam'

####Add HCC1143 as an individual in the system

  • name: is an internal, unique, immutable, deidentified system name.
  • upn: can be the same, or contain an external global unique name.
  • common_name: allows a shorter, human-friendy name to be given which can be updated over time as needed.
  • taxon: for all data in this experiment the preloaded "human" taxon is used.
INDIVIDUAL='HCC1143'
genome individual create \
--name="$INDIVIDUAL" \
--upn='HCC1143' \
--common-name="HCC1143" \
--gender=female \
--taxon="name=human"

The individual can now be found by the command line and web interfaces.

genome individual list "name='$INDIVIDUAL'"

####Add samples to the system

  1. DNA from tumor tissue : HCC1143_DNA
  2. DNA from normal tissue : HCC1143BL_DNA
  3. RNA from tumor tissue : HCC1143_RNA

The essential characteristics of each sample are:

  • extraction type: either "genomic dna" or "rna"
  • source: identify the individual from which the sample came
  • name: a unique immutable identifier (can be a barcode)
  • common name: a friendly name that can be updated liesure
  • extraction label: any unique name for the source tissue (can match "name")
  • tissue desc: describe the tissue (informational, does not affect processing)
SAMPLE_TUMOR='HCC1143_DNA'
genome sample create \
--extraction-type='genomic dna' \
--source="name=$INDIVIDUAL" \
--name=$SAMPLE_TUMOR \
--common-name='tumor' \
--extraction-label='HCC1143_DNA' \
--tissue-desc='epithelial'

SAMPLE_NORMAL='HCC1143BL_DNA'
genome sample create \
--extraction-type='genomic dna' \
--source="name=$INDIVIDUAL" \
--name=$SAMPLE_NORMAL \
--common-name='normal' \
--extraction-label='HCC1143BL_DNA' \
--tissue-desc='b lymphoblast'

SAMPLE_RNA_TUMOR='HCC1143_RNA'
genome sample create \
--extraction-type='rna' \
--source="name=$INDIVIDUAL" \
--name=$SAMPLE_RNA_TUMOR \
--common-name='tumor' \
--extraction-label='HCC1143_RNA' \
--tissue-desc='epithelial'

All of these are visible by the command line now:

genome sample list "source.name='$INDIVIDUAL'"
genome sample list "source.name='$INDIVIDUAL' and extraction_type='rna'"
genome sample list "source.name='$INDIVIDUAL' and common_name like '%tumor'"

####Add DNA Fragment Libraries:

  • Each sample will have one or more libraries of DNA fragments in the system.
  • A library represents a specific set of DNA fragments, with expected fragment size, and possibly stranding data (for RNA).
  • One library of tumor DNA was used for exome sequencing. One library of normal DNA was used for exome sequencing. One library of tumor RNA was used for RNA-seq.
  • NOTE: From the GEO sample page for library construction we can learn that the targeted library insert size was 150bp.
CAPTURE_LIBRARY_TUMOR='HCC1143-capture-lib1'
genome library create \
--name="$CAPTURE_LIBRARY_TUMOR" \
--sample="$SAMPLE_TUMOR" \
--protocol='Illumina Library Construction' \
--library-insert-size='150'

CAPTURE_LIBRARY_NORMAL='HCC1143BL-capture-lib1'
genome library create \
--name="$CAPTURE_LIBRARY_NORMAL" \
--sample="$SAMPLE_NORMAL" \
--protocol='Illumina Library Construction' \
--library-insert-size='150'

LIBRARY_RNA_TUMOR='HCC1143_RNA-lib1'
genome library create \
--name="$LIBRARY_RNA_TUMOR" \
--sample="$SAMPLE_RNA_TUMOR" \
--protocol='Illumina Library Construction' \
--original-insert-size='150' \
--library-insert-size='150' \
--transcript-strand='unstranded'

Each is listable once created:

genome library list "sample.patient.common_name='HCC1143'"
genome library list "sample.name='$SAMPLE_TUMOR' and library_insert_size > 100"

Add sequence reads to the system

  • Each "instrument data" can represent sequence reads, microarray data, or any other data emitted by a laboratory instrument to be used as an input to analysis. For this example all are Illumina NGS reads.
  • For indexed data, there is one instrument data entry per flow cell + lane + index. For unindexed data, there is one per flow cell + lane.
  • The GMS prefers reads to be in BAM format (pre alignment), for efficiency.
  • Note that the flow_cell_id, lane, and index sequence are extracted from the BAM name automatically.
  • Note that the number of clusters can be obtained from "# of Spots" the SRA record pages.
  • Note that for targeted (e.g., Exome) data a "target_region_set_name" is specified. The value is used to look-up a set of positions in the system which were pre-loaded during installation. See genome feature-list --help for more details.
genome instrument-data import basic \
--description='tumor rna 1' \
--import-source-name='GEO_SRA' \
--instrument-data-properties='clusters=25116917' \
--source-files="$INSTRUMENT_DATA_DIRECTORY/SRR925702_1.bam" \
--library="$LIBRARY_RNA_TUMOR"

TARGET_SET='NimbleGen v3 Capture Chip Set'
genome instrument-data import basic \
--description='tumor exome 1' \
--import-source-name='GEO_SRA' \
--instrument-data-properties="clusters=72512840,target_region_set_name=$TARGET_SET" \
--source-files="$INSTRUMENT_DATA_DIRECTORY/SRR925765_1.bam" \
--library="$CAPTURE_LIBRARY_TUMOR"

genome instrument-data import basic \
--description='normal exome 1' \
--import-source-name='GEO_SRA' \
--instrument-data-properties="clusters=36138317,target_region_set_name=$TARGET_SET" \
--source-files="$INSTRUMENT_DATA_DIRECTORY/SRR925766_1.bam" \
--library="$CAPTURE_LIBRARY_NORMAL"

All instrument-data is listable as above, and can be filtered by params.

genome instrument-data list imported "sample.patient.common_name='HCC1143' and sample.extraction_type = 'genomic dna'"
genome instrument-data list imported "sample.patient.common_name='HCC1143' and sample.extraction_type = 'rna'"

###Analyzing the imported data

To create the required models and start appropriate builds you can use genome model clin-seq advise and follow the instructions. First, use the common name to identify sample IDs for HCC1143. Then run again, specifying those sample ids. Follow the prompts to define models and start builds. Depending on the resources of your hardware you may need to first run the two reference alignment builds (wait for completion), then exome somatic variation (wait for completion), then RNAseq (wait for completion), and finally clin-seq.

genome model clin-seq advise --allow-imported --individual='common_name=HCC1143'
genome model clin-seq advise --allow-imported --individual='common_name=HCC1143' --samples='id in [eb963047a2054fbbad872b35a7a98523,0456ded1220949d09a38726b36d0ed25,9be3ad5f2e5d45e9b3e3680be3e9d5c8]'

Visualizing the results

The results of our own run-through of this tutorial are currently live in this AWS instance: http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com/opt/gms/U6GNV74/fs/U6GNV74/info/model_data/8cae417fdb1648ee8daa88a974a4f596/build99cd0558dd4c4c2ea43076a503c7a78b/HCC1143/

Please feel free to browse these results. See Location and description of results files in GMS pipelines for an explanation of the path to most interesting result files (especially the clinseq/medseq results).

To provide just one example here, the following is a composite of two images from the mutation_diagrams folder showing the presence of a TP53 R248Q mutation in HCC1143 and comparing to mutation frequencies observed in COSMIC for this gene.

Opening the ~/snv/exome/summary/VariantExpressionSummary.tsv file reveals that this TP53 mutation is present at 0% frequency in normal, 98.6% in tumor exome, and 98.2% in tumor rnaseq data and a relatively high FPKM of 157.6. This suggests a somatic single-nucleotide mutation followed by loss of heterozygosity and resulting allele-specific expression of this mutant copy. Opening the ~/snv_indel_report/HCC1143_final_filtered_coding_clean.xls file further reveals that this mutation was called by three variant callers (sniper, strelka and varscan) and corresponds to a known dbSNP entry: rs11540652. This is a known pathogenic allele reported as a sometimes somatic, sometimes germline event and associated with multiple cancer types as well as Li-Fraumeni syndrome.

Viewing the raw data for this region is as simple as going to the ~/igv/level1_igv_session.xml file. Right-click and copy the link location, open IGV, and choose File->Load from URL. This will load the alignment bam files and bed tracks necessary to manually review the variant in question.

Clone this wiki locally