Skip to content
This repository has been archived by the owner on Jan 31, 2020. It is now read-only.
Obi Griffith edited this page Feb 15, 2015 · 26 revisions

in progress

Contents

Process Flow

Processing Profile: 2762841

Description

Alignment

Each lane or sub-lane of data is aligned to GRCh37-lite with Ensembl 74 annotation using TopHat 1,2,3 v2.0.4 which performs mappings using Bowtie2 v2.0.0 after "bisecting" each read for optimal cross-exon placement. The only variation from default command line parameters is setting the number of cores and bowtie version (-p 4 --bowtie-version=2.0.0). Downstream BAM file conversion and manipulation performed with Picard4 v1.52.

Quality Assurance

Picard CollectRnaSeqMetrics is performed using the Ensembl v74 annotation and the GRCh37-lite reference to assess the amount of the mapped reads that align to the genomic components (ex. intergenic, intronic, coding, utr, and ribosomal or mitochondrial).

Abundance Estimation

Resultant TopHat mappings are processed through Cufflinks 4,5,6,7 v2.0.2 which assembles aligned RNA-seq reads into transcripts and estimates their abundances. Cufflinks is run in three modes using the annotation in various ways to identify abundance of a de novo transcript assembly(default), known transcripts only(-G), and a reference-guided hybrid mode (-g). Cufflinks is run using Ensembl v74 annotation to help drive transcript identification in the both the known transcripts and reference-guided modes. For all three modes abundance estimates are made using FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) values.

references

  1. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
  2. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.
  3. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. . Genome Biology 2011, 14:R36
  4. http://picard.sourceforge.net.
  5. Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
  6. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias Genome Biology doi:10.1186/gb-2011-12-3-r22
  7. Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq Bioinformatics doi:10.1093/bioinformatics/btr355

What is RNA-seq

Overview

Conceptual

Actual

Pipeline

TopHat2

How to Analyze RNA-seq data?

Start an RNA-seq Genome Model

  1. Determine Model Inputs a) Instrument Data (The first library name starts with H_KH-540077-Tumor-cDNA-8):
$ genome instrument-data list solexa --filter library_name=Pooled_RNA_2891007020-mRNA2-cDNA-1-lig1-lib1
ID           FLOW_CELL_ID   LANE   INDEX_SEQUENCE   SAMPLE_NAME                LIBRARY_NAME                                   READ_LENGTH   IS_PAIRED_END   CLUSTERS    MEDIAN_INSERT_SIZE   SD_ABOVE_INSERT_SIZE   TARGET_REGION_SET_NAME
--           ------------   ----   --------------   -----------                ------------                                   -----------   -------------   --------    ------------------   --------------------   ----------------------
2891354555   C1TD1ACXX      8      ACAGTG           H_NJ-HCC1395-HCC1395_RNA   Pooled_RNA_2891007020-mRNA2-cDNA-1-lig1-lib1   100           1               156248832   247                  1421                   <NULL>
 b) Process Profile (current defaults/only option) is : 2762841 (October 2012 Default Ovation V2 RNA-seq)

 c) Reference Sequence: 106942997

 d) Annotation: 124434505
  1. Create a Genome Model Group to help organize models:
$ genome model-group create $USER-rnaseq-tutorial-test

Created model group:
ID: 8a6d9e68ceed4c96af3cc36d6b321cd6, NAME: gmsuser-rnaseq-tutorial-test

3. Define a Genome Model:

$ genome model define rna-seq --groups=8a6d9e68ceed4c96af3cc36d6b321cd6 --reference-sequence-build=106942997 --annotation-build=124434505 --instrument-data=2891354555 --processing-profile=2762841

'instrument_data', 'annotation_build', 'groups', 'reference_sequence_build', and 'processing_profile' may require verification...
Resolving parameter 'instrument_data' from command argument '2891354555'... found 1
Resolving parameter 'annotation_build' from command argument '124434505'... found 1
Resolving parameter 'groups' from command argument '8a6d9e68ceed4c96af3cc36d6b321cd6'... found 1
Resolving parameter 'reference_sequence_build' from command argument '106942997'... found 1
Resolving parameter 'processing_profile' from command argument '2762841'... found 1
Created model:

id: 4db14ea02335464dab7499b83b4a50d4
name: H_NJ-HCC1395-HCC1395_RNA.rna_seq
subject: H_NJ-HCC1395-HCC1395_RNA (TST1 tumor 2889953342)
processing_profile: Default Ovation V2 RNA-seq (2762841)

4. Start a Build of the Genome Model

$ genome model build start 8a6d9e68ceed4c96af3cc36d6b321cd6

6. Interrogate Build Results (see next section)

Create an RNA-seq only annotation build

Sometimes you may have RNA-seq data that you want to analyze with a non-Ensembl set of transcripts. If you have a gtf file, you should be able to do this by creating an rna-seq only imported annotation build.

The resulting build can be used as the --annotation-build input to the RNA-seq model.

Output Files and Formats

This is not a comprehensive overview of RNA-seq analysis, but rather a primer on the RNA-seq pipeline:

GMS Build Data Directory

Name Type Description
build.xml Text File Workflow XML file used by the GMS to run the analysis on the TGI cluster.
alignments Symbolic Link GMS Alignment API software result.
alignment_stats Symbolic Link An in-house tool is used to generate alignment metrics about the BAM file
metrics Symbolic Link Picard CollectRnaSeqMetrics software result
junctions Symbolic Link An in-house tool used to summarize splice junction coverage.
bam-qc Symbolic Link Software result containing general quality control data including: Picard, Samtools, SAMStat, FastQC, and in-house tools.
logs Directory The standard error and standard out of all applications and workflow commands used to generated the build.
de_novo Symbolic Link The 'de novo' mode Cufflinks software result.
reference_only Symbolic Link The 'reference only' mode Cufflinks software result.
expression Symbolic Link The 'reference only' mode Cufflinks software result. Retained for backward compatibility.
reference_guided Symbolic Link The 'reference guided' mode Cufflinks software result.
results Directory A base-level results directory. Created to provide consistency with newer build data directories.
fusions Directory A base-level directory for fusion detection results.
reports Directory The auto-generated pipeline reports created by the GMS. Typically these are only relevant to a build's success or failure.

Alignments

The files contained in the alignments/ directory are the product of the alignment of all sequence reads to a reference genome. Each instrument data is aligned independently and each result exists on the filesystem. The individual alignment results are not linked to the build data directory, but are currently retained. Please see the aligner documentation for the contents of the individual alignment results.

Name Type Description
*.bam Binary file The merged, position sorted, and optionally marked duplicate BAM file of alignments.
*.bam.bai Binary File An index of the BAM file.
*.bam.md5 Text File The md5 checksum of the BAM file.
*.bam.flagstat Text File Samtools flagstat output for the BAM file. See the Alignment Stats section for notes.
*.log Text File (optional) Per-library log files from Picard MarkDuplicates.
*.metrics Text File (optional) Per-library metric files from Picard MarkDuplicates.

Alignment Stats

Transcriptome alignment is not quite as simple as genome alignment. RNAseq reads are typically allowed to align to multiple locations. Therefore, the SAMTools flagstat output is not useful. It only accounts for the number of alignments in a BAM file rather than the number of reads. A tool was developed at The Genome Institute to uniquely count the number of reads and metrics about the alignment of those reads. The tool is a GMS command with a package name of Genome::Model::Tools::BioSamtools::Tophat2AlignmentStats.

Name Type Description
alignment_stats.txt Text File A tab-separated value file of alignment metrics per chromosome. A footer exists with summary metrics prefixed by '##'.

alignment_stats.txt

The tab-delimited columns described here are written per chromosome for the reference:

Column Header Description
1 chr The reference genome chromosome for which each metric is reported.
2 top Reads with one, unique alignment to the genome.
3 top-spliced Spliced, aligned reads with one, unique alignment to the genome.
4 pct_top_spliced A percentage of the top reads with spliced alignment.
5 multi_reads Reads with multiple alignments to the genome.
6 pct_multi_reads A percentage of the total reads that map to multiple locations.
7 multi_spliced_reads Spliced, aligned reads with more than one alignment to the genome.
8 pct_multi_spliced_reads A percentage of the multiple alignment reads that are spliced alignment.
9 multi_hits The total number of alignments that multiple mapped reads consume.

the '##' prefixed summary metrics are for the entire reference genome:

Name Description
Total Reads Total number of reads in the BAM file.
Unmapped Reads Number of unmapped reads in the BAM file.
Unique Alignments Alignments contained in the BAM file that are uniquely mapped.
Multiple Hit Reads Reads in the BAM file that have multiple alignments.
Multiple Hit Sum The sum of the number of alignments consumed by multiply aligned reads.
Total Reads Mapped Number of reads that are mapped, either uniquely or with multiple alignments.

Metrics

All of the files in the metrics/ directory are a product of Picard CollectRnaSeqMetrics. For a complete definition of each metrics see the Picard Metric Definitions page.

Name Type Description
PicardRnaSeqChart.pdf PDF File Normalized RNA−Seq Coverage vs. Normalized Transcript Position for the top 1,000 expressed genes.
PicardRnaSeqMetrics.png Image File Three dimensional pie chart of the proportion of aligned data that belongs to coding, UTR, intronic, intergenic and mitochondrial/ribosomal annotation.
PicardRnaSeqMetrics.txt Text File A text file of RNA-Seq metrics including basic alignment summary, proportion of alignments to annotation, transcript bias and strand specificity.
Picard_mRNA.refFlat Text File refFlat annotation file used by the Picard tool to collect RNA-Seq metrics. Converted from GTF using gtfToGenePred and then an in-house tool to print the first 9 columns as refFlat.
Picard_ribo.intervals Text File The IntervalList annotation file of mitochondrial and ribosomal annotation used by the Picard tool to collect RNA-Seq metrics.

Junctions

Name Type Description
AlignmentResult_*_junctions.bed Text File
*.ExonContent.bed Text File
*.Exons.bed Text File
*.Genes.bed Text File
*.Junction.GeneExpression.top1percent.tsv Text File
*.Junction.GeneExpression.tsv Text File
*.Junction.TranscriptExpression.top1percent.tsv Text File
*.Junction.TranscriptExpression.tsv Text File
*.junctions.bed Text File
*.junctions.tsv Text File
SpliceJunctionSummary.R.stderr Text File
SpliceJunctionSummary.R.stdout Text File
junctions.bed Text File
observed.junctions.anno.*.tsv Text File
observed.junctions.bed Text File
summary Directory

summary

Name Type Description
ObservedJunctions_SpliceSiteUsage_Pie.pdf PDF File
KnownJunctions_SpliceSiteUsage_Pie.pdf PDF File
ObservedJunctions_SpliceSiteAnchorTypes_Pie.pdf PDF File
ObservedJunctions_ReadCounts_Log2_Hist.pdf PDF File
GeneJunctionReadCounts_Log2_Hist.pdf PDF File
TranscriptJunctionReadCounts_Log2_Hist.pdf PDF File
KnownJunctionCounts_Genes_Log2_Hist.pdf PDF File
KnownJunctionCounts_Transcripts_Log2_Hist.pdf PDF File
JPJMs_Genes_Log2_Hist.pdf PDF File
JPJMs_Transcripts_Log2_Hist.pdf PDF File
ExonSkippingEvents_KnownVsNovel_LinePlot.pdf PDF File
PercentGeneJunctionsCovered_BreadthvsDepth_BoxPlot.pdf PDF File
IntronSizes_ObservedVsKnown_Density.pdf PDF File
Stats.tsv Text File

BAM QC

Name Type Description
*.metrics Symbolic Link
*-AlignmentLengthDistribution.tsv Text File
*-ErrorRate.tsv Text File
*-ErrorRate_rates_1.png Image File
*-ErrorRate_rates_2.png Image File
*-PicardGC_chart.pdf PDF File
*-PicardGC_metrics.txt Text File
*-PicardGC_summary.txt Text File
*-PicardMetrics.alignment_summary_metrics Text File
*-PicardMetrics.insert_size_histogram.pdf PDF File
*-PicardMetrics.insert_size_metrics Text File
*-PicardMetrics.quality_by_cycle.pdf PDF File
*-PicardMetrics.quality_by_cycle_metrics Text File
*-PicardMetrics.quality_distribution.pdf PDF File
*-PicardMetrics.quality_distribution_metrics Text File
*-ReadLengthDistribution.tsv Text File
*-ReadLengthSummary.tsv Text File
*.bam Symbolic Link
*.bam.bai Symbolic Link
*.bam.flagstat Symbolic Link
*.bam.html HTML File
*_fastqc Directory
*_fastqc.zip Compressed Archive

*_fastqc

Name Type Description
summary.txt Text File
fastqc_report.html HTML File
fastqc_data.txt Text File
Icons Directory
Images Directory
icons
Name Type Description
error.png Image File
fastqc_icon.png Image File
tick.png Image File
warning.png Image File
images
Name Type Description
per_base_quality.png Image File
per_sequence_quality.png Image File
per_base_sequence_content.png Image File
per_base_gc_content.png Image File
per_sequence_gc_content.png Image File
per_base_n_content.png Image File
sequence_length_distribution.png Image File
duplication_levels.png Image File

Logs

The logs/ directory contains the standard error and standard out of each cluster job generated during the build process. These log files are useful for debugging failures, but their diversity and format are beyond the scope of this documentation.

De Novo (optional)

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'de novo' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Reference Guided (optional)

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'reference guided' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Reference Only

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'reference only' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

expression

The expression/ symlink points to the 'reference only' Cufflinks mode. This symlink exists simply to maintain backward compatibility with existing software.

Results

The results directory is a base-level directory to serve as a container for all build results. Newer pipelines use this convention and future results will be added to this directory

underlying_results
Name Type Description
  • | Symbolic Link | A symlink to each instrument data run of htseq-count.
per-instrument data result
Name Type Description
transcript-counts.tsv Text File
transcript.err Text File
gene-counts.tsv Text File
gene.err Text File

TopHat

All TopHat files are well documented in the TopHat user manual:

http://tophat.cbcb.umd.edu/manual.shtml#output

Cufflinks

All Cufflinks output files are well documented in the Cufflinks user manual:

http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Frequently Asked Questions

Why did the RNA-seq metrics report a large amount of intronic sequence?

http://www.biostars.org/p/42890/

How do I list the BAM location for a model?

Models do not have results of their own--they are merely a container for builds. Thus we need to select a build from the model. "last_succeeded_build" (and "last_complete_build", which is synonymous) give us the most recent build that completed successfully.

"whole_rmdup_bam_file" is something particular to reference alignment builds. Both reference alignment and rna-seq have a merged_alignment_result (which is itself a Genome::InstrumentData::AlignmentResult::Merged). Looking in that module we see that there is a bam_path accessor.

$ genome model list --filter id=abcd1234 --show id,name,last_complete_build.merged_alignment_result.bam_path

For strand specific library construction protocols, which settings do I use for each RNA-seq software tool? TopHat

http://tophat.cbcb.umd.edu/faq.shtml#library_type

Illumina specifically documents the library type in the Illumina RNA-seq Tophat Analysis (originally downloaded from http://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf): "For the TruSeq RNA Sample Prep Kit, the appropriate library type is 'fr-unstranded'. For TruSeq stranded sample prep kits, the library type is specified as 'fr-firststrand'."

This blog post is also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)?

Another suggestion is to load the aligned reads in IGV and look at the read orientation. You can do this by hovering over a read aligned to an exon.

F2 R1 means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5' to 3') this would denote a fr-firststrand setting in TopHat, ie. "the right-most end of the fragment (in transcript coordinates) is the first sequenced". For a negative DNA strand transcript (3' to 5') this would denote a fr-secondstrand setting in TopHat.

F1 R2 means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it's simply the inverse for F1 R2 mapping.

Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically RF orientation is reserved for large-insert mate-pair libraries. Other orientations like FF and RR seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference.

Why did the RNA-seq metrics report a large amount of intronic sequence?

http://www.biostars.org/p/42890/

How do I list the BAM location for a model?

Models do not have results of their own--they are merely a container for builds. Thus we need to select a build from the model. "last_succeeded_build" (and "last_complete_build", which is synonymous) give us the most recent build that completed successfully.

"whole_rmdup_bam_file" is something particular to reference alignment builds. Both reference alignment and rna-seq have a merged_alignment_result (which is itself a Genome::InstrumentData::AlignmentResult::Merged). Looking in that module we see that there is a bam_path accessor.

$ genome model list --filter id=abcd1234 --show id,name,last_complete_build.merged_alignment_result.bam_path

For strand specific library construction protocols, which settings do I use for each RNA-seq software tool?

TopHat

http://tophat.cbcb.umd.edu/faq.shtml#library_type

Illumina specifically documents the library type in the Illumina RNA-seq Tophat Analysis (originally downloaded from http://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf): "For the TruSeq RNA Sample Prep Kit, the appropriate library type is 'fr-unstranded'. For TruSeq stranded sample prep kits, the library type is specified as 'fr-firststrand'."

This blog post is also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)?

Another suggestion is to load the aligned reads in IGV and look at the read orientation. You can do this by hovering over a read aligned to an exon.

F2 R1 means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5' to 3') this would denote a fr-firststrand setting in TopHat, ie. "the right-most end of the fragment (in transcript coordinates) is the first sequenced". For a negative DNA strand transcript (3' to 5') this would denote a fr-secondstrand setting in TopHat.

F1 R2 means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it's simply the inverse for F1 R2 mapping.

Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically RF orientation is reserved for large-insert mate-pair libraries. Other orientations like FF and RR seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference.

Picard

http://picard.sourceforge.net/command-line-overview.shtml#CollectRnaSeqMetrics

Summary2

In summary, the settings for commonly used RNA-seq library construction kits are:

Library Kit 5' to 3' IGV Tophat HTSeq Picard
TruSeq Strand Specific Total RNA F2R1 fr-firststrand reverse
NuGEN Encore F1R2 fr-secondstrand yes
NuGEN OvationV2 F2R1 or F1R2 fr-unstranded no
Clone this wiki locally