RnaSeq

in progress

Process Flow

Processing Profile: 2762841

Description

Alignment

Each lane or sub-lane of data is aligned to GRCh37-lite with Ensembl 74 annotation using TopHat _1,2,3 v2.0.4 which performs mappings using Bowtie2 v2.0.0 after "bisecting" each read for optimal cross-exon placement. The only variation from default command line parameters is setting the number of cores and bowtie version (-p 4 --bowtie-version=2.0.0). Downstream BAM file conversion and manipulation performed with Picard4 v1.52.

Quality Assurance

Picard CollectRnaSeqMetrics is performed using the Ensembl v74 annotation and the GRCh37-lite reference to assess the amount of the mapped reads that align to the genomic components (ex. intergenic, intronic, coding, utr, and ribosomal or mitochondrial).

Abundance Estimation

Resultant TopHat mappings are processed through Cufflinks _4,5,6,7 v2.0.2 which assembles aligned RNA-seq reads into transcripts and estimates their abundances. Cufflinks is run in three modes using the annotation in various ways to identify abundance of a de novo transcript assembly(default), known transcripts only(-G), and a reference-guided hybrid mode (-g). Cufflinks is run using Ensembl v74 annotation to help drive transcript identification in the both the known transcripts and reference-guided modes. For all three modes abundance estimates are made using FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) values.

references

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. . Genome Biology 2011, 14:R36
http://picard.sourceforge.net.
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnology doi:10.1038/nbt.1621
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias Genome Biology doi:10.1186/gb-2011-12-3-r22
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq Bioinformatics doi:10.1093/bioinformatics/btr355

What is RNA-seq

Overview

Conceptual

Actual

Pipeline

TopHat2

How to Analyze RNA-seq data?

Start an RNA-seq Genome Model

Determine Model Inputs a) Instrument Data (The first library name starts with H_KH-540077-Tumor-cDNA-8):

$ genome instrument-data list solexa --filter library_name=Pooled_RNA_2891007020-mRNA2-cDNA-1-lig1-lib1
ID           FLOW_CELL_ID   LANE   INDEX_SEQUENCE   SAMPLE_NAME                LIBRARY_NAME                                   READ_LENGTH   IS_PAIRED_END   CLUSTERS    MEDIAN_INSERT_SIZE   SD_ABOVE_INSERT_SIZE   TARGET_REGION_SET_NAME
--           ------------   ----   --------------   -----------                ------------                                   -----------   -------------   --------    ------------------   --------------------   ----------------------
2891354555   C1TD1ACXX      8      ACAGTG           H_NJ-HCC1395-HCC1395_RNA   Pooled_RNA_2891007020-mRNA2-cDNA-1-lig1-lib1   100           1               156248832   247                  1421                   <NULL>

 b) Process Profile (current defaults/only option) is : 2762841 (October 2012 Default Ovation V2 RNA-seq)

 c) Reference Sequence: 106942997

 d) Annotation: 124434505

Create a Genome Model Group to help organize models:

$ genome model-group create $USER-rnaseq-tutorial-test

Created model group:
ID: 8a6d9e68ceed4c96af3cc36d6b321cd6, NAME: gmsuser-rnaseq-tutorial-test

3. Define a Genome Model:

$ genome model define rna-seq --groups=8a6d9e68ceed4c96af3cc36d6b321cd6 --reference-sequence-build=106942997 --annotation-build=124434505 --instrument-data=2891354555 --processing-profile=2762841

'instrument_data', 'annotation_build', 'groups', 'reference_sequence_build', and 'processing_profile' may require verification...
Resolving parameter 'instrument_data' from command argument '2891354555'... found 1
Resolving parameter 'annotation_build' from command argument '124434505'... found 1
Resolving parameter 'groups' from command argument '8a6d9e68ceed4c96af3cc36d6b321cd6'... found 1
Resolving parameter 'reference_sequence_build' from command argument '106942997'... found 1
Resolving parameter 'processing_profile' from command argument '2762841'... found 1
Created model:

id: 4db14ea02335464dab7499b83b4a50d4
name: H_NJ-HCC1395-HCC1395_RNA.rna_seq
subject: H_NJ-HCC1395-HCC1395_RNA (TST1 tumor 2889953342)
processing_profile: Default Ovation V2 RNA-seq (2762841)

4. Start a Build of the Genome Model

$ genome model build start 8a6d9e68ceed4c96af3cc36d6b321cd6

6. Interrogate Build Results (see next section)

Create an RNA-seq only annotation build

Sometimes you may have RNA-seq data that you want to analyze with a non-Ensembl set of transcripts. If you have a gtf file, you should be able to do this by creating an rna-seq only imported annotation build.

The resulting build can be used as the --annotation-build input to the RNA-seq model.

Output Files and Formats

This is not a comprehensive overview of RNA-seq analysis, but rather a primer on the RNA-seq pipeline:

GMS Build Data Directory

Name	Type	Description
build.xml	Text File	Workflow XML file used by the GMS to run the analysis on the TGI cluster.
alignments	Symbolic Link	GMS Alignment API software result.
alignment_stats	Symbolic Link	An in-house tool is used to generate alignment metrics about the BAM file
metrics	Symbolic Link	Picard CollectRnaSeqMetrics software result
junctions	Symbolic Link	An in-house tool used to summarize splice junction coverage.
bam-qc	Symbolic Link	Software result containing general quality control data including: Picard, Samtools, SAMStat, FastQC, and in-house tools.
logs	Directory	The standard error and standard out of all applications and workflow commands used to generated the build.
de_novo	Symbolic Link	The 'de novo' mode Cufflinks software result.
reference_only	Symbolic Link	The 'reference only' mode Cufflinks software result.
expression	Symbolic Link	The 'reference only' mode Cufflinks software result. Retained for backward compatibility.
reference_guided	Symbolic Link	The 'reference guided' mode Cufflinks software result.
results	Directory	A base-level results directory. Created to provide consistency with newer build data directories.
fusions	Directory	A base-level directory for fusion detection results.
reports	Directory	The auto-generated pipeline reports created by the GMS. Typically these are only relevant to a build's success or failure.

Alignments

The files contained in the alignments/ directory are the product of the alignment of all sequence reads to a reference genome. Each instrument data is aligned independently and each result exists on the filesystem. The individual alignment results are not linked to the build data directory, but are currently retained. Please see the aligner documentation for the contents of the individual alignment results.

Name	Type	Description
*.bam	Binary file	The merged, position sorted, and optionally marked duplicate BAM file of alignments.
*.bam.bai	Binary File	An index of the BAM file.
*.bam.md5	Text File	The md5 checksum of the BAM file.
*.bam.flagstat	Text File	Samtools flagstat output for the BAM file. See the Alignment Stats section for notes.
*.log	Text File	(optional) Per-library log files from Picard MarkDuplicates.
*.metrics	Text File	(optional) Per-library metric files from Picard MarkDuplicates.

Alignment Stats

Transcriptome alignment is not quite as simple as genome alignment. RNAseq reads are typically allowed to align to multiple locations. Therefore, the SAMTools flagstat output is not useful. It only accounts for the number of alignments in a BAM file rather than the number of reads. A tool was developed at The Genome Institute to uniquely count the number of reads and metrics about the alignment of those reads. The tool is a GMS command with a package name of Genome::Model::Tools::BioSamtools::Tophat2AlignmentStats.

Name	Type	Description
alignment_stats.txt	Text File	A tab-separated value file of alignment metrics per chromosome. A footer exists with summary metrics prefixed by '##'.

alignment_stats.txt

The tab-delimited columns described here are written per chromosome for the reference:

Column	Header	Description
1	chr	The reference genome chromosome for which each metric is reported.
2	top	Reads with one, unique alignment to the genome.
3	top-spliced	Spliced, aligned reads with one, unique alignment to the genome.
4	pct_top_spliced	A percentage of the top reads with spliced alignment.
5	multi_reads	Reads with multiple alignments to the genome.
6	pct_multi_reads	A percentage of the total reads that map to multiple locations.
7	multi_spliced_reads	Spliced, aligned reads with more than one alignment to the genome.
8	pct_multi_spliced_reads	A percentage of the multiple alignment reads that are spliced alignment.
9	multi_hits	The total number of alignments that multiple mapped reads consume.

the '##' prefixed summary metrics are for the entire reference genome:

Name	Description
Total Reads	Total number of reads in the BAM file.
Unmapped Reads	Number of unmapped reads in the BAM file.
Unique Alignments	Alignments contained in the BAM file that are uniquely mapped.
Multiple Hit Reads	Reads in the BAM file that have multiple alignments.
Multiple Hit Sum	The sum of the number of alignments consumed by multiply aligned reads.
Total Reads Mapped	Number of reads that are mapped, either uniquely or with multiple alignments.

Metrics

All of the files in the metrics/ directory are a product of Picard CollectRnaSeqMetrics. For a complete definition of each metrics see the Picard Metric Definitions page.

Name	Type	Description
PicardRnaSeqChart.pdf	PDF File	Normalized RNA−Seq Coverage vs. Normalized Transcript Position for the top 1,000 expressed genes.
PicardRnaSeqMetrics.png	Image File	Three dimensional pie chart of the proportion of aligned data that belongs to coding, UTR, intronic, intergenic and mitochondrial/ribosomal annotation.
PicardRnaSeqMetrics.txt	Text File	A text file of RNA-Seq metrics including basic alignment summary, proportion of alignments to annotation, transcript bias and strand specificity.
Picard_mRNA.refFlat	Text File	refFlat annotation file used by the Picard tool to collect RNA-Seq metrics. Converted from GTF using gtfToGenePred and then an in-house tool to print the first 9 columns as refFlat.
Picard_ribo.intervals	Text File	The IntervalList annotation file of mitochondrial and ribosomal annotation used by the Picard tool to collect RNA-Seq metrics.

Junctions

Name	Type	Description
AlignmentResult_*_junctions.bed	Text File
*.ExonContent.bed	Text File
*.Exons.bed	Text File
*.Genes.bed	Text File
*.Junction.GeneExpression.top1percent.tsv	Text File
*.Junction.GeneExpression.tsv	Text File
*.Junction.TranscriptExpression.top1percent.tsv	Text File
*.Junction.TranscriptExpression.tsv	Text File
*.junctions.bed	Text File
*.junctions.tsv	Text File
SpliceJunctionSummary.R.stderr	Text File
SpliceJunctionSummary.R.stdout	Text File
junctions.bed	Text File
observed.junctions.anno.*.tsv	Text File
observed.junctions.bed	Text File
summary	Directory

summary

Name	Type	Description
ObservedJunctions_SpliceSiteUsage_Pie.pdf	PDF File
KnownJunctions_SpliceSiteUsage_Pie.pdf	PDF File
ObservedJunctions_SpliceSiteAnchorTypes_Pie.pdf	PDF File
ObservedJunctions_ReadCounts_Log2_Hist.pdf	PDF File
GeneJunctionReadCounts_Log2_Hist.pdf	PDF File
TranscriptJunctionReadCounts_Log2_Hist.pdf	PDF File
KnownJunctionCounts_Genes_Log2_Hist.pdf	PDF File
KnownJunctionCounts_Transcripts_Log2_Hist.pdf	PDF File
JPJMs_Genes_Log2_Hist.pdf	PDF File
JPJMs_Transcripts_Log2_Hist.pdf	PDF File
ExonSkippingEvents_KnownVsNovel_LinePlot.pdf	PDF File
PercentGeneJunctionsCovered_BreadthvsDepth_BoxPlot.pdf	PDF File
IntronSizes_ObservedVsKnown_Density.pdf	PDF File
Stats.tsv	Text File

BAM QC

Name	Type	Description
*.metrics	Symbolic Link
*-AlignmentLengthDistribution.tsv	Text File
*-ErrorRate.tsv	Text File
*-ErrorRate_rates_1.png	Image File
*-ErrorRate_rates_2.png	Image File
*-PicardGC_chart.pdf	PDF File
*-PicardGC_metrics.txt	Text File
*-PicardGC_summary.txt	Text File
*-PicardMetrics.alignment_summary_metrics	Text File
*-PicardMetrics.insert_size_histogram.pdf	PDF File
*-PicardMetrics.insert_size_metrics	Text File
*-PicardMetrics.quality_by_cycle.pdf	PDF File
*-PicardMetrics.quality_by_cycle_metrics	Text File
*-PicardMetrics.quality_distribution.pdf	PDF File
*-PicardMetrics.quality_distribution_metrics	Text File
*-ReadLengthDistribution.tsv	Text File
*-ReadLengthSummary.tsv	Text File
*.bam	Symbolic Link
*.bam.bai	Symbolic Link
*.bam.flagstat	Symbolic Link
*.bam.html	HTML File
*_fastqc	Directory
*_fastqc.zip	Compressed Archive

*_fastqc

Name	Type	Description
summary.txt	Text File
fastqc_report.html	HTML File
fastqc_data.txt	Text File
Icons	Directory
Images	Directory

icons

Name	Type	Description
error.png	Image File
fastqc_icon.png	Image File
tick.png	Image File
warning.png	Image File

images

Name	Type	Description
per_base_quality.png	Image File
per_sequence_quality.png	Image File
per_base_sequence_content.png	Image File
per_base_gc_content.png Image	File
per_sequence_gc_content.png	Image File
per_base_n_content.png	Image File
sequence_length_distribution.png	Image File
duplication_levels.png	Image File

Logs

The logs/ directory contains the standard error and standard out of each cluster job generated during the build process. These log files are useful for debugging failures, but their diversity and format are beyond the scope of this documentation.

De Novo (optional)

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'de novo' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Reference Guided (optional)

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'reference guided' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Reference Only

Cufflinks has three modes: de novo, reference guided and reference only. This symlinked directory contains the output of Cufflinks run in the 'reference only' mode. All Cufflinks output files are well documented in the Cufflinks user manual: http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

expression

The expression/ symlink points to the 'reference only' Cufflinks mode. This symlink exists simply to maintain backward compatibility with existing software.

Results

The results directory is a base-level directory to serve as a container for all build results. Newer pipelines use this convention and future results will be added to this directory

underlying_results

Name	Type	Description

| Symbolic Link | A symlink to each instrument data run of htseq-count.

per-instrument data result

Name	Type	Description
transcript-counts.tsv	Text File
transcript.err	Text File
gene-counts.tsv	Text File
gene.err	Text File

TopHat

All TopHat files are well documented in the TopHat user manual:

http://tophat.cbcb.umd.edu/manual.shtml#output

Cufflinks

All Cufflinks output files are well documented in the Cufflinks user manual:

http://cufflinks.cbcb.umd.edu/manual.html#cufflinks_output

Frequently Asked Questions

Why did the RNA-seq metrics report a large amount of intronic sequence?

http://www.biostars.org/p/42890/

How do I list the BAM location for a model?

Models do not have results of their own--they are merely a container for builds. Thus we need to select a build from the model. "last_succeeded_build" (and "last_complete_build", which is synonymous) give us the most recent build that completed successfully.

"whole_rmdup_bam_file" is something particular to reference alignment builds. Both reference alignment and rna-seq have a merged_alignment_result (which is itself a Genome::InstrumentData::AlignmentResult::Merged). Looking in that module we see that there is a bam_path accessor.

$ genome model list --filter id=abcd1234 --show id,name,last_complete_build.merged_alignment_result.bam_path

For strand specific library construction protocols, which settings do I use for each RNA-seq software tool? TopHat

http://tophat.cbcb.umd.edu/faq.shtml#library_type

Illumina specifically documents the library type in the Illumina RNA-seq Tophat Analysis (originally downloaded from http://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf): "For the TruSeq RNA Sample Prep Kit, the appropriate library type is 'fr-unstranded'. For TruSeq stranded sample prep kits, the library type is specified as 'fr-firststrand'."

This blog post is also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)?

Another suggestion is to load the aligned reads in IGV and look at the read orientation. You can do this by hovering over a read aligned to an exon.

F2 R1 means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5' to 3') this would denote a fr-firststrand setting in TopHat, ie. "the right-most end of the fragment (in transcript coordinates) is the first sequenced". For a negative DNA strand transcript (3' to 5') this would denote a fr-secondstrand setting in TopHat.

F1 R2 means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it's simply the inverse for F1 R2 mapping.

Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically RF orientation is reserved for large-insert mate-pair libraries. Other orientations like FF and RR seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference.

Why did the RNA-seq metrics report a large amount of intronic sequence?

http://www.biostars.org/p/42890/

How do I list the BAM location for a model?

Models do not have results of their own--they are merely a container for builds. Thus we need to select a build from the model. "last_succeeded_build" (and "last_complete_build", which is synonymous) give us the most recent build that completed successfully.

"whole_rmdup_bam_file" is something particular to reference alignment builds. Both reference alignment and rna-seq have a merged_alignment_result (which is itself a Genome::InstrumentData::AlignmentResult::Merged). Looking in that module we see that there is a bam_path accessor.

$ genome model list --filter id=abcd1234 --show id,name,last_complete_build.merged_alignment_result.bam_path

For strand specific library construction protocols, which settings do I use for each RNA-seq software tool?

TopHat

http://tophat.cbcb.umd.edu/faq.shtml#library_type

Illumina specifically documents the library type in the Illumina RNA-seq Tophat Analysis (originally downloaded from http://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf): "For the TruSeq RNA Sample Prep Kit, the appropriate library type is 'fr-unstranded'. For TruSeq stranded sample prep kits, the library type is specified as 'fr-firststrand'."

This blog post is also very informative: How to tell which library type to use (fr-firststrand or fr-secondstrand)?

Another suggestion is to load the aligned reads in IGV and look at the read orientation. You can do this by hovering over a read aligned to an exon.

F2 R1 means the second read in the pair aligns to the forward strand and the first read in the pair aligns to the reverse strand. For a positive DNA strand transcript (5' to 3') this would denote a fr-firststrand setting in TopHat, ie. "the right-most end of the fragment (in transcript coordinates) is the first sequenced". For a negative DNA strand transcript (3' to 5') this would denote a fr-secondstrand setting in TopHat.

F1 R2 means the first read in the pair aligns to the forward strand and the second read in the pair aligns to the reverse strand. See above for the complete definitions, but it's simply the inverse for F1 R2 mapping.

Anything other than FR orientation is not covered here and discussion with the individual responsible for library creation would be required. Typically RF orientation is reserved for large-insert mate-pair libraries. Other orientations like FF and RR seem impossible with Illumina sequence technology and suggest structural variation between the sample and reference.

Picard

http://picard.sourceforge.net/command-line-overview.shtml#CollectRnaSeqMetrics

Summary2

In summary, the settings for commonly used RNA-seq library construction kits are:

Library Kit	5' to 3' IGV	Tophat	HTSeq	Picard
TruSeq	Strand Specific Total RNA	F2R1	fr-firststrand	reverse
NuGEN	Encore	F1R2	fr-secondstrand	yes
NuGEN	OvationV2	F2R1 or F1R2	fr-unstranded	no

Home	Install	Docs	Tutorials	FAQ

RnaSeq

in progress

Contents

Process Flow

Description

Alignment

Quality Assurance

Abundance Estimation

references

What is RNA-seq

Overview

Conceptual

Actual

Pipeline

TopHat2

How to Analyze RNA-seq data?

Start an RNA-seq Genome Model

Create an RNA-seq only annotation build

Output Files and Formats

GMS Build Data Directory

Alignments

Alignment Stats

alignment_stats.txt

Metrics

Junctions

summary

BAM QC

*_fastqc

icons

images

Logs

De Novo (optional)

Reference Guided (optional)

Reference Only

expression

Results

underlying_results

per-instrument data result

TopHat

Cufflinks

Frequently Asked Questions

Why did the RNA-seq metrics report a large amount of intronic sequence?

How do I list the BAM location for a model?

For strand specific library construction protocols, which settings do I use for each RNA-seq software tool?

TopHat

Picard

Summary2

Clone this wiki locally