decodebiology/rpkm_rnaseq_count
Update 2020/10/24: You can also use a simplified and faster version of the normalization script from here.

RPKM_normalization

RPKM for RNAseq V1.3

Usage for sample input provided:

perl rpkm_script_beta.pl sample_count_test.count 2:9 28 > sample_count_test.rpkm

Description

In the above example, the 'sample_count_test.count' file has count data in columns 2 through 9;
column 28 holds the length of each gene, calculated from the Gencode GTF (see NOTE below).

General usage:

perl rpkm_script_beta.pl input_count_file.txt ActualColumnStart:ActualColumnEnd ColumnGeneLength > OUTPUT_RPKM_FILE 

ActualColumnStart = First column that contains counts. For example, if GeneID is in the first column and counts start in the second column, this should be '2'

ActualColumnEnd = Last column for which you need RPKM

ColumnGeneLength = Column that holds the length of each gene (**NOTE below)
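
To make the three arguments concrete, here is a rough sketch of how a `START:END` range and a length column could be applied to a tab-separated table. It is a hypothetical awk stand-in written for illustration, not the Perl script itself: the function name `rpkm_table`, the toy data, and the choice to use each column's sum as the total mapped reads are all assumptions. It uses the usual "per kilobase, per million" scaling, which is a factor of 10^9 when the gene length is given in base pairs.

```shell
#!/bin/sh
# Hypothetical stand-in for illustration only; NOT the author's Perl script.
# Usage: rpkm_table FILE START:END LENGTH_COLUMN
rpkm_table() {
    file=$1
    range=$2
    len_col=$3
    start=${range%%:*}   # text before the colon
    end=${range##*:}     # text after the colon
    # Pass 1 sums each count column (ASSUMPTION: the column sum stands in for
    # N, the total mapped reads of that sample).
    # Pass 2 prints the gene ID followed by one RPKM value per count column.
    awk -F '\t' -v s="$start" -v e="$end" -v lc="$len_col" '
        NR == FNR { for (i = s; i <= e; i++) total[i] += $i; next }
        {
            printf "%s", $1
            for (i = s; i <= e; i++)
                printf "\t%.2f", (1e9 * $i) / (total[i] * $lc)
            printf "\n"
        }
    ' "$file" "$file"
}

# Toy table (made-up values): GeneID, two count columns, gene length in bp
toy=$(mktemp)
printf 'ENSG01\t10\t20\t1000\nENSG02\t30\t60\t2500\n' > "$toy"
rpkm_table "$toy" 2:3 4
rm -f "$toy"
```

Here counts occupy columns 2 to 3 and the gene length sits in column 4, mirroring the `2:9 28` layout of the sample input. The sketch does not guard against a zero column total or a zero gene length.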

**NOTE: Steps to prepare your input

  1. Obtain the length of each gene from the Gencode GTF with the following command (successfully tested up to Gencode V19):
     cat gencode.vXX.annotation.gtf | awk -F'\t' '{if($3=="gene") {split($9,a,";"); print a[1]"\t"$5-$4};}' | sed 's/gene_id //; s/"//g' > YOUR_GENE_LENGTH_FILE
  2. Combine input_count_file.txt and YOUR_GENE_LENGTH_FILE by GeneID (the first column):
     join -j1  <(sort input_count_file.txt) <(sort YOUR_GENE_LENGTH_FILE) > OUTPUT_ANNOTATED_COUNT_FILE
  3. Run the script over OUTPUT_ANNOTATED_COUNT_FILE:
     perl rpkm_script_beta.pl OUTPUT_ANNOTATED_COUNT_FILE ActualColumnStart:ActualColumnEnd ColumnGeneLength > OUTPUT_ANNOTATED_RPKM_FILE
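
The combine step above can be tried on a small made-up example before running it on real data. The sketch below is a minimal illustration that assumes both files are tab-separated and have no header line. The function name `annotate_counts`, the gene IDs, and the numbers are hypothetical. Unlike the plain `join` call, it keeps tabs as the output delimiter and uses temporary files instead of process substitution so that it also runs under a plain POSIX shell.

```shell
#!/bin/sh
# Join a count table with a gene-length table on the first column (GeneID).
# ASSUMPTION: both inputs are tab-separated and have no header line.
annotate_counts() {
    counts_file=$1
    lengths_file=$2
    tab=$(printf '\t')
    sorted_counts=$(mktemp)
    sorted_lengths=$(mktemp)
    # join needs both inputs sorted on the join field in the same collation order
    LC_ALL=C sort -t "$tab" -k1,1 "$counts_file" > "$sorted_counts"
    LC_ALL=C sort -t "$tab" -k1,1 "$lengths_file" > "$sorted_lengths"
    LC_ALL=C join -t "$tab" -1 1 -2 1 "$sorted_counts" "$sorted_lengths"
    rm -f "$sorted_counts" "$sorted_lengths"
}

# Toy data (made up for illustration): two samples, two genes
workdir=$(mktemp -d)
printf 'ENSG02\t30\t60\nENSG01\t10\t20\n' > "$workdir/counts.txt"
printf 'ENSG01\t1000\nENSG02\t2500\n' > "$workdir/lengths.txt"
annotate_counts "$workdir/counts.txt" "$workdir/lengths.txt"
rm -rf "$workdir"
```

In the joined toy table the counts are in columns 2 to 3 and the gene length lands in column 4, so the matching script arguments would be `2:3` and `4`.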

RPKM calculation

RPKM = (10^9 * C)/(N * L), where

C = Number of reads mapped to a gene

N = Total mapped reads in the experiment

L = Length of the gene in base pairs

(Equivalently, RPKM = (10^6 * C)/(N * L) when L is expressed in kilobases.)
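
As a quick worked example of the calculation, the sketch below evaluates the formula for a single gene. The helper name `rpkm` and the input numbers are made up for illustration. It takes the gene length in base pairs, so the "per kilobase, per million reads" scaling amounts to a factor of 10^9.

```shell
#!/bin/sh
# rpkm C N L
#   C = reads mapped to the gene
#   N = total mapped reads in the experiment
#   L = gene length in base pairs
rpkm() {
    awk -v c="$1" -v n="$2" -v l="$3" \
        'BEGIN { printf "%.2f\n", (1e9 * c) / (n * l) }'
}

# Made-up numbers: 500 reads on a 2,000 bp gene, 20 million mapped reads in total
rpkm 500 20000000 2000   # prints 12.50
```

The same result follows by hand: 500 reads over 2 kb is 250 reads per kilobase, and dividing by 20 (million mapped reads) gives 12.5.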

Author: Santhilal Subhash
