CustomSelection - package for selection of reference genes from RNAseq data

Citation

dos Santos, K.C.G., Desgagné-Penix, I. & Germain, H. Custom selected reference genes outperform pre-defined reference genes in transcriptomic analysis. BMC Genomics 21, 35 (2020). https://doi.org/10.1186/s12864-019-6426-2

dos Santos, K.C.G., Desgagné-Penix, I. & Germain, H. Correction to: Custom selected reference genes outperform pre-defined reference genes in transcriptomic analysis. BMC Genomics 22, 607 (2021). https://doi.org/10.1186/s12864-021-07743-7

This package calculates the Transcripts per Million (TPM) data frame from the read count matrix, calculates the minimum expressin level for a gene to be considered as expressed in each sample and selects as reference genes those with lowest covariance.

It contains three main functions: Counts_to_tpm, DAFS and gene_selection. The function customReferences merges the three main functions and only returns the data frame with the selected genes (in the row names), their average TPM (column "Mean") and the covariance of their TPM (column "Covariance").

If the you want to keep the tpm data frame and the vector generated from DAFS, run the functions in this specific order:

Counts_to_tpm
DAFS
gene_selection

The package only needs the read count matrix and the length of the genes in the read count matrix. We offer for a test a data frame of read counts for three samples (4 replicates per sample) of Arabidopsis plants expressing GFP (control) or one of two candidate effectors from Melampsora larici-populina (Mlp37347-GFP and Mlp124499-GFP). We also offer the gene lengths from Arabidopsis thaliana TAIR10 obtained from Ensembl with biomaRt.

Count matrix example

Deposited in NCBI GEO under accession GSE136038

Counts_to_tpm

Modified from Slowkow gist

DAFS

Obtained from George and Chang (2014)

Installation

Dependencies

CustomSelection needs the R packages "earth" and "mclust".

R

library(devtools)
install_github("KarenGoncalves/CustomSelection")
library("CustomSelection")

Workflow

Starting with raw reads run:

Quality filtering Trimmomatic
Alignment (HISAT2, TopHat2) or do a De novo assembly (Trinity, Trans-abyss), then align the reads to the assembly
Sort the alignment (samtools sort)
Get the read count for each gene in each sample (proposed workflow in R below)
Run Counts_to_tpm, DAFS and gene_selection OR run customReferences

Getting read counts

Use the package "GenomicFeatures" to construct a transcript database from biomart. Example, for Arabidopsis thaliana:

  library("GenomicFeatures")

  ath <- makeTxDbFromBiomart(biomart = "plants_mart",
                             dataset = "athaliana_eg_gene",
                             transcript_ids = NULL,
                             circ_seqs = DEFAULT_CIRC_SEQS,
                             filter = NULL,
                             id_prefix = "ensembl_",
                             host = "plants.ensembl.org",
                             taxonomyId = 3702,
                             miRBaseBuild = NA)

  tx <- transcriptsBy(ath)

Create a table with the names of the samples as they are in the bam and bai files. Then use the package "Rsamtools" to create the bam file list. Example, for three bam files of three replicates of the control sample (CNT-1, CNT-2, CNT-3):

  samples <- c(paste0("CNT-", 1:3))

  sampleTable <- data.frame(sampleName = samples, 
                            fileNameBam = paste0(samples, ".bam"), 
                            fileNameBai = paste0(samples, ".bai"))

  library("Rsamtools")
  bfl <- BamFileList(file.path(sampleTable$fileNameBam), 
                     file.path(sampleTable$fileNameBai), 
                     yieldSize = 20000)

Use the package "GenomicAlignments" to count the number of reads aligned to each gene in the transcript database. Then, create a data frame of read counts per sample. Example for paired reads:

  library("GenomicAlignments")
  txMat=list()

  for (i in names(bfl)){
     txMat[[i]] <- assays(summarizeOverlaps(
                     features = tx, 
                     reads = bfl[[i]], 
                     mode = Union, 
                     singleEnd = F, 
                     ignore.strand = F, 
                     fragments = T))$counts
                       
     write.csv(x = as.data.frame(txMat[[i]]), 
               file = paste(gsub(".bam", "" , i), ".csv", sep = ""))
            }

  counts <- do.call(cbind, txMat)

  colnames(counts) <- samples

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
.github/workflows		.github/workflows
R		R
data		data
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.Rhistory		.Rhistory
.gitignore		.gitignore
CustomSelection.Rproj		CustomSelection.Rproj
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md

KarenGoncalves/CustomSelection

Folders and files

Latest commit

History

Repository files navigation

CustomSelection - package for selection of reference genes from RNAseq data

Citation

Count matrix example

Counts_to_tpm

DAFS

Installation

Dependencies

R

Workflow

Getting read counts

About

Topics

Resources

Stars

Watchers

Forks

Languages