ClustREval

Description

ClustREval is a package for evaluating the performance of different clustering pipelines on scRNA-seq data using unsupervised metrics and gene set enrichment analysis. Clustering is a powerful tool for helping researchers to detect cellular heterogeniety. However, clustering performance is highly dependent on parameters used in the clustering pipeline, for which there are no systematic recommendations. This package allows users to compute clustering results from various clustering pipelines defined by user-specified parameters. Clustering results can then be evaluated using comparing unsupervised clustering metrics and differential gene expression between results.

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Installation

To install the latest version of the package:

require("devtools")
devtools::install_github("cindyfang70/clustREval", build_vignettes = TRUE)
library("clustREval")

Overview

ls("package:clustREval")
data(package = "clustREval")

To run the shiny app, run: runClustREval()

browseVignettes("clustREval")

clustREval contains four functions that aid in the evaluation of clustering performance.

The runPipelineCombs function runs all combinations of user-specified clustering pipelines. The user simply has to define the various parameters to use at each step in the pipeline and provide the data to perform clustering on.

The computeUnsupervisedMetrics function computes the Dunn index and mean silhouette width of a clustering output.

The geneSetEval function performs Gene Set Enrichment Analysis (GSEA) on each of the clusters from a clustering output and returns enrichment scores based on the Hallmark Pathways from MSigDB.

The plotGeneSetEval function plots the enrichment scores from GSEA.

An overview of the package is illustrated below:

Contributions

The author for this package is Xin Zhi Fang.

The runPipelineCombs uses the pipeComp library to run all pipeline combinations, but does not use the same end results. The evaluation step from the pipeComp library is bypassed as it depends on cell type labels (which are not always available).

The fgsea package is used to perform GSEA on the clusters. As well, AnnotationDbi and org.Hs.eg.db are used to help map the gene symbols to names.

scran,Seurat, and scuttle are used in various functions for preprocessing of the scRNA-seq data.

SingleCellExperiment is used in almost all functions to store scRNA-seq data.

cluster and clValid were used to calculate the silhouette score and Dunn index, respectively.

ggplot2 and gridExtra were used for plotting functionality.

labelled, tibble, tidyverse, dplyr, and magrittr were used for data manipulation.

References

Germain, P. L., Sonrel, A., & Robinson, M. D. (2020). pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biology, 21(1). https://doi.org/10.1186/s13059-020-02136-7
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8
Amezquita, R. A., Lun, A. T. L., Becht, E., Carey, V. J., Carpp, L. N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., Waldron, L., Pagès, H., Smith, M. L., Huber, W., Morgan, M., Gottardo, R., & Hicks, S. C. (2019). Orchestrating single-cell analysis with Bioconductor. Nature Methods, 17(2), 137–145. https://doi.org/10.1038/s41592-019-0654-x
Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., Artyomov, M. N., & Sergushichev, A. (2016). Fast gene set enrichment analysis. BioRxiv. Published. https://doi.org/10.1101/060012
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi: 10.1093/bioinformatics/btw777 (URL: https://doi.org/10.1093/bioinformatics/btw777).
Joseph Larmarange (2021). labelled: Manipulating Labelled Data. R package version 2.9.0. https://CRAN.R-project.org/package=labelled
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi: 10.12688/f1000research.9501.2 (URL: https://doi.org/10.12688/f1000research.9501.2).
Hervé Pagès, Marc Carlson, Seth Falcon and Nianhua Li (2020). AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. R package version 1.52.0. https://bioconductor.org/packages/AnnotationDbi
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22. URL https://www.jstatsoft.org/v25/i04/
Hao and Hao et al. Integrated analysis of multimodal single-cell data. Cell (2021)
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2021). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.2.
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr
Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.6. https://CRAN.R-project.org/package=tibble
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1. https://CRAN.R-project.org/package=magrittr
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra
Carlson M (2019). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2.
Yan L, Yang M, Guo H, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nature Structural & Molecular Biology. 2013 Sep;20(9):1131-1139. DOI: 10.1038/nsmb.2660. PMID: 23934149.
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems, 1(6), 417–425. https://doi.org/10.1016/j.cels.2015.12.004

Acknowledgements

This package was developed for BCB410H: Applied Bioinformatics, University of Toronto, Toronto, CANADA, 2019-2021.

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
R		R
data		data
inst		inst
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
clustREval.Rproj		clustREval.Rproj

License

cindyfang70/clustREval

Folders and files

Latest commit

History

Repository files navigation

ClustREval

Description

Installation

Overview

Contributions

References

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Languages