singleCellNet

Table of content

Introduction
Data
Train SCN claissfier
Assess SCN claissfier with heldout data
Query
Visualization
Train cross-species SCN classifier
Query for cross-species data
Assess SCN claissfier with external dataset
More detailed visualization examples
Explore important celltype-specific top-pairs
SCN score calibration
Loom integration
Seurat integration
SCE integration
Available training datasets

Introduction

SingleCellNet enables the classifcation of single cell RNA-Seq data across species and platforms. See our recent publication for more details. Additionally, we have a vignette to guide you through the steps as well.

Here, we illustrate ...

how to build and assess single cell classifiers
how to build and assess cross-species single cell classifiers
how to use these classifiers to quantify 'cell identity' from query scRNA-Seq data

If you want to use the bulk RNA-Seq version of CellNet, go to bulk CellNet.

Our singleCellNet is available on Python pySCN which is Scanpy and AnnData compatible.

Data

In this example, we use a subset of the Tabula Muris data to train singleCellNet. To learn more about the Tabula Muris project, see the manuscript. As query data, we use scRNA-Seq of kidney cells as reported in Park et al 2018. We also provide an example of classifying human, bead enriched PBMCs (from https://www.ncbi.nlm.nih.gov/pubmed/28091601). You can download this data here:

APPLICATION	METADATA	EXPRESSION
Query	metadata	expression data
Training	metadata	expression data
cross-species	human-mouse orthologs
cross-species	metadata	expression data

*more training datasets (metadata and expression data) are provided at the bottom of the page.

Training

Setup

install.packages("devtools")
devtools::install_github("pcahan1/singleCellNet")
library(singleCellNet)

Optional set up if you are working with loom files

devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
library(loomR)

Fetch the data if you have not already done so

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/sampTab_Park_MouseKidney_062118.rda", "sampTab_Park_MouseKidney_062118.rda")

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/expMatrix_Park_MouseKidney_Oct_12_2018.rda", "expMatrix_Park_MouseKidney_Oct_12_2018.rda")

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/expMatrix_TM_Raw_Oct_12_2018.rda", "expMatrix_TM_Raw_Oct_12_2018.rda")

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/sampTab_TM_053018.rda", "sampTab_TM_053018.rda")

## For cross-species analyis:
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/human_mouse_genes_Jul_24_2018.rda", "human_mouse_genes_Jul_24_2018.rda")

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/6k_beadpurfied_raw.rda", "6k_beadpurfied_raw.rda")

download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/stDat_beads_mar22.rda", "stDat_beads_mar22.rda")

## To demonstrate how to integrate loom files to SCN
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/pbmc_6k.loom", "pbmc_6k.loom")

Load query data

stPark = utils_loadObject("sampTab_Park_MouseKidney_062118.rda")
expPark = utils_loadObject("expMatrix_Park_MouseKidney_Oct_12_2018.rda")
dim(expPark)
[1] 16272 43745

genesPark = rownames(expPark)

rm(expPark)
gc()

Load the training data

expTMraw = utils_loadObject("expMatrix_TM_Raw_Oct_12_2018.rda")
dim(expTMraw)
[1] 23433 24936

stTM = utils_loadObject("sampTab_TM_053018.rda")
dim(stTM)
[1] 24936    17

stTM<-droplevels(stTM)

Find genes in common to the data sets and limit analysis to these

commonGenes = intersect(rownames(expTMraw), genesPark)
length(commonGenes)
[1] 13831

expTMraw = expTMraw[commonGenes,]

Split for training and assessment, and transform training data

set.seed(100) #can be any random seed number
stList = splitCommon(sampTab=stTM, ncells=100, dLevel="newAnn")
stTrain = stList[[1]]
expTrain = expTMraw[,rownames(stTrain)]

Train the classifier

- If you increase nTopGenes and nTopGenePairs, you may get a even better classifier performance on query data!

system.time(class_info<-scn_train(stTrain = stTrain, expTrain = expTrain, nTopGenes = 10, nRand = 70, nTrees = 1000, nTopGenePairs = 25, dLevel = "newAnn", colName_samp = "cell"))
   user  system elapsed 
 476.839  25.809 503.351

Assessing the classifier with heldout data

Apply to held out data

#validate data
stTestList = splitCommon(sampTab=stList[[2]], ncells=100, dLevel="newAnn") #normalize validation data so that the assessment is as fair as possible
stTest = stTestList[[1]]
expTest = expTMraw[commonGenes,rownames(stTest)]

#predict
classRes_val_all = scn_predict(cnProc=class_info[['cnProc']], expDat=expTest, nrand = 50)

Assess classifier

tm_heldoutassessment = assess_comm(ct_scores = classRes_val_all, stTrain = stTrain, stQuery = stTest, dLevelSID = "cell", classTrain = "newAnn", classQuery = "newAnn", nRand = 50)

plot_PRs(tm_heldoutassessment)

plot_metrics(tm_heldoutassessment)

Classification result heatmap

#Create a name vector label used later in classification heatmap where the values are cell types/ clusters and names are the sample names
 
nrand = 50
sla = as.vector(stTest$newAnn)
names(sla) = as.vector(stTest$cell)
slaRand = rep("rand", nrand) 
names(slaRand) = paste("rand_", 1:nrand, sep='')
sla = append(sla, slaRand) #include in the random cells profile created

sc_hmClass(classMat = classRes_val_all,grps = sla, max=300, isBig=TRUE)

Attribution plot

plot_attr(classRes=classRes_val_all, sampTab=stTest, nrand=nrand, dLevel="newAnn", sid="cell")

Viusalize average top pairs genes expression for training data

gpTab = compareGenePairs(query_exp = expTest, training_exp = expTrain, training_st = stTrain, classCol = "newAnn", sampleCol = "cell", RF_classifier = class_info$cnProc$classifier, numPairs = 20, trainingOnly= TRUE)

train = findAvgLabel(gpTab = gpTab, stTrain = stTrain, dLevel = "newAnn")

hm_gpa_sel(gpTab, genes = class_info$cnProc$xpairs, grps = train, maxPerGrp = 50)

Query

Apply to Park et al query data

expPark = utils_loadObject("expMatrix_Park_MouseKidney_Oct_12_2018.rda") 
  
nqRand = 50
system.time(crParkall<-scn_predict(class_info[['cnProc']], expPark, nrand=nqRand))
   user  system elapsed 
 89.633   5.010  95.041

Visualization

sgrp = as.vector(stPark$description1)
names(sgrp) = as.vector(stPark$sample_name)
grpRand =rep("rand", nqRand)
names(grpRand) = paste("rand_", 1:nqRand, sep='')
sgrp = append(sgrp, grpRand)

# heatmap classification result
sc_hmClass(crParkall, sgrp, max=5000, isBig=TRUE, cCol=F, font=8)

Classification annotation assignment

# This classifies a cell with  the catgory with the highest classification score or higher than a classification score threshold of your choosing.
# The annotation result can be found in a column named category in the query sample table.

stPark <- get_cate(classRes = crParkall, sampTab = stPark, dLevel = "description1", sid = "sample_name", nrand = nqRand)

Classification result violin plot

sc_violinClass(sampTab = stPark, classRes = crParkall, sid = "sample_name", dLevel = "description1", addRand = nqRand)

Skyline plot of classification results

library(viridis)
stKid2 = addRandToSampTab(crParkall, stPark, "description1", "sample_name")
skylineClass(crParkall, "T cell", stKid2, "description1",.25, "sample_name")

Cross-species classification

Load the mouse training and human query data

stQuery = utils_loadObject("stDat_beads_mar22.rda")
expQuery = utils_loadObject("6k_beadpurfied_raw.rda") # use Matrix if RAM low
dim(expQuery)
[1] 32643  6000

stTM = utils_loadObject("sampTab_TM_053018.rda")
expTMraw = utils_loadObject("expMatrix_TM_Raw_Oct_12_2018.rda") # reload training

Load the ortholog table and convert human gene names to mouse ortholog names, and limit analysis to genes in common between the training and query data.

oTab = utils_loadObject("human_mouse_genes_Jul_24_2018.rda")
dim(oTab)
[1] 16688     3

aa = csRenameOrth(expQuery, expTMraw, oTab)
expQueryOrth = aa[['expQuery']]
expTrainOrth = aa[['expTrain']]

Limit anlaysis to a subset of the TM cell types

cts = c("B cell",  "cardiac muscle cell", "endothelial cell", "erythroblast", "granulocyte", "hematopoietic precursor cell", "late pro-B cell", "limb_mesenchymal", "macrophage", "mammary_basal_cell", "monocyte", "natural killer cell", "T cell", "trachea_epithelial", "trachea_mesenchymal")

stTM2 = filter(stTM, newAnn %in% cts)
stTM2 = droplevels(stTM2)
rownames(stTM2) = as.vector(stTM2$cell) # filter strips rownames

expTMraw2 = expTrainOrth[,rownames(stTM2)]
dim(expTMraw2)
[1] 14550 15161

Train Classifier

stList = splitCommon(stTM2, ncells=100, dLevel="newAnn")
stTrain = stList[[1]]
expTrain = expTMraw2[,rownames(stTrain)]

system.time(class_info2<-scn_train(stTrain = stTrain, expTrain = expTrain, nTopGenes = 10, nRand = 70, nTrees = 1000, nTopGenePairs = 25, dLevel = "newAnn", colName_samp = "cell"))
   user  system elapsed 
 41.029   6.747  47.963

Apply to held out data

#validate data
stTestList = splitCommon(stList[[2]], ncells=100, dLevel="newAnn") 
stTest = stTestList[[1]]
expTest = expTMraw2[,rownames(stTest)]

#predict
system.time(classRes_val_all2 <- scn_predict(class_info2[['cnProc']], expTest, nrand = 50))
   user  system elapsed 
  0.691   0.032   0.724

Assess classifier

tm_heldoutassessment = assess_comm(ct_scores = classRes_val_all2, stTrain = stTrain, stQuery = stTest, dLevelSID = "cell", classTrain = "newAnn", classQuery = "newAnn", nRand = 50)

plot_PRs(tm_heldoutassessment)

plot_metrics(tm_heldoutassessment)

Classification result heatmap

nrand=50
sla = as.vector(stTest$newAnn)
names(sla) = as.vector(stTest$cell)
slaRand = rep("rand", nrand)
names(slaRand) = paste("rand_", 1:nrand, sep='')
sla = append(sla, slaRand)

# heatmap classification result
sc_hmClass(classRes_val_all2, sla, max=300, font=7, isBig=TRUE)

Attribute plot

plot_attr(classRes_val_all2, stTest, nrand=nrand, dLevel="newAnn", sid="cell")

Apply to human query data

stQuery$description = as.character(stQuery$description)
stQuery[which(stQuery$description == "NK cell"), "description"] = "natural killer cell"

nqRand = 50
system.time(crHS <- scn_predict(class_info2[['cnProc']], expQueryOrth, nrand=nqRand))
   user  system elapsed 
  3.566   0.548   4.166

Assess classifier with external dataset

tm_pbmc_assessment = assess_comm(ct_scores = crHS, stTrain = stTrain, stQuery = stQuery, classTrain = "newAnn",classQuery="description",dLevelSID="sample_name")
plot_PRs(tm_pbmc_assessment)

plot_metrics(tm_pbmc_assessment)

More visualization

Classification result heatmap

sgrp = as.vector(stQuery$prefix)
names(sgrp) = as.vector(stQuery$sample_name)
grpRand = rep("rand", nqRand)
names(grpRand) = paste("rand_", 1:nqRand, sep='')
sgrp = append(sgrp, grpRand)

sc_hmClass(crHS, sgrp, max=5000, isBig=TRUE, cCol=F, font=8)

Note that the macrophage category seems to be promiscuous in the mouse held out data, too.

Classification violin plot

sc_violinClass(sampTab = stQuery, classRes = crHS, sid = "sample_name", dLevel = "description")

Classification violin plot with adjusted width

sc_violinClass(sampTab = stQuery,classRes = crHS, sid = "sample_name", dLevel = "description", ncol = 12)

Classification violin plot with selected cluster

sc_violinClass(stQuery, crHS, sid = "sample_name", dLevel = "description", ncol = 12, sub_cluster = "B cell")

Attribution plot

plot_attr(crHS, stQuery, nrand=nqRand, sid="sample_name", dLevel="description")

Attribution plot with subcluster focus

plot_attr(sampTab = stQuery, classRes = crHS, sid = "sample_name", dLevel = "description", nrand = 50, sub_cluster = c("B cell", "T cell"))

UMAP by category

system.time(umPrep_HS<-prep_umap_class(crHS, stQuery, nrand=nqRand, dLevel="description", sid="sample_name", topPC=5))
  user  system elapsed 
 25.703   0.740  26.450 
plot_umap(umPrep_HS)

Heatmap top pairs genes for training sample average

system.time(gpTab2 <- compareGenePairs(query_exp = expQueryOrth, training_exp = expTrainOrth, training_st = stTrain, classCol = "newAnn", sampleCol = "cell", RF_classifier = class_info2$cnProc$classifier, numPairs = 20, trainingOnly = FALSE))
   user  system elapsed 
 84.130   0.677  84.826

sgrp = as.vector(stQuery$prefix)
names(sgrp) = rownames(stQuery)
train2 = findAvgLabel(gpTab2, stTrain = stTrain, dLevel = "newAnn")
sgrp = append(sgrp, train2)

hm_gpa_sel(gpTab2, genes = class_info2$cnProc$xpairs, grps = sgrp, maxPerGrp = 5)

How to calibrate/make sense of a given SCN score

#this function aims to give you a sense of how precise/sensitive SCN is with the assigned score of a given cell type for a cell

#tm_assess_matrix = tm_heldoutassessment$nonNA_PR

#tm_assess_matrix is a held_out assessment metric extracted from tm_heldoutassessment, which is already stored in SCN.
#e_assess_matrix is also provided for a gastrulation SCN classifier 

score = 0.6
celltype = "B cell"

calibration = scn_calibration(score = score, celltype = celltype, matrix=tm_assess_matrix)
#[1] "SCN score of 0.6 for cell type B cell has precision of 0.979 ~ 0.979 and sensitivity of 0.93 ~ 0.93"

calibration

#$score
#[1] 0.6

#$celltype
#[1] "B cell"

#$precision
#[1] 0.979 0.979

#$recall
#[1] 0.93 0.93

How to integrate loom file to SCN

lfile = loadLoomExpCluster("pbmc_6k.loom", cellNameCol = "obs_names", xname = "description")
stQuery = lfile$sampTab
dim(stQuery)
[1] 6000    2

expQuery = lfile$expDat
dim(expQuery)
[1] 32643  6000

#With this you can rerun the cross-species analysis and follow the exact same steps

Integrate Seurat object to SCN analysis

#exp_type options can be: counts, normcounts, and logcounts, if they are available in your sce object
seuratfile = extractSeurat(seurat_object, exp_slot_name = "counts")
sampTab = seuratfile$sampTab
expDat = seuratfile$expDat

Integrate SCE object to SCN analysis

#exp_type options can be: counts, data, and scale.data if they are available in your sce object
scefile = extractSCE(sce_object, exp_slot_name = "counts") 
sampTab = scefile$sampTab
expDat = scefile$expDat

More training data for your own analysis

study	species	organ/tissue	seq method	data
Baron	mouse	pancreas	inDrop	data
Baron	human	pancreas	inDrop	data
Murano*	human	pancreas	Cel-Seq2	data
Segerstolp	human	pancreas	Smart-Seq	data
Park	human	kidney	10x	data
Haber	mouse	intestine	Smart-Seq2	data
TM10x	mouse	atlas subset	10x	data
TM10x	mouse	atlas	10x	data
TMfacs	mouse	atlas subset	Smart-Seq	data
TMfacs	mouse	atlas	Smart-Seq	data
MWS	mouse	atlas	microwell-seq	data
Zeisel	mouse	barin altas	10x	data
Loo	mouse	cortex(e14.5)	Dropseq	data
Darmanis	human	cortex	C1	data
Gokce*	human	striatum	C1 and Smart-Seq2	data

*the expresion data is log-transformed.

Name		Name	Last commit message	Last commit date
Latest commit History 343 Commits
R		R
data		data
img		img
man		man
md_img		md_img
.DS_Store		.DS_Store
.Rapp.history		.Rapp.history
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
singleCellNet.Rproj		singleCellNet.Rproj

License

CahanLab/singleCellNet

Folders and files

Latest commit

History

Repository files navigation

singleCellNet

Table of content

Introduction

Data

Training

Setup

Optional set up if you are working with loom files

Fetch the data if you have not already done so

Load query data

Load the training data

Find genes in common to the data sets and limit analysis to these

Split for training and assessment, and transform training data

Train the classifier

Assessing the classifier with heldout data

Apply to held out data

Assess classifier

Classification result heatmap

Attribution plot

Viusalize average top pairs genes expression for training data

Query

Apply to Park et al query data

Visualization

Classification annotation assignment

Classification result violin plot

Skyline plot of classification results

Cross-species classification

Load the mouse training and human query data

Load the ortholog table and convert human gene names to mouse ortholog names, and limit analysis to genes in common between the training and query data.

Limit anlaysis to a subset of the TM cell types

Train Classifier

Apply to held out data

Assess classifier

Classification result heatmap

Attribute plot

Apply to human query data

Assess classifier with external dataset

More visualization

Classification result heatmap

Classification violin plot

Classification violin plot with adjusted width

Classification violin plot with selected cluster

Attribution plot

Attribution plot with subcluster focus

UMAP by category

Heatmap top pairs genes for training sample average

How to calibrate/make sense of a given SCN score

How to integrate loom file to SCN

Integrate Seurat object to SCN analysis

Integrate SCE object to SCN analysis

More training data for your own analysis

About

Resources

License

Stars

Watchers

Forks

Languages