pySingleCellNet

Introduction

SingleCellNet (SCN) is a tool to perform 'cell typing', or classification of single cell RNA-Seq data. Two nice features of SCN are that it works (1) across species and (2) across platforms. See the original paper for more details. This repository contains the Python version of SCN. The original code was written in R.

Prerequisites

pip install pandas numpy sklearn scanpy sklearn statsmodels scipy matplotlib seaborn umap-learn

Installation

!pip install git+https://github.com/pcahan1/PySingleCellNet/

Summary

Below is a brief tutorial that shows you how to use SCN. In this example, we train a classifier based on mouse lung cells, we assess the performance of the classifier on held out data, then we apply the classifier to analyze indepdendent mouse lung data.

Training data

SCN has to be trainined on well-annotated reference data. In this example, we use data generated as part of the Tabula Muris (Senis) project. Specifically, we use the droplet lung data. We have compiled several other training data sets as listed below.

Lung training data

Query data

To illustrate how you might use SCN to perform cell tying, we apply it to another dataset from mouse lung:

Angelidis I, Simon LM, Fernandez IE, Strunz M et al. An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics. Nat Commun 2019 Feb 27;10(1):963. PMID: 30814501

Query expression data <- You will need to decompress this file prior to loading it.

Query meta-data

Query gene list

Initialize session

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns 
import scanpy as sc
import scipy as sp
import numpy as np
# Loompy is only needed if using loom files
# import loompy 
import anndata

sc.settings.verbosity = 3 
sc.logging.print_header()

import pySingleCellNet as pySCN

Load the training data.

You should always start with the raw counts.

adTrain = sc.read("adLung_TabSen_100920.h5ad")
adTrain
# AnnData object with n_obs × n_vars = 14813 × 21969 ...

Load the query data.

You should always start with the raw counts. If your expression is stored as a numpy array, you can convert it with the check_adX(adata) function.

qDatT = sc.read_mtx("GSE124872_raw_counts_single_cell.mtx")
qDat = qDatT.T

genes = pd.read_csv("genes.csv")
qDat.var_names = genes.x

qMeta = pd.read_csv("GSE124872_Angelidis_2018_metadata.csv")
qMeta.columns.values[0] = "cellid"

qMeta.index = qMeta["cellid"]
qDat.obs = qMeta.copy()

# If your expression data is stored as a numpy array, convert it 
# type(qDat.X)
# <class 'numpy.ndarray'>
# pySCN.check_adX(qDat)

Find common genes

When you train the classifier, you should ensure that the query data and the reference data are limited to a common set of genes. In this case, we also limit the query data to those cells with at least 500 genes.

genesTrain = adTrain.var_names
genesQuery = qDat.var_names

cgenes = genesTrain.intersection(genesQuery)
len(cgenes)
# 16543

adTrain1 = adTrain[:,cgenes]
adQuery = qDat[:,cgenes].copy()
adQuery = adQuery[adQuery.obs["nGene"]>=500,:].copy()
adQuery
# AnnData object with n_obs × n_vars = 4240 × 16543

Split the reference data into training data and data held out for later assessment.

Ideally, we would assess performance on an indepdendent data. The dLevel parameter indicates the label used to group cells into categories or classes. Set this argument as appropriate for your training data.

expTrain, expVal = pySCN.splitCommonAnnData(adTrain1, ncells=200,dLevel="cell_ontology_class")

Train the classifier.

This can take several minutes.

[cgenesA, xpairs, tspRF] = pySCN.scn_train(expTrain, nTopGenes = 100, nRand = 100, nTrees = 1000 ,nTopGenePairs = 100, dLevel = "cell_ontology_class", stratify=True, limitToHVG=True)

Classify the held-out data and visualize.

Rows indicate class labels as defined in the dLevel argument. Columns represent cells, which are grouped by the class with the maximum score.

adVal = pySCN.scn_classify(expVal, cgenesA, xpairs, tspRF, nrand = 0)

ax = sc.pl.heatmap(adVal, adVal.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)

Determine how well the classifier predicts cell type of held out data and plot

The assessment object holds other evaluation metrics including multiLogLoss, Kappa, and accuracy.

assessment =  pySCN.assess_comm(expTrain, adVal, resolution = 0.005, nRand = 0, dLevelSID = "cell", classTrain = "cell_ontology_class", classQuery = "cell_ontology_class")

pySCN.plot_PRs(assessment)
plt.show()

Classify the independent query data and visualize the results.

The heatmap groups the cells according to the cell type with the maximum SCN classification score. Cells in the 'rand' SCN_class or category have a higher SCN score in the 'random' SCN_class than any cell type from the training data.

adQlung = pySCN.scn_classify(adQuery, cgenesA, xpairs, tspRF, nrand = 0)

ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)

Visualize again.

Now group cells according to the annotation provided by the associated study.

ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='celltype', cmap='viridis', dendrogram=False, swap_axes=True)

Add the classification result to the query annData object

We add the SCN scores as well as the softmax classification (i.e. a label corresponding to the cell type with the maximum SCN score -- this goes in adata.obs["SCN_class"]).

pySCN.add_classRes(adQuery, adQlung)

Now, you can run your typical Scanpy pipeline to find cell clusters.

Normalization and PCA

adM1Norm = adQuery.copy()
sc.pp.filter_genes(adM1Norm, min_cells=5)
sc.pp.normalize_per_cell(adM1Norm, counts_per_cell_after=1e4)
sc.pp.log1p(adM1Norm)

sc.pp.highly_variable_genes(adM1Norm, min_mean=0.0125, max_mean=4, min_disp=0.5)

adM1Norm.raw = adM1Norm
sc.pp.scale(adM1Norm, max_value=10)
sc.tl.pca(adM1Norm, n_comps=100)

sc.pl.pca_variance_ratio(adM1Norm, 100)

KNN, Leiden, UMAP

npcs = 20
sc.pp.neighbors(adM1Norm, n_neighbors=10, n_pcs=npcs)
sc.tl.leiden(adM1Norm,.1)
sc.tl.umap(adM1Norm, .5)
sc.pl.umap(adM1Norm, color=["leiden", "SCN_class"], alpha=.9, s=15, legend_loc='on data')

To plot a heatmap with the clustering information, you need to add this annotation to the annData object that is returned from scn_classify()

adQlung.obs['leiden'] = adM1Norm.obs['leiden'].copy()
adQlung.uns['leiden_colors'] = adM1Norm.uns['leiden_colors']
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='leiden', cmap='viridis', dendrogram=False, swap_axes=True)

You can also overlay SCN score on UMAP embedding.

sc.pl.umap(adM1Norm, color=["epithelial cell", "stromal cell", "B cell"], alpha=.9, s=15, legend_loc='on data', wspace=.3)

Cross-species classification

Cross-species classification depends on an ortholog table. Mouse-to-human ortholog table

To use this, you need to convert query gene symbols to the ortholog names of the species of the training data

oTab = pd.read_csv("oTab.csv")
[adQuery,adTrain] = pySCN.csRenameOrth(adQuery, adTrain, oTab)

Then you can proceed with the same training and analysis steps as above, starting with the call to splitCommonAnnData.

Name		Name	Last commit message	Last commit date
Latest commit History 100 Commits
inProgressFunctions		inProgressFunctions
md_img		md_img
pySingleCellNet		pySingleCellNet
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

License

RainHan2001/PySingleCellNet

Folders and files

Latest commit

History

Repository files navigation