This repository provides code for the paper "Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data" by Lun et al. (2017).
Run `download.sh` to download the count matrices from ArrayExpress (E-MTAB-5522), unpack them and move them to the relevant directories.
This will also download the SDRF file describing the annotation for all samples.
Install `package/` using `R CMD INSTALL --preclean package` (or an equivalent command for your installation).
This was last tested with R version 3.4.* and Bioconductor version 3.6.
Enter `real` and follow these instructions:
- Enter `Calero/trial_20160113/` and run `run_me.sh`. This will produce a Markdown document containing the variance estimates, along with some serialized R objects for further inspection if necessary.
- Repeat for `Calero/trial_20160325`, `Liora/test_20160906` and `Liora/test_20170201`.
- Run `make_pics.R` to reproduce the figures in the paper.
- Enter `depth/` and run `runner.R` to perform the simulations for sampling noise.
- Enter `index_swapping/` and run `check_swap.R` to generate the figures checking for index swapping.
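The four per-batch runs listed above can be scripted. The following is a minimal sketch, not part of the repository: the function name `run_real_batches` is hypothetical, and it assumes that each `run_me.sh` can be invoked with `bash` from inside its own batch directory.

```shell
# Hypothetical helper (not in the repository): run run_me.sh in each batch
# directory under real/. Assumes run_me.sh can be invoked with bash from
# inside the batch directory.
run_real_batches() (
    root="${1:-real}"
    for batch in Calero/trial_20160113 Calero/trial_20160325 \
                 Liora/test_20160906 Liora/test_20170201; do
        echo "Processing ${batch}"
        # Stop at the first batch that fails.
        (cd "${root}/${batch}" && bash run_me.sh) || exit 1
    done
)

# Usage, from the repository root:
#   run_real_batches real
```

The function body runs in a subshell, so the working directory of the calling shell is left unchanged.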
To run the simulations, enter the `simulations` directory:
- Enter `datasets/` and run `download.sh` to download the various data files. Also run `prerunner.R` to pre-process them into serialized R objects.
- Enter `sampling/` and run `resampler.R` to simulate sampling noise. You can also run `sizevar.R` to quantify the variation in size factors across cells.
- Enter `variance/` and run `varsim.R` to perform the simulations for detecting HVGs.
- Enter `diffexp/` and run `desim.R` to perform the simulations for detecting DEGs.
- Enter `clustering/` and run `clustsim.R` to perform the simulations for clustering and PCA. This requires you to obtain `pancreas_refseq_rpkms_counts_3514sc.txt.gz` from the processed files in E-MTAB-5061, along with the corresponding SDRF file.
- Enter `pics/` and run `picmaker.R` and `get_properties.R` to reproduce the figures in the paper.
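The simulation steps above can be chained in the order listed. This is a sketch only: the function name `run_simulations` and the `RSCRIPT` variable are hypothetical, and it assumes that each R script can be run non-interactively with `Rscript` from inside its own directory and that `download.sh` can be run with `bash`. The pancreas file needed by `clustsim.R` still has to be obtained by hand beforehand.

```shell
# Hypothetical helper (not in the repository): run the simulation scripts
# in the order given in the list above.
# Assumptions: R scripts run with Rscript from inside their own directory;
# set RSCRIPT to use a different interpreter command.
run_simulations() (
    root="${1:-simulations}"
    rscript="${RSCRIPT:-Rscript}"

    # Download the data files before any pre-processing.
    (cd "${root}/datasets" && bash download.sh) || exit 1

    # Each entry is "directory:script".
    for step in datasets:prerunner.R sampling:resampler.R sampling:sizevar.R \
                variance:varsim.R diffexp:desim.R clustering:clustsim.R \
                pics:picmaker.R pics:get_properties.R; do
        dir="${step%%:*}"
        script="${step#*:}"
        echo "Running ${dir}/${script}"
        (cd "${root}/${dir}" && "${rscript}" "${script}") || exit 1
    done
)

# Usage, from the repository root:
#   run_simulations simulations
```

Stopping at the first failure keeps later steps from running on incomplete inputs, since each stage depends on the serialized objects produced in `datasets/`.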
The `sequences/biophysical/` directory contains a script to examine the differences in biophysical properties between the spike-in and endogenous mouse genes.
Readers interested in regenerating the count matrices from the FASTQ files are advised to:
- Follow the instructions in `sequences/genomes/README.md` to build the genome indices. Similarly, follow the instructions in `sequences/annotation/README.md` to obtain the annotation.
- Download the FASTQ files from ArrayExpress. Files corresponding to each batch of data should be placed in `Calero/trial_20160113/fastq`, etc.
- Run the various `mapme.sh` scripts to execute the master scripts for alignment, and `count_me.sh` for read counting. Paths should refer to the top-level `tools/` directory obtained using `download.sh`.
Note that the `mapme.sh` scripts in `Liora` need to be modified to retrieve file names from the `fastq` subdirectory.
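For a single batch, the alignment and counting steps could be wrapped as follows. This is a rough sketch under several assumptions that the text above does not confirm: that `mapme.sh` and `count_me.sh` both sit in the batch directory, that they can be run with `bash`, and that the FASTQ files are already in the `fastq` subdirectory of the batch. The function name `regenerate_counts` is hypothetical.

```shell
# Hypothetical helper (not in the repository): regenerate counts for one batch.
# Assumptions: mapme.sh and count_me.sh live in the batch directory and can
# be run with bash; FASTQ files are already in its fastq/ subdirectory.
regenerate_counts() (
    batch="$1"
    if [ ! -d "${batch}/fastq" ]; then
        echo "Missing ${batch}/fastq; download the FASTQ files first" >&2
        exit 1
    fi
    cd "${batch}" || exit 1
    bash mapme.sh || exit 1      # alignment
    bash count_me.sh || exit 1   # read counting
)

# Usage, from the repository root (batch path as given for the real data):
#   regenerate_counts real/Calero/trial_20160113
```

For the `Liora` batches, the `mapme.sh` scripts would still need the modification noted above before this wrapper is useful.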
The `manuscript` directory contains all LaTeX code used to generate the manuscript.
This can be compiled with `make`.