Code to reproduce the analysis seen in McDonald, Wu et al. Cell, 2020 (https://doi.org/10.1016/j.cell.2020.10.018)
The processed Seurat object for the H1 teratomas and the corresponding cell type annotations and metadata can be found on our FTP server
- Seurat: https://satijalab.org/seurat/
- SWNE: https://github.com/yanwu2014/swne
- cellMapper: https://github.com/yanwu2014/cellMapper
- perturbLM: https://github.com/yanwu2014/perturbLM
- Clone this repository into your local directory:
git clone https://github.com/yanwu2014/teratoma-analysis-code.git
- Download the processed data from GEO: GSE156170
- Untar the processed data files, which should create a Counts/ directory and a Reference_Data directory:
tar xvf GSE156170_teratoma_merged_counts.tar.gz
tar xvf GSE156170_external_reference_data.tar.gz
- Run the R scripts within each Figure directory to reproduce the analysis used to create that figure. The scripts must be run in the order given by their numbering. For example, run 01_human_clustering.R before 02_human_cluster_mapping.R in Figure1. Scripts with the same number can be run in any order. The scripts in Figure 1 must be run first, as they generate the clustering results needed for the rest of the analysis.
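The ordering rule above can be sketched in Python. This is a minimal illustration, not part of the repository: the function name `group_scripts_by_step` and the third file name in the example are hypothetical, and it assumes every script name starts with a numeric prefix followed by an underscore.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_scripts_by_step(script_names):
    """Group script names by their leading number, in ascending order.

    Scripts that share a number end up in the same batch, since they
    can be run in any order relative to each other.
    """
    steps = defaultdict(list)
    for name in script_names:
        match = re.match(r"(\d+)_", Path(name).name)
        if match is None:
            continue  # skip files without a numeric prefix
        steps[int(match.group(1))].append(name)
    return [sorted(steps[step]) for step in sorted(steps)]


# Only the first two names appear in this README; the third is a
# hypothetical placeholder used to show a shared step number.
example = [
    "02_human_cluster_mapping.R",
    "01_human_clustering.R",
    "02_example_other_step.R",
]
for batch in group_scripts_by_step(example):
    print(batch)
```

Each printed batch must finish before the next one starts; within a batch the order does not matter.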
Generating the genotype dictionaries in Figure 2 and Figure 4/S4 from the lentiviral barcode/gRNA amplicon Fastq files
The genotype dictionaries map either CRISPR-Cas9 gRNAs or lentiviral barcodes to single cells. We provide the processed genotype dictionaries in both this GitHub repository and the supplementary files at GEO: GSE156170. The genotype dictionary file names end in pheno_dict.csv.gz.
To reproduce the genotype dictionaries for the embryonic lethal screen and replicate screen in Figure 4, download the original gRNA barcode Fastq files and reprocess them using the files in the Genotyping directory. We'll walk through an example using the gRNA fastq files from one of the 10X runs from the embryonic lethal screen, which you can download at GSM4725940:
- Download and install PicardTools if not already installed. These scripts assume the picard.jar file is at $HOME/PicardTools/picard.jar, but you can install PicardTools anywhere as long as you edit the PicardToolsPath parameter in the genotyping scripts.
- Clone https://github.com/yanwu2014/genotyping-matrices into your home directory (or wherever you like, as long as you edit the GenotypingMatricesPath in your genotyping scripts).
- Download the gRNA fastq files from GEO and move them to the Genotyping/Lethal_Screen/ directory.
- Edit the fastqName parameter in the 01_run_genotyping.sh script to match the files that were just downloaded. The full fastq file names are assumed to be [fastqName]_R1_001.fastq.gz and [fastqName]_R2_001.fastq.gz in the rest of the shell script.
- Run 01_run_genotyping.sh, which should generate a genotype dictionary.
- Download the rest of the gRNA fastq files (GSM4725934 - GSM4725939, GSM4725946 - GSM4725951) and change the cellBarcodePath and outputFileName to generate the remaining genotype dictionaries. For example, to generate the genotype dictionary for the embryonic lethal screen teratoma 2 10X replicate 1, set
cellBarcodePath=dm-ter-screen2-1_cell_barcodes.tsv
and
outputFileName=dm-ter-screen2-1_pheno_dict.csv
- To merge the genotype dictionaries, use the 02_merge_pheno_dicts.R script, which has the usage:
Rscript 02_merge_pheno_dicts.R [output_merged_pheno_dict.csv] [input_pheno_dict_1.csv] [input_pheno_dict_2.csv] ...
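The naming conventions used in the steps above can be summarized in a short Python sketch. The helper `genotyping_names` is hypothetical and not part of the repository; it only restates the file name patterns given in this README, and the `fastq_name` value in the example is a placeholder for whatever prefix your downloaded files have.

```python
def genotyping_names(fastq_name, sample_prefix):
    """Return the file names implied by the README's naming conventions.

    fastq_name: value of the fastqName parameter in 01_run_genotyping.sh.
    sample_prefix: sample identifier, e.g. "dm-ter-screen2-1".
    """
    return {
        # Paired fastq files the shell script expects to find.
        "read1": f"{fastq_name}_R1_001.fastq.gz",
        "read2": f"{fastq_name}_R2_001.fastq.gz",
        # Values to set in the script for this sample.
        "cellBarcodePath": f"{sample_prefix}_cell_barcodes.tsv",
        "outputFileName": f"{sample_prefix}_pheno_dict.csv",
    }


# "my_gRNA_run" is a placeholder, not a real file prefix from GEO.
for key, value in genotyping_names("my_gRNA_run", "dm-ter-screen2-1").items():
    print(f"{key}={value}")
```

Checking these names against the files in Genotyping/Lethal_Screen/ before running the shell script can catch a mismatched fastqName early.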
To reproduce the genotype dictionaries for the neural disease screen in Figure S4, download the appropriate gRNA fastq files (GSM4725956 - GSM4725959) from GEO: GSE156170, move them to the Genotyping/Neural_Disease_Screen/ directory instead, and follow the same steps as for the embryonic lethal screen.
To reproduce the lentiviral barcoding genotype dictionaries, again download the appropriate barcode fastq files (GSM4725916 - GSM4725918) from GEO: GSE156170 and move them to the Lentiviral_Barcoding/ directory.
- Download the pre and post teratoma injection gDNA fastq files (GSM4725919 - GSM4725924) from GEO: GSE156170 and move them to the Lentiviral_Barcoding/ directory.
- Run 01_count_bcs.py on each gDNA fastq file with the usage:
python 01_count_bcs.py [gDNA_fastq_file.fastq.gz] [min_barcode_reads] > [output_file]
For the paper we set min_barcode_reads = 1, but feel free to vary the parameter to see how it affects the number of barcodes you recover. By default, the 02_compute_barcode_frac.py script expects barcode counts files named in the format dm-ter-bc[1-3]_[pre/post]_gDNA_bc_counts.txt, so we recommend using that naming convention.
- If you used the naming convention recommended in the previous step, you can simply run
python 02_compute_barcode_frac.py
Otherwise, edit 02_compute_barcode_frac.py so that it matches your barcode counts file names.
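Before running 02_compute_barcode_frac.py, it can help to confirm that the counts files follow the recommended naming convention. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it reads the pattern dm-ter-bc[1-3]_[pre/post]_gDNA_bc_counts.txt as three teratomas with one "pre" and one "post" file each, for six files in total.

```python
from itertools import product
from pathlib import Path


def expected_count_files(directory="."):
    """List the six counts file paths implied by the naming convention."""
    names = [
        f"dm-ter-bc{index}_{stage}_gDNA_bc_counts.txt"
        for index, stage in product(range(1, 4), ("pre", "post"))
    ]
    return [Path(directory) / name for name in names]


def missing_count_files(directory="."):
    """Return the names of expected counts files that do not exist yet."""
    return [p.name for p in expected_count_files(directory) if not p.exists()]


# Report which counts files are still missing from the current directory.
print(missing_count_files("."))
```

An empty list means all six expected files are present under the recommended names.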
- Install CRISPResso https://github.com/lucapinello/CRISPResso
- Download GSE156170_editing_rate_data.tar.gz and extract it.
- Run
python 01_run_editing_analysis.py
which is a wrapper script that runs CRISPResso on all of the amplicon fastq files in the directory.
- Run
python 02_prase_editing_rates.py
which extracts the CRISPResso output and formats it into a single tab-separated output.
- Download the merged hg19/mm10 genome reference from the CellRanger site: https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest
- Download and install CellRanger https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation
- Download the fastq file for the 10X run you want to analyze (e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4725909)
- Run CellRanger using
cellranger count
on the fastq files using default settings (adjusting the local cores and local memory usage as appropriate for your system)
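As a rough sketch of the last step, the Python helper below assembles a `cellranger count` command line. The function name, the default core and memory values, and all paths and sample names in the example are hypothetical placeholders; only the `cellranger count` options themselves (`--id`, `--transcriptome`, `--fastqs`, `--sample`, `--localcores`, `--localmem`) are standard CellRanger arguments. The command is printed rather than executed.

```python
import shlex


def cellranger_count_command(run_id, transcriptome, fastqs, sample,
                             local_cores=8, local_mem_gb=64):
    """Build the argument list for a default-settings cellranger count run.

    local_cores and local_mem_gb (in GB) should be adjusted to your system.
    """
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
        f"--localcores={local_cores}",
        f"--localmem={local_mem_gb}",
    ]


# All values below are placeholders; substitute your own paths and names.
command = cellranger_count_command(
    run_id="example_run",
    transcriptome="/path/to/merged_hg19_mm10_reference",
    fastqs="/path/to/fastq_directory",
    sample="example_sample",
    local_cores=4,
    local_mem_gb=32,
)
print(shlex.join(command))
```

To actually launch the run, the same list could be passed to `subprocess.run(command, check=True)` once CellRanger is on your PATH.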