
Iso Seq Single Cell Analysis: Recommended Analysis Guidelines


Last Updated: 08/23/2022

MAJOR UPDATE!!! As of 08.23.2022, the official Iso-Seq single-cell tools at isoseq.how represent the latest and recommended single-cell Iso-Seq analysis pipeline. Please refer to isoseq.how for the latest guidelines.

(Workflow diagram)


  1. Generate CCS Reads
  2. Read segmentation with Skera
  3. Remove cDNA primers, polyA tails, and extract UMI/BCs
  4. Barcode correction and UMI deduplication
  5. Align to Genome
  6. Collapse into Unique Transcripts
  7. Compare Against Annotation
  8. Filter Artifacts
  9. Process into CSV Report and (optional) UMI/BC Error Correction
  10. Producing a Seurat-compatible input from SQANTI3 isoforms
  11. Advanced: Linking molecule info to cell type, running IsoPhase, tagging BAMs
  12. Advanced: Calling variants using bcftools


Prerequisites

If you have SMRT Analysis installed, you already have isoseq3 (v3.7.0 and up is required). Otherwise, I highly recommend using Anaconda to install isoseq3 through pbbioconda.

The Cupcake scripts for single cell are all in the cDNA_Cupcake/singlecell subdirectory and do not require installation. You can execute the scripts directly.

Following is an example of installing BioPython and isoseq3 through BioConda, then downloading cDNA_Cupcake:

# First, launch Conda environment 
$ source activate anaCogent
(anaCogent)-bash-4.1$

# Install BioPython
(anaCogent)-bash-4.1$ conda install biopython

# Install bamtools and isoseq3
(anaCogent)-bash-4.1$ conda install -c bioconda bamtools    
(anaCogent)-bash-4.1$ conda install -c bioconda isoseq3      

# Download cDNA_Cupcake from GitHub
(anaCogent)-bash-4.1$ git clone https://github.com/Magdoll/cDNA_Cupcake.git

# Installation of cDNA_Cupcake is optional, 
# only if you are doing further processing after mapping to genome
(anaCogent)-bash-4.1$ cd cDNA_Cupcake
(anaCogent)-bash-4.1$ python setup.py build
(anaCogent)-bash-4.1$ python setup.py install
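
To confirm that the installed tools meet the version requirement (isoseq3 v3.7.0 and up), a quick check of the versions:

# Check tool versions (isoseq3 should be v3.7.0 or later)
(anaCogent)-bash-4.1$ isoseq3 --version
(anaCogent)-bash-4.1$ lima --version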

Tutorial

Please first read the official isoseq3 UMI/BC deduplication documentation.

For this tutorial, we assume that the UMIs and cell barcodes (BCs) are on the 3' end, between the polyA tail and 3' primer. The UMIs and BCs can be of any length.

(Figure: single-cell read structure with the UMI/BC between the polyA tail and the 3' primer)

1. Generate CCS Reads

You can generate CCS reads using SMRT Link, SMRT Analysis, or ccs through pbbioconda. Refer to the ccs for isoseq3 section for the latest recommended ccs parameters for Iso-Seq.

The output from running CCS is either a ccs.bam or a ConsensusReadSet ccs.consensusreadset.xml. Multiple CCS BAMs can be combined into a single ConsensusReadSet dataset using dataset create.
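
For example, a minimal sketch of combining two CCS BAMs with the dataset command from SMRT Link/pbcoretools (the movie names here are placeholders):

# combine multiple CCS BAMs into a single ConsensusReadSet XML
dataset create --type ConsensusReadSet \
    combined.consensusreadset.xml \
    movie1.ccs.bam movie2.ccs.bam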

2. Read segmentation with Skera

Refer to skera.how for deconcatenating MAS-Seq reads into segmented reads (S-reads).

skera split <movie>.hifi_reads.bam adapters.fasta <movie>.skera.bam

3. Remove cDNA primers, polyA tails, and extract UMI/BCs

After skera, we are essentially left with regular CCS reads, as if they had never been concatenated!

We now run lima, then isoseq3 tag and isoseq3 refine.

# Remove 5' and 3' cDNA primer
$ lima --isoseq --dump-clips [-j cpus] <movie>.skera.bam primers.fasta output.bam

# Clip UMI and cell barcode
$ isoseq3 tag [-j cpus] output.5p--3p.bam output.5p--3p.tagged.bam --design T-8U-12B

# Remove poly(A) tails and concatemers
$ isoseq3 refine [-j cpus] \
     --require-polya \
     output.5p--3p.tagged.bam primers.fasta output.5p--3p.tagged.refined.bam 

lima must be run in --isoseq mode to properly identify the 5' and 3' cDNA primers (if you're using 10x Chromium, you'd supply the 10x cDNA primers).

isoseq3 tag is then called to extract the UMIs and single-cell barcodes (BCs). Refer to isoseq.how/umi on how to designate the length and relative position of the UMI and single-cell BC. In this example, T-8U-12B designates that there is an 8 bp UMI and a 12 bp BC after the polyA tail. The UMI is stored in the XM BAM tag and the BC in the XC BAM tag (see the isoseq BAM tag page).

isoseq3 refine removes the polyA tail (use the --require-polya option).
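
To spot-check that the UMIs and BCs were extracted as expected, you can peek at the XM/XC tags of the first couple of tagged reads (file name as in the example above):

# show the UMI (XM) and raw barcode (XC) tags on the first two refined reads
samtools view output.5p--3p.tagged.refined.bam | head -n 2 | tr '\t' '\n' | grep -E '^X[MC]:Z:'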

4. Barcode correction and UMI deduplication

Use isoseq3 correct with a single-cell barcode list (e.g., see the 10x single cell barcode list). I've made a compressed gzip file for two of the most common barcode sets, see here. Note that for the 3' kit, the barcodes had to be reverse-complemented.

# barcode correction based on single cell barcode list
$ isoseq3 correct [-j cpus] --barcodes <barcode_list> \
   output.5p--3p.tagged.refined.bam \
   output.5p--3p.tagged.refined.corrected.bam

The corrected barcodes will be stored in the CB BAM tag.

Before running UMI deduplication, we need to sort the BAM file by the CB tag first.

$ samtools sort -t CB \
   -o output.5p--3p.tagged.refined.corrected.sorted.bam \
   output.5p--3p.tagged.refined.corrected.bam

$ isoseq3 groupdedup \
   output.5p--3p.tagged.refined.corrected.sorted.bam \
   output.5p--3p.tagged.refined.corrected.sorted.dedup.bam 

The output is xxxx.dedup.[bam|fasta] which can be mapped to the genome for further analysis.
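
As a quick sanity check (assuming the deduplicated output is named dedup.bam, as in the mapping step below), you can count the deduplicated molecules and the distinct corrected cell barcodes:

# number of deduplicated molecules
samtools view -c dedup.bam

# number of distinct corrected cell barcodes (CB tag)
samtools view dedup.bam | grep -o 'CB:Z:[ACGT]*' | sort -u | wc -l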

5. Align to Genome

Either use pbmm2 (PacBio's minimap2 wrapper):

pbmm2 align [-j CPUS] \
       --sort --preset ISOSEQ \
       hg38.fasta \
       dedup.bam \
       mapped.bam

Or run minimap2 and convert to sorted BAM later:

minimap2 -t 30 -ax splice -uf --secondary=no -C5 \
   hg38.fasta dedup.fasta \
   > mapped.sam 

samtools view -bS mapped.sam | samtools sort -o mapped.bam

6. Collapse into Unique Transcripts

We will now use isoseq3 collapse, as described at isoseq.how.

isoseq3 collapse mapped.bam collapsed.gff
pigeon sort collapsed.gff -o collapsed.sorted.gff

7. Compare Against Annotation

We will use SQANTI3 to annotate each unique transcript against an annotation.

To run SQANTI3, you must provide a GTF annotation (such as GENCODE) and a reference genome fasta.

python sqanti3_qc.py --gtf dedup.5merge.collapsed.gff \
    gencode.annotation.gtf hg38.fa \
    [--fl_count dedup.5merge.collapsed.abundance.txt]

The output files from SQANTI3 are dedup.5merge.collapsed_classification.txt, dedup.5merge.collapsed_junctions.txt, corrected fasta/GTF files (e.g., dedup.5merge.collapsed_corrected.fasta), and a PDF report, dedup.5merge.collapsed_sqanti_report.pdf.
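
To get a quick overview of the classification results, you can tally isoforms per structural category; the snippet below assumes the classification file is tab-delimited with a structural_category column, as in standard SQANTI3 output:

# count isoforms per SQANTI3 structural category
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="structural_category") c=i; next} {print $c}' \
    dedup.5merge.collapsed_classification.txt | sort | uniq -c | sort -rn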

8. Filter Artifacts

This step is optional, but I recommend using SQANTI3 filter to remove library artifacts. These are artifacts that can only be detected after mapping back to the genome, including intrapriming (accidental priming off genomic 'A's), RT switching artifacts, and mismapping to non-canonical junctions.

python sqanti3_RulesFilter.py \
     dedup.5merge.collapsed_classification.txt \
     dedup.5merge.collapsed_corrected.fasta \
     dedup.5merge.collapsed.gff

The output filtered fasta/gff file is dedup.5merge.collapsed_classification.filtered_lite.[fasta|gff].
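
To see how many isoforms were removed as artifacts, you can compare the number of FASTA records before and after filtering (counted by their '>' headers):

# isoforms before filtering
grep -c '>' dedup.5merge.collapsed_corrected.fasta

# isoforms retained after filtering
grep -c '>' dedup.5merge.collapsed_classification.filtered_lite.fasta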

9. Process into CSV Report

We then use the Cupcake/singlecell scripts to generate a collated CSV file that links each mapped FLNC read to its classified genes and transcripts based on SQANTI3's output:

# compress dedup.info.csv (the collate script below expects dedup.info.csv.gz)
gzip dedup.info.csv

python <path_to>/cDNA_Cupcake/singlecell/collate_FLNC_gene_info.py \
       dedup.5merge.collapsed.group.txt \
       dedup.info.csv.gz \
       dedup.5merge.collapsed_classification.filtered_lite_classification.txt \
       dedup.annotated.csv

Following is an example output dedup.annotated.csv:

id pbid length transcript gene category ontarget ORFgroup UMI UMIrev BC BCrev
molecule/1359076 PB.72530.1 879 novel WASH3P novel_not_in_catalog NA NA GGTATTCTAAGA TCTTAGAATACC TGCCGGATGTATTCCT AGGAATACATCCGGCA
molecule/3250406 PB.10.8 436 ENST00000624431.2 FO538757.2 incomplete-splice_match NA NA AGTGAACATTGC GCAATGTTCACT TATCGAGTGTTCTATC GATAGAACACTCGATA
molecule/1830928 PB.11.2 580 ENST00000488147.1 WASH7P incomplete-splice_match NA NA TTCCATTTAGGT ACCTAAATGGAA TATGGAGCTCACAGCC GGCTGTGAGCTCCATA
molecule/1869216 PB.11.2 580 ENST00000488147.1 WASH7P incomplete-splice_match NA NA GTCTAGACGGAG CTCCGTCTAGAC CCACCTTGATGTACGT ACGTACATCAAGGTGG

We no longer recommend error-correcting the UMIs and BCs (script UMI_BC_error_correct.py) since isoseq3 dedup can already handle mismatches and indels with --max-tag-mismatches and --max-tag-shift.

At this point, the output dedup.annotated.csv can be imported into tools like R for manipulation. Happy analyzing!
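
For a quick look without leaving the command line, and assuming dedup.annotated.csv is tab-delimited with the gene in the 5th column as in the example above, you can tally the top genes by number of deduplicated molecules:

# top 10 genes by number of deduplicated molecules (UMIs)
tail -n +2 dedup.annotated.csv | cut -f5 | sort | uniq -c | sort -rn | head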

10. Producing a Seurat-compatible input from SQANTI3 isoforms

After producing an annotated CSV file (from either collate_FLNC_gene_info.py or UMI_BC_error_correct.py) and a filtered GTF file from SQANTI3, you can use make_seurat_input.py to generate a Seurat-compatible input. This script will make a sparse matrix based on the isoform counts per cell. By default, it will filter out RPS, RPL, MT-, and novel genes. These filters can be disabled with the --keep_RiboMito and --keep_novel options.

python <path_to>/cDNA_Cupcake/singlecell/make_seurat_input.py \
    -i dedup.annotated.(corrected.)csv(.gz) \
    -a dedup.5merge.collapsed_classification.filtered_lite.gtf \
    -o <path_to>/output/

make_seurat_input.py will generate an output like this:

path/to/output
└── isoforms_seurat
    ├── barcodes.tsv
    ├── genes.tsv
    └── matrix.mtx

This output can then be loaded into Seurat:

pbmc.data = Read10X(data.dir="<path_to>/output/isoforms_seurat")

You may follow the standard scRNA-seq processing tutorial by Seurat from this point on for clustering and marker gene discovery. The major difference compared to traditional scRNA-seq is that we are working with isoforms instead of genes, so the standard QC metrics may need to be modified according to your goal. For example, isoform counts tend to be sparser than gene counts, and you may not want to apply a UMI count threshold that is so stringent that it risks removing important isoforms.

In addition, a new tool called kana (preprint available here) allows you to load the matrix directly in a browser to analyze the scIso-Seq data. It is extremely fast (~12k cells in a few minutes in our experience, using a 13-inch 2019 MacBook) and performs most of the usual processing steps, including clustering and annotation.

11. Advanced: Linking molecule info to cell type, running IsoPhase, tagging BAMs

(i) Linking molecule info to cell type

If you already have dedup.annotated.csv.gz from the previous step and you also have cell barcode-to-cell type information (say, from matching short-read data), which must contain the BCrev and Celltype columns below (extra columns are OK and will be ignored):

# bc_info.csv example
BCrev,Celltype
AAACCCAAGACCAGAC,NA
AAACCCAAGGATCACG,CD4 Memory
AAACCCACAAACACCT,CD16 Mono
AAACCCACAACACTAC,CD4 Memory

Then you can add the Celltype column to your annotated CSV file using:

python <path_to>/singlecell/link_molecule_to_celltype.py \
    bc_info.csv \
    dedup.annotated.csv.gz \
    dedup.annotated.celltype.csv

Note that dedup.annotated.celltype.csv will be comma-delimited instead of tab-delimited (since cell types can contain spaces, using commas is less prone to processing errors later).

id pbid length transcript gene category ontarget ORFgroup UMI UMIrev BC BCrev Celltype
molecule/1359076 PB.72530.1 879 novel WASH3P novel_not_in_catalog NA NA GGTATTCTAAGA TCTTAGAATACC TGCCGGATGTATTCCT AGGAATACATCCGGCA CD4 Naive
molecule/3250406 PB.10.8 436 ENST00000624431.2 FO538757.2 incomplete-splice_match NA NA AGTGAACATTGC GCAATGTTCACT TATCGAGTGTTCTATC GATAGAACACTCGATA NA
molecule/1830928 PB.11.2 580 ENST00000488147.1 WASH7P incomplete-splice_match NA NA TTCCATTTAGGT ACCTAAATGGAA TATGGAGCTCACAGCC GGCTGTGAGCTCCATA CD14 Mono
molecule/1869216 PB.11.2 580 ENST00000488147.1 WASH7P incomplete-splice_match NA NA GTCTAGACGGAG CTCCGTCTAGAC CCACCTTGATGTACGT ACGTACATCAAGGTGG CD4 Memory

This celltype-annotated CSV file can be used to tag BAM files later once we run IsoPhase.

(ii) Running IsoPhase for isoform-level phasing

We will first run IsoPhase using the dedup reads and the collapsed results. We will essentially follow the same IsoPhase steps as used for bulk Iso-Seq data, except that we do not have a .read_stat.txt file, so we must first convert the collapsed .group.txt from step 6 into one.

python <path_to>/phasing/utils/convert_group_to_read_stat_file.py \
    dedup.5merge.collapsed.group.txt \
    dedup.5merge.collapsed.read_stat.txt

Now we have a .read_stat.txt that we can input to IsoPhase. Follow the IsoPhase tutorial all the way through, which will include commands like the ones below:

python <path_to>/phasing/utils/select_loci_to_phase.py \
    hg38.fa \
    dedup.fasta \
    dedup.5merge.collapsed_classification.filtered_lite.gtf \
    dedup.5merge.collapsed.read_stat.txt \
    -c 40

where dedup.5merge.collapsed_classification.filtered_lite.gtf is the result of running SQANTI3 filtering in step 8.

Then move on to generating IsoPhase commands for each locus and run them.

(iii) Tagging an aligned BAM file after IsoPhase

Suppose we found a locus for which we want to tag both the haplotype information and the cell type information. We would first align the input reads (they're from dedup.fasta, but in each folder they would be called ccs.fasta) using GMAP or minimap2, then tag the aligned BAM file with information from IsoPhase and the celltype-annotated CSV file from step (i).

# going to a locus of interest
cd by_loci/PB.211762_size130

# align the input (dedup reads) to genome to get a BAM file; use GMAP or minimap2
minimap2 -ax splice -uf --secondary=no -C5 hg38.fa ccs.fasta | samtools view -bS - | samtools sort -o ccs.hg38.sorted.bam

# tag the BAM file
python <path_to>/phasing/utils/tag_bam_post_phasing.py \
    ccs.hg38.sorted.bam \
    phased.partial.cleaned.hap_info.txt \
    ccs.hg38.sorted.tagged.bam \
    --celltype dedup.annotated.celltype.csv

samtools index ccs.hg38.sorted.tagged.bam

Now the output ccs.hg38.sorted.tagged.bam will contain cell type information using the XC tag and the haplotype information using the RG tag.

You can use IGV to visualize by first choosing "Color Alignments by" --> "tag" --> type in "RG". Then "Sort Alignments by" --> "tag" --> type in "XC".

(Example IGV screenshot of a tagged locus)

12. Advanced: Variant calling for MAS-Iso-Seq

Simple and accurate variant calling using bcftools

HiFi reads are very accurate. While IsoPhase will generate a VCF file, it contains only heterozygous variants and does not call variants with respect to a reference genome. For traditional variant calling, we can use bcftools to call variants directly from Iso-Seq reads. We will start with dedup.mapped.bam, generated by mapping the de-duplicated Iso-Seq reads (after isoseq3 dedup) to a reference genome. bcftools contains a preset designed for PacBio HiFi reads. In our experience, the option --indel-size 500 allows bcftools to be more sensitive to larger indels. Example of using hg38 to call variants:

bcftools mpileup --threads 32 -f hg38.fasta dedup.mapped.bam \
    -X pacbio-ccs --indel-size 500 --annotate FORMAT/AD,FORMAT/DP | \
    bcftools call -f GQ --threads 32 -vm -Ov | \
    bcftools norm --threads 32 -c s -f hg38.fasta - | \
    bcftools filter --threads 32 -i 'QUAL > 20 && INFO/DP > 4' > bcftools_variants.vcf

In the above, we applied a simple filter of QUAL > 20 and coverage depth > 4 to remove many low-quality variants. It is important to note that variant calling in RNA-seq or Iso-Seq is generally challenging due to the fluctuation in expression level across genes/isoforms (i.e., lower sensitivity for lowly expressed isoforms). To complicate things further, isoforms are sparse in single-cell sequencing, resulting in allelic dropout in cells.

Piling up variants in single cells and genotype demultiplexing

With a set of paediatric AML MAS-Iso-Seq data consisting of cells from an HCT donor and recipient, we showed the accuracy of the variants (called with bcftools) by demultiplexing the donor and recipient using cellsnp-lite and vireo.

First, we can filter to just the heterozygous variants and set an even more stringent QUAL cutoff (this threshold is set arbitrarily) to ensure we have the highest-quality variants for genotype demultiplexing:

bcftools filter -i 'GT="0/1" && QUAL > 51' bcftools_variants.vcf > bcftools_variants.hq.vcf

In single-cell sequencing, there are usually many "empty" cell barcodes that have few or no transcripts associated with them. The file dedup.annotated.csv allows us to count the number of UMIs associated with each cell barcode directly (you can also use R/Python to do this):

cut -f11 dedup.annotated.csv | \
    datamash -g 1 --sort count 1 | \
    sort -rnk2 > barcodes_UMI_count.tsv

awk '$2>100{print $1}' barcodes_UMI_count.tsv > barcodes_100UMI.tsv

Next, we "pile-up" the alleles in each single cell with cellsnp-lite and demultiplex the donor and recipient using vireo. Note that the isoseq pipeline currently "tag" the cell barcodes as the XC tag and the UMI as XM tag in the BAM file, so we need to set that in the parameters of cellsnp-lite.

cellsnp-lite --cellTAG XC --UMItag XM -s dedup.mapped.bam \
    -b barcodes_100UMI.tsv -R bcftools_variants.hq.vcf \
    --minMAF 0.1 --minCount 20 --gzip \
    -O cellsnp-lite_bcftools_het

vireo -c cellsnp-lite_bcftools_het -N 2 \
    -o vireo_bcftools_het \
    --callAmbientRNAs

vireo will generate a few output files. The file donor_ids.tsv assigns a donor genotype to each cell barcode based on the alleles called in that cell.
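
To get a quick overview of the demultiplexing result, you can tally the number of cells assigned to each donor (this assumes donor_id is the second column of donor_ids.tsv, which is the usual vireo output layout):

# number of cells assigned to each donor (including doublet/unassigned)
tail -n +2 vireo_bcftools_het/donor_ids.tsv | cut -f2 | sort | uniq -c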
