
Cupcake: supporting scripts for Iso Seq after clustering step


Last Updated: 08/23/2022

IMPORTANT: Cupcake now uses Python 3.7!


What is Cupcake?

Cupcake is a set of supporting scripts for processing Iso-Seq data. It expects either high-quality isoform sequences (output of Iso-Seq clustering step) or full-length reads (output of Iso-Seq classify step).

Cupcake is not Iso-Seq! Official Iso-Seq software documentation through (pb)BioConda is here

Do you need SMRTAnalysis/SMRTLink to run Cupcake?

No! Cupcake is not Iso-Seq. It is a set of downstream supporting scripts.

What do the Cupcake scripts do?

After running through the Iso-Seq pipeline, you should have either the HQ isoform sequences or the FLNC reads. Each sequence is assumed to be full-length, but the set may be redundant because multiple sequences can represent the same transcript.

So, the recommended next steps are:

  1. Align the Iso-Seq sequences to the reference genome (best practice: minimap2, GMAP, STAR, or BLAT)
  2. Collapse identical isoforms to obtain the final set of unique, full-length, high-quality isoforms
  3. Obtain associated count information for each unique isoform
  4. Run SQANTI3 for classification and library artifact removal
  5. Find fusion genes

Installing Cupcake

Unlike many of the other scripts in this repository, which you can use by simply copying them or adding them to your PYTHONPATH, Cupcake itself requires installation.

The prerequisites for installation are BioPython and bx-python, which must already be installed. I also recommend using a Python distribution such as Anaconda so you can maintain a clean install environment.

Anaconda already comes with BioPython. If you use Cogent or ANGEL, you likely have Anaconda installed already. Installing Cupcake then looks like this:

To create an anaconda environment for Python 3.7:

export PATH="/home/anacondaPy37/bin:$PATH"
conda -V
conda update conda
conda create -n anaCogent python=3.7 anaconda
# remember to log out and log back in
export PATH="/home/anacondaPy37/bin:$PATH"
conda activate anaCogent

Note: if you don't already have Cython, you will need to install it:

conda install -n anaCogent -c anaconda cython

Once the Anaconda environment has been created:

# activate anaconda environment, here I assume it's called anaCogent
$ source activate anaCogent
(anaCogent)$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(anaCogent)$ cd cDNA_Cupcake
(anaCogent)$ python setup.py build
(anaCogent)$ python setup.py install

After installation is complete, confirm the Cupcake scripts are in your bin:

(anaCogent)$ which collapse_isoforms_by_sam.py 
~/anaconda3/envs/anaCogent/bin/collapse_isoforms_by_sam.py
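
If you want an additional sanity check, you can also try importing the package directly; this is a quick sketch, assuming the package installs under the top-level name cupcake (verify on your system):

(anaCogent)$ python -c "import cupcake; print(cupcake.__file__)"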

Collapse redundant isoforms (has genome)

UPDATE 08/23/2022!! collapse_isoforms_by_sam.py is now implemented as isoseq3 collapse in the official Iso-Seq software suite. It is recommended that you transition to isoseq3 collapse as the Cupcake collapse script will be deprecated by 2023.

The collapse script usage is:

usage: collapse_isoforms_by_sam.py [-h] [--input INPUT] [--fq] -s SAM -o
                                   PREFIX [-c MIN_ALN_COVERAGE]
                                   [-i MIN_ALN_IDENTITY]
                                   [--max_fuzzy_junction MAX_FUZZY_JUNCTION]
                                   [--max_5_diff MAX_5_DIFF]
                                   [--max_3_diff MAX_3_DIFF]
                                   [--flnc_coverage FLNC_COVERAGE]
                                   [--gen_mol_count] [--dun-merge-5-shorter]

Test data is available in the cupcake/test_data subdirectory. The input we will need is hq_isoforms.fastq, which is the output from running Iso-Seq.

I've already provided the aligned, sorted sam file hq_isoforms.fastq.sorted.sam, but in case you want to try it for yourself, the minimap2 command is:

minimap2 -ax splice -t 30 -uf --secondary=no -C5 \
   hg38.fa hq_isoforms.fastq > hq_isoforms.fastq.sam

sort -k 3,3 -k 4,4n hq_isoforms.fastq.sam > hq_isoforms.fastq.sorted.sam

Now we can run the collapse script:

collapse_isoforms_by_sam.py --input hq_isoforms.fastq --fq \
   -s hq_isoforms.fastq.sorted.sam --dun-merge-5-shorter -o test

The output files are test.collapsed.gff, test.collapsed.rep.fq, and test.collapsed.group.txt.

There is also test.ignored_ids.txt, which lists the sequences that were ignored because they were unmapped, had too low coverage (change the cutoff with the -c option; default is 0.99), or too low identity (-i option; default is 0.95).

We can also look at test.collapsed.group.txt to see how much collapse/merging of identical isoforms there is. For example,

PB.4.1  sample1_HQ_transcript/59
PB.4.2  sample1_HQ_transcript/63,sample1_HQ_transcript/64
PB.4.3  sample1_HQ_transcript/71

shows that the unique isoform PB.4.2 was formed by merging two redundant input isoforms.
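
For a quick tally of how many input sequences each unique isoform absorbed, a one-liner along these lines works (a sketch, assuming the tab-separated, comma-delimited two-column format shown above):

awk -F'\t' '{n = split($2, a, ","); print $1 "\t" n}' test.collapsed.group.txt | sort -k2,2nr | head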

The naming system for the post-collapse isoforms is PB.<loci_index>.<isoform_index>. Each locus is strand-specific and groups isoforms that overlap by at least 1 bp. So PB.11.1 and PB.11.2 means locus 11 has two isoforms.

Collapse redundant isoforms (no genome)

If you do not have a reference genome, please look at the tutorial here for using either CD-HIT or Cogent for collapse.

Obtain associated count information

The usage for obtaining count information after collapse is:

usage: Get abundance/read stat information after running collapse script.
       [-h] collapse_prefix cluster_report

Now that we have unique isoforms, we can go back and retrieve the number of associated FL and nFL counts. To get this information, we need cluster_report.csv. The CSV report tells us every full-length read that was associated with each cluster. Note that not every cluster becomes an HQ isoform, and not every HQ isoform makes it into the post-collapse unique isoform set (some are filtered out for lack of mapping or low identity/coverage), so not all reads will contribute to the unique isoform count information.

Let's run the abundance script:

get_abundance_post_collapse.py test.collapsed cluster_report.csv

This produces two output files. The first is the detailed per-read report test.collapsed.read_stat.txt.

The read stat file associates each FL read (id) with an isoform (pbid):

id  length  is_fl   stat    pbid
m64012_190727_053041/105120134/ccs  NA  Y   unique  PB.16.1
m64012_190727_053041/21628950/ccs   NA  Y   unique  PB.16.1
m64012_190727_053041/114950281/ccs  NA  Y   unique  PB.16.1
m64012_190727_053041/87229073/ccs   NA  Y   unique  PB.16.1
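
To cross-check the abundance file below, you can count FL reads per isoform straight from the read stat file (skipping the header line; this assumes the file is tab-delimited with pbid as the fifth column, as shown above):

tail -n +2 test.collapsed.read_stat.txt | cut -f5 | sort | uniq -c | sort -rn | head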

The next file is test.collapsed.abundance.txt, which provides the FL count information.

#
# -----------------
# Field explanation
# -----------------
# count_fl: Number of associated FL reads
# norm_fl: count_fl / total number of FL reads, mapped or unmapped
# Total Number of FL reads: 1065
#
pbid    count_fl        norm_fl
PB.1.1  2       1.8779e-03
PB.1.2  6       5.6338e-03
PB.1.3  3       2.8169e-03
PB.2.1  3       2.8169e-03
PB.3.1  2       1.8779e-03
PB.4.1  8       7.5117e-03
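
As a quick sanity check of the math: PB.1.1 has 2 supporting FL reads out of 1065 total, so norm_fl = 2/1065 ≈ 1.8779e-03, matching the table.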

Obtain associated count information by barcode

If you barcoded your samples and ran Iso-Seq with the barcodes pooled, you can use the demux scripts to get demultiplexed FL count information per barcode/sample.

Filter collapse results by minimum FL count support

The usage for filter_by_count.py is:

usage: filter_by_count.py 
    [--min_count MIN_COUNT] [--dun_use_group_count] input_prefix

Now that we have the collapsed result and the count information associated with each unique isoform, we can further filter out isoforms with low FL count support. By default, an HQ isoform (and hence a unique isoform) must have at least 2 FL reads supporting it.

If we use a minimum of 10 FL counts as a cutoff:

filter_by_count.py --min_count 10 --dun_use_group_count test.collapsed

We will see that the output files are test.collapsed.min_fl_10.gff, test.collapsed.min_fl_10.abundance.txt, and test.collapsed.min_fl_10.rep.fq.
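
To confirm how many isoforms survive the cutoff, you can count the records in the filtered FASTQ (each FASTQ record spans four lines):

echo $(( $(wc -l < test.collapsed.min_fl_10.rep.fq) / 4 ))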

Filter away 5' degraded isoforms

The usage for filter_away_subset.py is:

usage: filter_away_subset.py [--fuzzy_junction] input_prefix

If we revisit the collapsed results, we'll notice that PB.10.3, PB.10.4, PB.10.5, PB.10.6 have the same exonic structures as PB.10.1 or PB.10.2 but are missing some of the 5' exons.

Since the Clontech SMARTer cDNA kit used to create the full-length cDNA does not do cap trapping, it is possible these are 5' RNA degradation products and not biologically meaningful. In that case, we may wish to filter them away.

Using the output from the collapsed step:

filter_away_subset.py test.collapsed

We now obtain output files named test.collapsed.filtered.gff, test.collapsed.filtered.abundance.txt, and test.collapsed.filtered.rep.fq.

We can see that the filtered result leaves only PB.10.1 and PB.10.2.

Chain samples together

If you run multiple Iso-Seq jobs, after mapping them back to the genome and getting the non-redundant, collapsed transcripts, each sample will have its own set of unique IDs, such as PB.1.1, PB.1.2, PB.2.1, and so on.

This tutorial shows how to chain samples together to display the correspondence between the different IDs.

You will need to first make a config file:

SAMPLE=<sample name>;<path>
SAMPLE=<sample name>;<path>

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

You can add as many SAMPLE lines as you have samples. Note: do not include spaces, semicolons, or other punctuation in the sample names. <path> denotes the directory in which that sample's output resides.

In each sample <path>, the script expects the following three files to exist: GROUP_FILENAME, GFF_FILENAME, and COUNT_FILENAME. If your file names are different, please rename them or change the names in the config file. FASTQ_FILENAME is optional.
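
For example, a minimal config for two samples might look like the following (the sample names and paths here are purely illustrative):

SAMPLE=heart;/data/isoseq/heart
SAMPLE=liver;/data/isoseq/liver

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt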

Full usage is:

chain_samples.py [-h] [--fuzzy_junction FUZZY_JUNCTION]
                        [--dun-merge-5-shorter] [--max_3_diff MAX_3_DIFF]
                        [--cpus CPUS]
                        config_file {norm_fl,count_fl}

positional arguments:
  config_file
  {norm_fl,count_fl}    Which count field to use for chained sample (default:
                        count_fl)

optional arguments:
  -h, --help            show this help message and exit
  --fuzzy_junction FUZZY_JUNCTION
                        Max allowed distance in junction to be considered
                        identical (default: 0 bp)
  --dun-merge-5-shorter 
                        Don't collapse shorter 5' transcripts (default: off)
  --max_3_diff MAX_3_DIFF
                        Maximum 3' difference allowed (default: 30bp)
  --cpus CPUS           Number of CPUs to use for multi-threading (default: 8)

where --fuzzy_junction is the maximum number of base differences allowed at junction sites for two isoforms to still be chained together. Without --dun-merge-5-shorter, a truncated 5' isoform in sample A is chained with a longer, compatible isoform in sample B; with the flag, the two are kept as separate isoforms.

Try running the test data with two sets of params to see the difference:

## case 1: treat 5' truncated isoforms as same
chain_samples.py sample.config count_fl

## case 2: treat 5' truncated isoforms as different
chain_samples.py sample.config count_fl --dun-merge-5-shorter

The first command will generate a (chained) set of 2 isoforms. The latter will generate 3. This is because PB.2.1 in sample A is a truncated form of PB.2.1 in sample B.

Experimental: Iterative Chaining

NOTE: to use iterative chaining, you must start with a clean directory that contains absolutely nothing besides the config files ('sample.config1', 'sample.config2', etc.). And once you start running them iteratively, you should never delete any files (intermediate files are used as input to the next iteration) or change the order in which you iteratively chain the files.

The way you run iterative chaining is by making a series of config files. Let's say you have 5 samples. We will start by chaining 3 of them, then add the 4th to the result of the first 3, then the 5th to the result of the first 4.

To do so, you need to have a series of config files.

# sample.config1

SAMPLE=sample1;path1
SAMPLE=sample2;path2
SAMPLE=sample3;path3

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

Again, we expect the path1/ directory to contain the files path1/touse.group.txt, path1/touse.gff, path1/touse.count.txt, and so on.

We will first run this with chain_samples.py sample.config1 count_fl. The intermediate output is tmp_<sample>.XXX. You will see tmp files for sample2 and sample3 but not sample1. DO NOT delete these. We need all of them for "unpacking" the chains as we go on.

The next config file builds on the last one:

# sample.config2

tmpSAMPLE=sample1;path1
tmpSAMPLE=sample2;path2
tmpSAMPLE=sample3;path3
SAMPLE=sample4;path4

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

It is important that the first three lines from sample.config1 remain unchanged except for the prefix change from SAMPLE= to tmpSAMPLE=. Do not change the order of the samples from one config file to another! Whenever you add a new sample, it must come after the existing samples. Basically we specify that the first 3 samples have already been chained by using the prefix tmpSAMPLE=.

Now we run chain_samples.py sample.config2 count_fl. You will see that now we have tmp_sample4.XXX.

# sample.config3

tmpSAMPLE=sample1;path1
tmpSAMPLE=sample2;path2
tmpSAMPLE=sample3;path3
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

Now we run chain_samples.py sample.config3 count_fl. The final results are tmp_sample5.XXX, which, by the way, gets copied as all_samples.chained_XXX as well. So you can directly use the latest all_samples.chained_XXX as your final output.
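
Putting it all together, the whole iterative procedure is just three invocations, run in order from the same directory:

chain_samples.py sample.config1 count_fl
chain_samples.py sample.config2 count_fl
chain_samples.py sample.config3 count_fl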

A few final notes on iterative chaining:

  • As good practice, you should keep the GROUP_FILENAME, GFF_FILENAME, COUNT_FILENAME, FASTQ_FILENAME configuration the same for all samples.
  • Always start with an initially clean directory, with nothing but sample.config files.
  • When you add on chaining, simply do so by adding more SAMPLE= lines after previously chained samples and change the previously chained samples from SAMPLE= to tmpSAMPLE=.
  • You can add more than one new sample at a time.

To drive home my point, here's a valid config file:

# THIS IS A VALID CONFIG
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5
SAMPLE=sample6;path6
SAMPLE=sample7;path7

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

And here is an invalid config file:

# THIS IS NOT A VALID CONFIG. Chained samples must precede all new samples.
tmpSAMPLE=sample1;path1
tmpSAMPLE=sample4;path4
SAMPLE=sample5;path5
tmpSAMPLE=sample6;path6
tmpSAMPLE=sample7;path7

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
FASTQ_FILENAME=optional.rep.fastq

Finding fusion genes

The usage of fusion_finder.py is nearly identical to that of the collapse_isoforms_by_sam.py script.

In fact, collapse_isoforms_by_sam.py by default ignores chimeric transcripts because it requires that a single continuous alignment has at least 99% coverage. (The 99% threshold can be lowered using the -c option.)

usage: fusion_finder.py [-h] [--input INPUT] [--fq] -s SAM -o PREFIX
                        [--cluster_report_csv CLUSTER_REPORT_CSV]
                        [--dun-merge-5-shorter] [-c MIN_LOCUS_COVERAGE]
                        [--min_locus_coverage_bp MIN_LOCUS_COVERAGE_BP]
                        [-t MIN_TOTAL_COVERAGE] [-d MIN_DIST_BETWEEN_LOCI]

optional arguments:
  --input INPUT         Input FA/FQ filename
  --fq                  Input is a fastq file (default is fasta)
  -s SAM, --sam SAM     Sorted GMAP SAM filename
  -o PREFIX, --prefix PREFIX
                        Output filename prefix
  --cluster_report_csv CLUSTER_REPORT_CSV
                        cluster_report.csv, optional, if given will generate
                        count info.
  --dun-merge-5-shorter
                        Don't collapse shorter 5' transcripts (default: turned
                        off)
  -c MIN_LOCUS_COVERAGE, --min_locus_coverage MIN_LOCUS_COVERAGE
                        Minimum per-locus coverage in percentage (default:
                        0.05)
  --min_locus_coverage_bp MIN_LOCUS_COVERAGE_BP
                        Minimum per-locus coverage in bp (default: 1 bp)
  -t MIN_TOTAL_COVERAGE, --min_total_coverage MIN_TOTAL_COVERAGE
                        Minimum total coverage (default: 0.99)
  -d MIN_DIST_BETWEEN_LOCI, --min_dist_between_loci MIN_DIST_BETWEEN_LOCI
                        Minimum distance between loci, in bp (default: 10000)

The criteria fusion_finder.py currently uses for fusion candidates are that a single transcript must:

  • Map to two or more separate loci
  • Align at least 5% of the transcript at each mapped locus
  • Have a combined alignment coverage (over all mapped loci) of at least 99%
  • Have mapped loci that are at least 10 kb apart

The thresholds (5% per-locus mapping, 99% total mapping, loci 10 kb apart) are rather ad hoc, intended to rule out generally low-quality transcripts, junk transcripts, and mismappings. If you have better suggestions for parameter values, please raise them via the Issues page and I could eventually open up the parameter choices.
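
For reference, the criteria above map directly onto the command-line options; running with the default thresholds spelled out explicitly would look like this (values taken from the usage above, adjust to experiment):

fusion_finder.py --input hq_isoforms.fastq --fq \
   -s hq_isoforms.fastq.sorted.sam -o hq_isoforms.fasta.fusion \
   -c 0.05 --min_locus_coverage_bp 1 -t 0.99 -d 10000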

Example Usage

If you don't already have a SAM alignment, create one using the following commands:

minimap2 -ax splice -t 30 -uf --secondary=no -C5 \
   hg38.fa hq_isoforms.fastq > hq_isoforms.fastq.sam

sort -k 3,3 -k 4,4n hq_isoforms.fastq.sam > hq_isoforms.fastq.sorted.sam

IMPORTANT: You MUST RUN minimap2 with the --secondary=no so there are no secondary alignments!

Next run fusion_finder.py:

fusion_finder.py --input hq_isoforms.fastq --fq -s hq_isoforms.fastq.sorted.sam \
-o hq_isoforms.fasta.fusion \
--cluster_report_csv cluster_report.csv

The output GFF file hq_isoforms.fasta.fusion.gff contains the information for fusion candidates. Each candidate will have two or more "fusion components". For example, you may see GFF entries for PBfusion.1.1, PBfusion.1.2, PBfusion.2.1, and PBfusion.2.2. The pair (PBfusion.1.1, PBfusion.1.2) denotes the two fusion components of the transcript PBfusion.1. This naming system differs from the output of collapse_isoforms_by_sam.py, where PB.1.1 and PB.1.2 mean two separate transcripts mapping to the same locus.
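
To list the distinct fusion transcript IDs in the output, something like the following works (a sketch, assuming the PBfusion.X.Y IDs appear verbatim in the GFF attribute fields):

grep -oE 'PBfusion\.[0-9]+\.[0-9]+' hq_isoforms.fasta.fusion.gff | sort -u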

cluster_report.csv is optional. You can find this file in your SMRTLink job directory <jobid>/tasks/pbtranscript.tasks.combine_cluster_bins-0/cluster_report.csv. If provided, it will generate an abundance file hq_isoforms.fasta.fusion.abundance.txt.

Read more about how to follow up with classifying and filtering fusion candidates.

Summarizing and scrubbing junctions across multiple samples

We now recommend using SQANTI3 to summarize junctions and filter for potential artifacts.

Making colored BED format shaded by isoform abundance

See the Alzheimer Iso-Seq UCSC track for an example of colored BED files.

Input:

  • GTF/collapsed GFF
  • SQANTI3 classification output (which used the input GTF above)

Output:

  • A BED12 file that contains RGB color code where darkest (black) is the most abundant isoform in a gene, and light pink is the least abundant

Prerequisite:

  • Cupcake installed (this repo)
  • gtfToGenePred and genePredToBed (download from UCSC utilities)

Given the GTF/collapsed GFF output from running collapse_isoforms_by_sam.py (or other tools), you can convert the GTF to BED format, then color the isoforms using the FL count information, normalized per gene.

Usage for the script is:

usage: color_bed12_post_sqanti.py [-h]
                                  class_filename bed12_filename output_prefix

positional arguments:
  class_filename  SQANTI(3) classification filename
  bed12_filename  Input BED12 filename (converted from same SQANTI3 input GTF)
  output_prefix

For example:

gtfToGenePred input.gtf input.genePred
genePredToBed input.genePred input.bed12
color_bed12_post_sqanti.py input.sqanti_classification.txt input.bed12 output.colored

The script will automatically look for the FL count column in the SQANTI3 classification output (you must run SQANTI3 with --fl_count supplied). If there are multiple samples, the classification file will have columns FL.sample1, FL.sample2, etc., and the script will output files such as output.colored.CPM.sample1.bed12, output.colored.CPM.sample2.bed12, and so on.

Right now we are using a unit called CPM ("full-length counts per million") and that number is displayed for each isoform in the format of

"PB.X.Y | SQANTI3-classified gene name | CPM"

Summarizing and plotting transcript/exon/intron stats after collapse

NOTE: This is a substitute for those who do not have a reference annotation to run SQANTI3, which is a tool that generates more comprehensive stats. The only prerequisite you need to generate the stats & figures here is a reference genome that you can align to and run collapse_isoforms_by_sam.py on.

The script usage is:

usage: simple_stats_post_collapse.py [-h] input_prefix

positional arguments:
  input_prefix  Input prefix, ex: hq.5merge.collapsed

For example:

# first you need to run collapse script to get a GFF file
 
$ collapse_isoforms_by_sam.py --input hq_isoforms.fastq --fq \
   -s hq_isoforms.fastq.sorted.sam --dun-merge-5-shorter -o test

$ simple_stats_post_collapse.py test.collapsed

This will give you two output files, test.collapsed.simple_stats.txt and test.collapsed.exon_stats.txt. You can then use the provided R script beta/plot_simple_stats_post_collapse.R to plot figures.

(The R prerequisites are ggplot2, dplyr, and ggthemes.)

$ Rscript <path_to_Cupcake>/beta/plot_simple_stats_post_collapse.R \
     test.collapsed.simple_stats.txt \
     test.collapsed.exon_stats.txt \
     test.myfigures

This will generate several test.myfigures.Rplot.*.png files in your directory.
