Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Annotation and Rarefaction Wiki

Elizabeth Tseng edited this page Sep 22, 2020 · 23 revisions

Last Updated: 09/21/2020

Python Requirement

  • BioPython

Test Data

All test data used in this wiki is available in the annotation/test_data directory.

List of Scripts

Detailed Documentation

Prepare file for subsampling efforts (rarefaction curve drawing)

Requirement: You must have run Iso-Seq cluster and post-processed it with the collapse script to get two files: the unique isoform FASTQ file (ex: test.fq) and an abundance file (ex: test.abundance.txt).

The subsampling will always be done using the number of full-length (FL) reads associated with each isoform. Therefore, the abundance file must contain the field count_fl.

To prepare the file to run subsampling, use the command below:

usage: Make subsample-ready file from ToFU/Iso-Seq collapsed output
       [-i INPUT_PREFIX] [-o OUTPUT_FILENAME] 
       [-m2 SQANTI_CLASS] [--include_single_exons]

For example:

make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt

By default, single exons are excluded. If you want to include it, use --include_single_exons.

It will automatically look for test.gff, test.abundance.txt and test.rep.fq, so make sure the three files exist.

In addition, the script can also add on the SQANTI3 classification result, should you want to subsample based on the number of known isoforms/genes.

make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt \
   -m2 test.classification.txt

The output test.for_subsampling.txt will look like this:

pbid pbgene length refisoform refgene category fl_count
PB.15650.14 15650 2627 xxxx TFDP2 novel_in_catalog 3
PB.19796.10 19796 2734 xxxx ZNF3 novel_in_catalog 3
PB.10034.17 10034 1959 xxxx TCF4 full-splice_match 3
PB.18454.7 18454 2732 xxxx NFYA full-splice_match 3

And now we are ready to do subsampling!

Running subsampling to draw rarefaction curves

At a minimum, the input file must have the following fields: pbid (must be unique ID), length (length of transcript), and fl_count (number of FL reads). You can generate the file anyway you prefer, or use the scripts provided above.

The subsample script usage is:

usage: subsample.py [-h] [--by BY] [--iterations ITERATIONS] [--range RANGE]
                    [--min_fl_count MIN_FL_COUNT] [--step STEP]
                    count_filename

positional arguments:
  count_filename

optional arguments:
  -h, --help            show this help message and exit
  --by BY               Unique specifier name(default: id)
  --iterations ITERATIONS
                        Number of iterations (default: 100)
  --range RANGE         Length range (ex: (1000,2000), default None)
  --min_fl_count MIN_FL_COUNT
                        Minimum FL count (default: 1)
  --step STEP           Step size (default: 10000)

If you don't specify the --range option, it will use all the transcripts. You can plot rarefaction curves for each size bin individually to better understand sampling efforts in different bins.

We can subsample by the collapse script gene ID (which is determined solely based on genomic locus overlap) in the 1-3kb range of transcripts:

subsample.py --by pbgene --min_fl_count 2 --step 1000 --range "(1000,3000)" \
  test.for_subsampling.txt  > test.1to3k.by_pbgene.rarefaction.txt

We will now run subsampling at the gene and isoform level, then plot them using R on the same figure.

subsample.py --by refgene --min_fl_count 2 --step 1000 \
   test.for_subsampling.txt > test.rarefaction.by_refgene.min_fl_2.txt 

subsample.py --by refisoform --min_fl_count 2 --step 1000 \
   test.for_subsampling.txt > test.rarefaction.by_refisoform.min_fl_2.txt 

Alternatively, we can use SQANTI3 classification results to subsample by category:

subsample_with_category.py --by refisoform --min_fl_count 2 --step 1000 \
    test.for_subsampling.all.txt > \
    test.rarefaction.by_refisoform.min_fl_2.by_category.txt

Plotting Rarefaction Curves in R

You can follow the R script in the test data directory to plot rarefaction curves.

The first plot shows rarefaction curve at the gene and isoform level.

The second plot shows rarefaction curve at the isoform level, by SQANTI3 classification category.