Tutorial: Demultiplexing SMRT Link Iso Seq Jobs

Last Updated: 01/04/2021

DISCLAIMER: ALL CODE AND TUTORIAL IN THIS REPOSITORY IS DEVELOPER WORK AND NOT PACBIO OFFICIAL SOFTWARE. UPDATES AND CHANGES CAN HAPPEN ANYTIME. USE AT YOUR OWN RISK.

This tutorial is for demultiplexing IsoSeq jobs in SMRT Link. The scripts offered below are standalone --- no installation required, however, you will need to have Python and also BioPython library installed.

Identifying the Unix path for your SMRT Link job
Demultiplexing IsoSeq jobs without a reference genome
Demultiplexing IsoSeq jobs with a reference genome
Making up your own classify report to manually de-multiplex
Creating demultiplexed GFF and FASTA/FASTQ files after demultiplexing

Prerequisite

Python (3.7.x)
BioPython

How to use the scripts

No installation is required. You can directly copy the scripts to your local folder:

$ mkdir ~/github
$ cd ~/github/
$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
$ mkdir ~/test_dir/
$ cd ~/test_dir
$ cp ~/github/cDNA_cupcake/post_isoseq_cluster/demux*py .

Or add them to your PATH variable:

$ export PATH=$PATH:~/github/cDNA_cupcake/post_isoseq_cluster/

Identifying the Unix path for your SMRT Link job

You can find the Unix path for your SMRT Link IsoSeq job by going to the "Analysis Overview" page and copying the "Path" value. In this example, the path is <path>/userdata/jobs_root/0000/0000114/0000114372.

Demultiplexing IsoSeq jobs without a reference genome

If you ran IsoSeq without a reference genome, the demultiplexing script will output the associated full-length read counts for each HQ isoform.

$ ln -s <path>/userdata/jobs_root/0000/0000114/0000114372 job114372
$ python demux_isoseq_no_genome.py -j job114372 -o job114372.hq_isoform_fl_count.txt

The output job114372.hq_isoform_fl_count.txt will show the associated FL count for each HQ isoform:

id,NEB_5p--barcode1_3p,NEB_5p--barcode2_3p
transcript/1,0,3
transcript/4,10,3
transcript/11,3,5

Providing files directly instead of job dir

Instead of providing a job directory, you can also directly provide the required input files to demultiplex.

$ python demux_isoseq_no_genome.py --hq_fafq hq_transcripts.fastq \
    --cluster_csv cluster_report.csv \
    --classify_csv flnc_report.csv \
    -o job114372.hq_isoform_fl_count.txt

Demultiplexing IsoSeq jobs with a reference genome

If you ran IsoSeq with a reference genome, the demultiplexing script will output the associated full-length read counts for each mapped, unique, isoform. Each isoform will have the ID format PB.X.Y.

$ ln -s <path>/userdata/jobs_root/0000/0000114/0000114372 job114372
$ python demux_isoseq_with_genome.py -j job114372 -o job114372.mapped_fl_count.txt

The output will look like:

id,NEB_5p--barcode1_3p,NEB_5p--barcode2_3p
PB.1.2,0,35
PB.2.4,3,1
PB.2.2,0,5
PB.2.3,0,6
PB.2.5,0,3
PB.3.1,0,2

Providing files directly instead of job dir

Instead of providing a job directory, you can also directly provide the required input files to demultiplex.

$ python demux_isoseq_with_genome.py --mapped_fafq mapped.fastq \
    --read_stat mapped.read_stat.txt \
    --classify_csv flnc_report.csv \
    -o job114372.mapped_fl_count.txt

Making up your own classify report to manually de-multiplex

What happens if you actually did not multiplex the samples and say, you merely combined different runs into a single Iso-Seq run and now you need to manually multiplex?

You can make a fake classify report flnc_report.csv that is a comma-delimited CSV file with just two fields:

id,primer
m54278_181025_184153/4194470/ccs,sample1
m54278_181025_184153/4194611/ccs,sample1
m54278_181025_184153/4259972/ccs,sample1
m54278_181110_094539/74908648/ccs,sample2
m54278_181110_094539/74908650/ccs,sample2
m54278_181110_094539/74908653/ccs,sample2

Each id is a FLNC read, and the primer field tells you which sample it comes from. You can make this fake report by, say, using the movi name (ex: m54278_181025_184153) which is uniquely associated with each run to identify the run-specific samples.

Below is an example of a Python code snippet that can be modified to make a fake FLNC report:

d = {'m54278_181109_232120': 'sample1', 
     'm54278_181103_072310': 'sample2', 
     'm54278_181110_094539': 'sample3'} 
 
h = open('fake.flnc_report.csv', 'w') 
h.write("id,primer\n") 
 
f = open('mapped.read_stat.txt') 
f.readline() 
for line in f: 
    seqid = line.strip().split()[0] 
    movie = seqid.split('/')[0] 
    h.write("{0},{1}\n".format(seqid, d[movie])) 
h.close()

Creating demultiplexed GFF and FASTA/FASTQ files after demultiplexing

After you have a demultiplexed FL count file, you can create a demultiplexed GFF (and optionally, FASTA or FASTQ) files for each barcode group.

You can use demux_by_barcode_groups.py, providing a mapped GFF file, the demux FL count file you obtained from running the demux scripts above, and list of tuples (surrounded by double quote) to indicate the grouping of each barcode.

usage: demux_by_barcode_groups.py [-h] [--pooled_fastx POOLED_FASTX]
                                  pooled_gff demux_count_file output_prefix
                                  outgroup_dict

For example, using the following, each barcode is separated into a different output:

python demux_by_barcode_groups.py 
    --pooled_fastx collapse_isoforms.fastq \
    collapse_isoforms.gff \
    mapped_fl_count.txt \
    output_demux \
    "('NEB_5p--barcode1_3p','barcode1'),('NEB_5p--barcode2_3p','barcode2'),('NEB_5p--barcode3_3p','barcode3'), 
    ('NEB_5p--barcode4_3p','barcode4')"

From this we will get output:

output_demux_barcode1_only.gff
output_demux_barcode1_only.fastq
output_demux_barcode2_only.gff
output_demux_barcode2_only.fastq
output_demux_barcode3_only.gff
output_demux_barcode3_only.fastq
output_demux_barcode4_only.gff
output_demux_barcode4_only.fastq

Alternatively, say barcode 1 and 2 are actually two samples from the same condition, and barcode 3 and 4 belongs to another condition. We can run:

python demux_by_barcode_groups.py 
    --pooled_fastx collapse_isoforms.fastq \
    collapse_isoforms.gff \
    mapped_fl_count.txt \
    output_demux \
    "('NEB_5p--barcode1_3p','conditionA'),('NEB_5p--barcode2_3p','conditionA'),('NEB_5p--barcode3_3p','conditionB'), 
    ('NEB_5p--barcode4_3p','conditionB')"

From this we will get output:

output_demux_ConditionA_only.gff
output_demux_ConditionA_only.fastq
output_demux_ConditionB_only.gff
output_demux_ConditionB_only.fastq

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial: Demultiplexing SMRT Link Iso Seq Jobs

Prerequisite

How to use the scripts

Identifying the Unix path for your SMRT Link job

Demultiplexing IsoSeq jobs without a reference genome

Providing files directly instead of job dir

Demultiplexing IsoSeq jobs with a reference genome

Providing files directly instead of job dir

Making up your own classify report to manually de-multiplex

Creating demultiplexed GFF and FASTA/FASTQ files after demultiplexing

Clone this wiki locally