Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Tutorial: Iso Seq Bioinfx hands on

Elizabeth Tseng edited this page Aug 23, 2022 · 1 revision

Last Updated: 01/19/2022

Tutorial data can be downloaded here

1. isoseq3

Official IsoSeq(3) documentation.

conda activate isoseqtoy
cd ~/workshop_data/isoseq3/

cat run.sh

Cheat Sheet:

lima --isoseq --dump-clips --peek-guess -j 8 alz.ccs.bam isoseq_primers.fasta alz.demult.bam 

isoseq3 refine --require-polya alz.demult.5p--3p.bam isoseq_primers.fasta alz.flnc.bam -j 8

isoseq3 cluster alz.flnc.bam alz.polished.bam --verbose --use-qvs -j 8

pbmm2 align ../hg38.mmi alz.polished.hq.bam alz.aligned.bam --preset ISOSEQ --sort -j 8

isoseq3 collapse alz.aligned.bam alz.collapsed.gff

If you need to remove a file or symbolic link, use rm <filename>. But be careful that if you remove a symbolic link - the actual file is still there. If you remove a physical file, the file is gone.

Use cat run.sh to look at the commands we are gonna run.

NOTE: there is one typo in run.sh that will cause a command to fail. Can you spot which one?

Practice 1.1. Check the isoseq3 version in your environment

isoseq3 --version

Practice 1.2. Run through isoseq3 steps, starting from lima to isoseq3 cluster output.

Practice 1.3. Learn to use pbmm2 and isoseq3 collapse to align isoseq3 cluster output (HQ isoforms) to the reference genome, collapse to produce the GFF output. Find the section in SMRT Tools documentation that describes how to do this.

Practice 1.4. Learn to visualize the GFF output using IGV and UCSC Genome Browser.

  • Can you change the track name?
  • Can you change the track colors?
  • Can you navigate to a specific gene (ex: USP11)

NOTE: you may need to delete the comments (the beginning lines starting with # in the GFF file) for IGV to show the exon chain properly. It'll work just fine in UCSC without modification.

2. Cupcake Fun

Remember to deactivate the previous conda env and load Cupcake module

conda activate SQANTI3.env
cd ~/workshop_data/isoseq3/

Practice 2.1. Practice reverse complementing sequences using revcomp.py

revcomp.py ATGGGG
revcomp.py CGAGAATT

Practice 2.2. Practice getting sequence length statistics using get_seq_stats.py. What is different about the three commands below? What are they trying to do?

get_seq_stats.py --help
get_seq_stats.py alz.collapsed.fasta
get_seq_stats.py -b 100 alz.collapsed.fasta

Practice 2.3. Practice using get_seqs_from_list by making a sequence ID list you want to extract

We will practice by extracting the first 5 sequences from the isoseq3 output HQ fasta file.

gunzip alz.polished.hq.fasta.gz
get_seqs_from_list.py --help
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//' | sed 's/>//' > ids.txt
get_seqs_from_list.py alz.polished.hq.fasta ids.txt | more

Try running the complicated grep line one segment at a time - what do you get when you do the commands below?

grep ">" alz.polished.hq.fasta 
grep ">" alz.polished.hq.fasta | head -n 5 
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//' 
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//' | sed 's/>//'
cat ids.txt

Practice 2.4. Practice converting between fasta <-> fastq using fa2fq.py and fq2fa.py

cp isoseq_primers.fasta test.fasta
fa2fq.py test.fasta
cat test.fastq
fq2fa.py test.fastq
cat test.fasta

(BONUS) Practice 2.5. Alternative way to get collapsed GFFs - using collapse_isoforms_by_sam.py. Use the same dataset from Practice 1.3 to practice this.

NOTE: the output GFF can differ slightly due to implementation differences between Cupcake collapse and isoseq3 collapse.

You can check that you have Cupcake installed and the $PATH variable set correctly by:

$ which collapse_isoforms_by_sam.py
<home_dir>/mylib/bin/collapse_isoforms_by_sam.py

3. Cogent

conda activate SQANTI3.env
cd ~/workshop_data/Cogent

Practice 3.1. Check Cogent version by typing reconstruct_contig.py --version

Practice 3.2. Run gene family finding. How many input sequences are there? How many partitions (gene families) is produced?

You can just run the two commands in run.sh for this.

Practice 3.3. Run coding genome reconstruction. How many reconstructed contigs are there for each partition?

*NOTE: run.sh does not contain the commands to do this, you will need to figure that out yourself! * HINT: it will involve the following commands: cd <SOMETHING_SOMETHING> to get you into a directory, then reconstruct_contig.py . to run the results

(BONUS) Practice 3.4. Align both the input (in.trimmed.fa) and output (cogent2.fa) to the ref genome hg38. Convert the SAM file either to BAM (using samtools) or GFF (using sam_to_collapsed_gff.py from Cupcake) and visualize it in IGV or UCSC Genome Browser.

4. Single Cell

Official isoseq3 UMI/single cell documentation. Also refer to the Cupcake single-cell wiki.

conda activate isoseqtoy
cd ~/workshop_data/singlecell

You can follow along the single cell example

5. SQANTI3

conda activate SQANTI3.env
cd ~/workshop_data/SQANTI3

cat run.sh