Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Tutorial: APIs for reading SAM, GFF, etc...

Elizabeth Tseng edited this page Dec 7, 2020 · 8 revisions

Last Updated: 12/7/2020

A random collection of useful APIs or little tricks to work with SAM/BAM, GTF/GFF, etc.

Prerequisite

You need to have installed Cupcake to use following API.



Subsetting a PacBio BAM file based on a list of wanted ZMWs

Here is a Python code snippet you can modify to get a list of ZMWs you want.

For PacBio HiFi (CCS) BAM files, the variable r.qname will contain the ZMW name like:

# format is <movie>/<zmw>/ccs
m64049_200801_085545/12/ccs
m64049_200801_085545/13/ccs
m64049_200801_085545/14/ccs
...

If your CCS BAM file only contains reads from the same run (same movie name), you can just make a wanted list that is just the ZMWs, like:

# cat zmws_to_use.txt
12
13
14

In this case using r.qname.split('/')[1] will extract just the ZMW number.

If you have multiple movie names and need to be more explicit, such as:

# cat zmws_to_use.txt
movie1/12
movie2/12
movie1/13

Then change r.qname.split('/')[1] to r.qname[:r.qname.rfind('/')] will extract movie/zmw instead.

import pysam
bam_filename = 'ccs.bam'
touse = [line.strip() for line in open('zmws_to_use.txt')]
f = pysam.AlignmentFile('ccs.bam', 'wb', header=reader.header)
reader = pysam.AlignmentFile(bam_filename, 'rb', check_sq=False)
for r in reader:
    if r.qname.split('/')[1] in touse:
        print(r.qname.split('/')[1])
        f.write(r)
f.close()

Reading SAM files

The following SAM reader is designed to work with GMAP SAM output or any SAM output that has the same structure as GMAP SAM outputs.

>>> from cupcake.io.BioReaders import GMAPSAMReader
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True)
>>> r = reader.next()

        qID: i0HQ|c58287/f1p3/338
        sID: chr1
        cigar: 112M1D226M
        sStart-sEnd: 106761814-106762153
        qStart-qEnd: 0-338
        segments: [Interval(start=106761814, end=106762153)]
        flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
        
        coverage (of query): None
        coverage (of subject): None
        alignment identity: 0.991150442478

>>> from Bio import SeqIO
>>> query_len_dict = dict((r.id, len(r.seq)) for r in SeqIO.parse(open('demo.quivered_hq.fastq'),'fastq'))
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True, \
query_len_dict=query_len_dict)
>>> r = reader.next()


        qID: i0HQ|c58287/f1p3/338
        sID: chr1
        cigar: 112M1D226M
        sStart-sEnd: 106761814-106762153
        qStart-qEnd: 0-338
        segments: [Interval(start=106761814, end=106762153)]
        flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
        
        coverage (of query): 1.0
        coverage (of subject): None
        alignment identity: 0.991150442478

>>> r.qCoverage

1.0

Comparing two isoforms

Test data: test1.gff and test2.gff

This is the underlying API used for the chaining sample script. The script compares two spliced isoforms r1 and r2 (must be of the same chromosome) and compares the splice junctions of the two and categorizes the match as:

  • exact --- there is a one-to-one match for each splice junction between r1 and r2
  • super --- r1 has more junctions than r2 but for those that are shared, they agree
  • subset --- r2 has more junctions than r1 but for those that are shared, they agree
  • partial --- r1 and r2 share some identical junctions but each have at least one additional junction that the other one doesn't
  • nomatch --- no junction match between r1 and r2

Note that compare_junctions.compare_junctions does not care about the differences in 5' start or the 3' end. It only cares about the splice junctions. Also, given the above categorization, it does not matter whether the transcript is on the plus or minus strand.

You can download the test1.gff and test2.gff and test the following.

from cupcake.io import GFF
from cupcake.tofu.compare_junctions import compare_junctions

reader1 = GFF.collapseGFFReader('test1.gff')
reader2 = GFF.collapseGFFReader('test2.gff')
recs1 = dict((r.seqid,r) for r in reader1)
recs2 = dict((r.seqid,r) for r in reader2)

r1 = recs1['PB.979.1']
r2 = recs2['PB.1055.1']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106213, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119058)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'exact'

Here, PB.979.1 from test1 and PB.1055.1 from test2 have the same junctions, so they are categorized as exact match.

We next compare PB.979.1 with PB.1055.2, where the former is a super of the latter because it has more exons but every shared exon agrees on the junctions.

r2 = recs2['PB.1055.2']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106219, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108164)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'super'

Finally we compare PB.979.2 with PB.1055.3. Here I have intentionally modified PB.1055.3 to differ from PB.979.2 on the first two (3'-end) exons. The junction on PB.979.2 is 57106336-57106569 whereas for PB.1055.3 it is 57106339-57106568. You can see that there is an extra option in compare_junctions where you can allow fuzzy matches in the junctions.

r1 = recs1['PB.979.2']
r2 = recs2['PB.1055.3']
r1.ref_exons
#[Out]# [Interval(57106221, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57119046, 57119093)]
r2.ref_exons
#[Out]# [Interval(57106221, 57106339),
#[Out]#  Interval(57106568, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57119046, 57119077)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'partial'
compare_junctions.compare_junctions(r1, r2, internal_fuzzy_max_dist=5)
#[Out]# 'exact'