Handling multiple samples: Creating a master transcriptome

Last Updated: 12/21/2022

How to handle multiple samples

What to do if you have multiple Iso-Seq runs? How do you connect the novel isoforms from one sample to another?

There are a few existing ways to do it and they all have their pros and cons.

Tool	Input	Output	Accepts Reference?	Designed for Long Reads?
gffcompare	GTF	one combined GTF	Yes, optional	No
Cupcake chaining	GTF, Counts, (FASTA)	GTF, Counts, (FASTA)	No	Yes
TAMA Merge	BED	BED	Yes	Yes
TALON	MD-tagged SAM	CSV, GTF	Yes, required	Yes

Solution 1: Use Cupcake Chaining

Cupcake chaining is very simple. For each sample, you provide a collapsed GFF3/GTF, a FL count file, and optionally a reference FASTA.

Cupcake does this chaining iteratively, so as the number of samples increase, issues with matching slightly difference isoforms become increasingly difficult. You may begin to see redundant isoforms (ex: there is no 1-to-1 mapping between isoforms across samples, rather you see an isoform from sample 1 matching to two isoforms in sample 2, etc).

TL;DR I recommend Cupcake chaining for small-scale (ex: 2-4) sample merging.

Solution 2: Use TAMA Merge

TAMA Merge is part of the TAMA package that is based on BED file format. It handles merging of samples more gracefully than Cupcake, but it only supports BED formats and does not appear to accept count information.

TL;DR I recommend TAMA Merge for small-scale (ex: 2-4) sample merging.

Solution 3: Using TALON

TALON was developed to handle the ENCODE4 consortium Iso-Seq samples and is database-driven. You start with a foundational DB (ex: Gencode) and incrementally add samples using a SQL database.

TL;DR I recommend TALON for moderate-to-large (ex: >4) sample merging where a reference transcriptome readily exists.

Tutorial Using TALON, a database-orientated way to chain samples

The TALON documentation on their GitHub is excellent, and below I'm simply providing an example of how to use it.

Note that in the TALON tutorial below, I'm using it slightly differently from how the TALON developers devised it (and hence, there are some changes). For example, I'm not giving TALON a SAM file where the alignments are aligned and cleaned full-length reads, which means these are very large SAM files with millions of alignments many of which come from the same isoforms.

Instead, I'm using TALON following the isoseq3 --> Cupcake collapse --> SQANTI3 classification and filtering pathway. So my input to TALON are unique transcripts that have gone through filtering and each transcript would have an associated FL count already. This means TALON will run pretty fast since the input file is much smaller, and I'm also going to skip the TALON filtering step.

(i) initializing a database

For this example, I'm using Gencode v36 as the initial database.

talon_initialize_database --f gencode.v36.annotation.gtf \
     --g hg38 --a gencode36 --o myTest

(ii) making MD-tagged SAM files for each sample

TALON will only accept MD-tagged SAM files. Let's say if you already ran Cupcake collapse or maybe already went through SQANTI3 filtering. For each of the samples you want to "chain/combine", it's most likely you have either a GFF3 file or a FASTA file.

If you have a GFF3 file, you can re-constitute a FASTA file from the genome using say getfasta or gffread.

NOTE: make sure your FASTA sequence IDs are PB.X.Y without additional description or text, if you want to be able to link it cleanly with other reports (ex: from SQANTI3). For example, your fasta file may have IDs that look like:

PB.1.1|chr16:23548331-23548927(-)|RPFWgene

which has extra text starting with the | symbol. One way to do clean it up is to run the following sed command to your fasta file:

sed -i 's/|.*//g' 5merge.collapsed.rep.fa

Now we align the FASTA file to the ref genome and make a MD-tagged SAM file. Note that we're using the splice:hq preset for alignment here as it is more sensitive to small exons for HiFi reads alignment.

minimap2 -ax splice:hq -uf --secondary=no -C5 hg38.fa -t30 \
    5merge.collapsed.rep.fa > 5merge.collapsed.rep.fa.hg38.sam

samtools calmd 5merge.collapsed.rep.fa.hg38.sam \
    hg38.fa \
    --output-fmt sam > 5merge.collapsed.rep.fa.hg38.MDtagged.sam

(iii) add new sample(s) to TALON database

The config file for TALON is of the following format:

name, sample description, platform, sam file (full path)

So we can make a config.csv file like:

sample1,sample1,PacBio,sample1/5merge.collapsed.rep.fa.hg38.MDtagged.sam
sample2,sample2,PacBio,sample2/5merge.collapsed.rep.fa.hg38.MDtagged.sam
sample3,sample3,PacBio,sample3/5merge.collapsed.rep.fa.hg38.MDtagged.sam

Now we add the samples to the database:

talon --f config.csv \
      --db myTest.db \
      --build hg38 \
      --t 30 \
      --cov 0.95 \
      --identity 0.95 \
      --o output1

(iv) optional: using TALON to extract sample abundances & filtering

At this point, you can use TALON to extract sample abundances and filter library artifacts.

However, I usually would have already run through SQANTI3 filtering before I use TALON. TALON filtering and SQANTI3 filtering serve the same purpose, just depends on whether you want to add everything to the database first then filter, or filter first w SQANTI3 then add them to the database.

(v) Output combined sample results from TALON

We can now output the union of all observed samples from the TALON db.

talon_create_GTF --db=myTest.db \
                 --annot=gencode36 \
                 -b hg38 \
                 --observed \
                 --o=combined

Note that I'm using --observed, so anything that appears at least once in the samples in my config.csv (and whatever else is added later) will be output. If you have specific subset of samples you want to output, you can use the --datasets option instead.

The output will be combined_talon_observedOnly.gtf.

(vi) linking TALON combined transcriptome back to FL counts

Recall that we can't use TALON to generate counts because we have given it unique transcripts per sample.

To link it back to the FL counts, we can use the PB.X.Y IDs that are in the .abundance.txt files you should have gotten when you ran Cupcake post-collapse abundance tallying, where the file looks like this:

pbid    count_fl
PB.1.1  1
PB.2.1  1
PB.3.1  3
PB.3.2  1
PB.3.3  3
PB.3.4  1
PB.3.5  705

You can now associate this PB.X.Y IDs back with the PB.X.Y IDs in the TALON output TSV file. (Remember however, that that TSV file will contain the PBIDs from all samples, so if you are using R to do the joining, do this per-sample)

Here's an example of how to do that for one sample in R:

x<-read.table('output1_talon_read_annot.tsv',sep='\t',header=T)
x.sample1 <- subset(x, dataset==sample1)
count.info <- read.table('sample1/5merge.collapsed.abundance.txt',sep='\t',header=T)
x.sample1$pbid <- x.sample1$read_name
m <- merge(x.sample1, count.info, by='pbid')
write.table(m, 'merged.count.sample1.tsv', quote=F, row.names=F, sep='\t')

Provide feedback

Saved searches

Use saved searches to filter your results more quickly