Mapping kallisto equivalence classes to actual transcript sequences #421

rubenchazarra · 2023-12-18T10:40:23Z

Hi there!
Thanks for developing this awesome tool! I am currently using kallisto-bustools to generate single-cell equivalence class counts.
I am wondering if there is a way of linking an equivalence class to a particular transcript sequence or region. For instance, eq class { t1, t2, t3} corresponds to exon 1 of t1, t2 and t3 which is chr1:200-300.
I assume this information must be present in the kallisto index, but I am not finding a straightforward way of extracting it. Any suggestion ?

mschilli87 · 2023-12-18T10:56:31Z

Older version of kallisto had the --pseudobam flag which provided simliar information, but on a per read basis, but AFAIR it was removed due to the high maintainance burden. Maybe digging into the corresponding bits of the source code gives you some pointers on how to tackle your problem.

Yenaled · 2023-12-18T11:16:43Z

Long-winded answer:

Correct, pseudobam was very hard to maintain, especially within the logic of the new de bruijn graph structure that was implemented in the latest release. As such, it has been removed (and I'm unsure if I'll have the bandwidth to bring it back or implement an alternative -- though it remains on my radar). For looking at exact locations where a read maps to, just go with a standard aligner (since that's what they were built for) and use kallisto for getting you the actual quantifications (which is what kallisto was built for).

If you use kallisto with the --num option, the read numbers (zero-indexed) are stored in the BUS file. You can then view your BUS file (with bustools text) and get your equivalence classes of interest and the read numbers that are associated with them, and then you can run those reads through a standard aligner.

If you're interested in looking at what exon a read maps to, you could index exons rather than individual transcripts (but then this will no longer be splice-aware and you'd discard all reads that cross any exon-exon splice junction).

Anyway, just some ideas :)

rubenchazarra · 2023-12-18T11:35:55Z

Thanks for your reply!
More than knowing where does a particular read map to, I would like to know which particular genomic sequence an equivalence class is representing.
As far as I know, kallisto is based on compatibility of k-mers of reads from the query with k-mers of transcript sequences from the reference . I was wondering if there is a way of knowing which particular sequence in the reference is an equivalence class representing.

Yenaled · 2023-12-18T11:43:55Z

Ah yes, the equivalence class is basically a collection of transcripts.

So if an equivalence class consists of transcripts A, B, and C — it means that the k-mers in your read are compatible with those three transcripts.

In ENSEMBL, the transcript names usually start with ENST… (for human), and you can get their sequences through the Ensembl website or by simply looking at the transcript sequences in the FASTA file that you ran “kallisto index” on.

In your kallisto output folder, there’s a .ec file — the first column is some arbitrary equivalence class ID and the second column represents the collection of transcript IDs associated with that equivalence class (the numerical transcript IDs are zero-indexed and correspond to the line numbers in the transcripts.txt file in your output folder).

rubenchazarra · 2023-12-18T13:22:55Z

Yes, it is easy to know which transcripts are present in each equivalence class. Nonetheless, my intention is to link each equivalence class, to the actual transcript sequence (or genomic coordinates) it represents.
For instance, considering a gene with 3 transcripts: t1, t2, t3; which generate a total of 5 equivalence classes: { t1 }, { t1, t2 }, { t1, t2, t3 }, { t2, t3 }. How can I know which genomic coordinates is equivalence class { t2, t3 } representing ?
My goal is to know to which region of the transcripts is each equivalence class referring to
Thanks again!

mschilli87 · 2023-12-18T13:44:39Z

One option might be to use the corresponding GTF file, convert it to BED format and then use bedtools to intersect the transcripts that belong to each equivalence class. This ways, you show end up with genomic regions that are in common to all transcripts in each equivalence class. If you prefer working with R over bedtools, you could implement the same idea using GRanges(List) objects.

rubenchazarra · 2023-12-18T14:02:42Z

This is actually a great idea.
I was thinking there may be an easy way of extracting this info from the kallisto index, but maybe there isn't.
Thanks for the suggestion !

pmelsted · 2023-12-18T14:14:03Z

In kallisto 0.48.0 there was a way to generate the GFA file of the index using the inspect command. There the transcript id was written for each unitig, so you could easily get the unitig sequence and blat it. This is useful for debugging a handful of equivalence classes but not suitable for building into a pipeline

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping kallisto equivalence classes to actual transcript sequences #421

Mapping kallisto equivalence classes to actual transcript sequences #421

rubenchazarra commented Dec 18, 2023

mschilli87 commented Dec 18, 2023

Yenaled commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

Yenaled commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

mschilli87 commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

pmelsted commented Dec 18, 2023

Mapping kallisto equivalence classes to actual transcript sequences #421

Mapping kallisto equivalence classes to actual transcript sequences #421

Comments

rubenchazarra commented Dec 18, 2023

mschilli87 commented Dec 18, 2023

Yenaled commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

Yenaled commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

mschilli87 commented Dec 18, 2023

rubenchazarra commented Dec 18, 2023

pmelsted commented Dec 18, 2023