Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping kallisto equivalence classes to actual transcript sequences #421

Open
rubenchazarra opened this issue Dec 18, 2023 · 8 comments
Open

Comments

@rubenchazarra
Copy link

Hi there!
Thanks for developing this awesome tool! I am currently using kallisto-bustools to generate single-cell equivalence class counts.
I am wondering if there is a way of linking an equivalence class to a particular transcript sequence or region. For instance, eq class { t1, t2, t3} corresponds to exon 1 of t1, t2 and t3 which is chr1:200-300.
I assume this information must be present in the kallisto index, but I am not finding a straightforward way of extracting it. Any suggestion ?

@mschilli87
Copy link

Older version of kallisto had the --pseudobam flag which provided simliar information, but on a per read basis, but AFAIR it was removed due to the high maintainance burden. Maybe digging into the corresponding bits of the source code gives you some pointers on how to tackle your problem.

@Yenaled
Copy link
Collaborator

Yenaled commented Dec 18, 2023

Long-winded answer:

Correct, pseudobam was very hard to maintain, especially within the logic of the new de bruijn graph structure that was implemented in the latest release. As such, it has been removed (and I'm unsure if I'll have the bandwidth to bring it back or implement an alternative -- though it remains on my radar). For looking at exact locations where a read maps to, just go with a standard aligner (since that's what they were built for) and use kallisto for getting you the actual quantifications (which is what kallisto was built for).

If you use kallisto with the --num option, the read numbers (zero-indexed) are stored in the BUS file. You can then view your BUS file (with bustools text) and get your equivalence classes of interest and the read numbers that are associated with them, and then you can run those reads through a standard aligner.

If you're interested in looking at what exon a read maps to, you could index exons rather than individual transcripts (but then this will no longer be splice-aware and you'd discard all reads that cross any exon-exon splice junction).

Anyway, just some ideas :)

@rubenchazarra
Copy link
Author

Thanks for your reply!
More than knowing where does a particular read map to, I would like to know which particular genomic sequence an equivalence class is representing.
As far as I know, kallisto is based on compatibility of k-mers of reads from the query with k-mers of transcript sequences from the reference . I was wondering if there is a way of knowing which particular sequence in the reference is an equivalence class representing.

@Yenaled
Copy link
Collaborator

Yenaled commented Dec 18, 2023

Ah yes, the equivalence class is basically a collection of transcripts.

So if an equivalence class consists of transcripts A, B, and C — it means that the k-mers in your read are compatible with those three transcripts.

In ENSEMBL, the transcript names usually start with ENST… (for human), and you can get their sequences through the Ensembl website or by simply looking at the transcript sequences in the FASTA file that you ran “kallisto index” on.

In your kallisto output folder, there’s a .ec file — the first column is some arbitrary equivalence class ID and the second column represents the collection of transcript IDs associated with that equivalence class (the numerical transcript IDs are zero-indexed and correspond to the line numbers in the transcripts.txt file in your output folder).

@rubenchazarra
Copy link
Author

Yes, it is easy to know which transcripts are present in each equivalence class. Nonetheless, my intention is to link each equivalence class, to the actual transcript sequence (or genomic coordinates) it represents.
For instance, considering a gene with 3 transcripts: t1, t2, t3; which generate a total of 5 equivalence classes: { t1 }, { t1, t2 }, { t1, t2, t3 }, { t2, t3 }. How can I know which genomic coordinates is equivalence class { t2, t3 } representing ?
My goal is to know to which region of the transcripts is each equivalence class referring to
Thanks again!

@mschilli87
Copy link

One option might be to use the corresponding GTF file, convert it to BED format and then use bedtools to intersect the transcripts that belong to each equivalence class. This ways, you show end up with genomic regions that are in common to all transcripts in each equivalence class. If you prefer working with R over bedtools, you could implement the same idea using GRanges(List) objects.

@rubenchazarra
Copy link
Author

This is actually a great idea.
I was thinking there may be an easy way of extracting this info from the kallisto index, but maybe there isn't.
Thanks for the suggestion !

@pmelsted
Copy link
Contributor

In kallisto 0.48.0 there was a way to generate the GFA file of the index using the inspect command. There the transcript id was written for each unitig, so you could easily get the unitig sequence and blat it. This is useful for debugging a handful of equivalence classes but not suitable for building into a pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants