UMI count per gene #408

tsofiya · 2023-10-26T10:32:04Z

Hi,
I have a toy sample with only one barcode, and I used kallisto bus to create a gene matrix.
I want to create a plot of reads vs. UMIs for each gene.
The output is, obviously, already collapsed, so I basically get the number of UMIs I had. How can I get the number of reads?
Thank you,
Tsofiya

Yenaled · 2023-10-26T10:34:21Z

bustools count, by default, collapses the UMIs.
If you want to ignore the UMIs when counting reads, use bustools count with the --cm option.

tsofiya · 2023-10-26T12:29:09Z

Thank you.
It looks quite good, though sometimes I get more UMIs than reads, which is weird.
I use:
bustools count -o no_collapsing/cells_x_genes -e matrix.ec -t transcripts.txt --genecounts output.unfiltered.bus -g kallisto/busIndex/kalisto_t2g.txt -cm
I run it with and without the -cm and compare the results.
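
For the comparison itself, here is a minimal sketch of how per-gene totals could be read from each run. It assumes that bustools count writes a Matrix Market file `<prefix>.mtx` with barcodes as rows and genes as columns, plus a `<prefix>.genes.txt` file listing genes in column order. The function name `per_gene_totals` and the `collapsed/` directory are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def per_gene_totals(mtx_path, genes_path):
    """Sum a Matrix Market coordinate file over all rows, per column.

    Assumption: rows are barcodes and columns are genes, with gene names
    listed in column order in genes_path.
    """
    with open(genes_path) as fh:
        genes = [line.strip() for line in fh if line.strip()]

    totals = defaultdict(float)
    with open(mtx_path) as fh:
        # Skip the "%%MatrixMarket" header and any "%" comment lines.
        lines = (ln for ln in fh if ln.strip() and not ln.startswith("%"))
        next(lines)  # size line: n_rows n_cols n_entries
        for ln in lines:
            _row, col, value = ln.split()
            totals[genes[int(col) - 1]] += float(value)  # columns are 1-based
    return dict(totals)


# Hypothetical usage (paths are assumptions, adjust to your own output):
# umis = per_gene_totals("collapsed/cells_x_genes.mtx",
#                        "collapsed/cells_x_genes.genes.txt")
# reads = per_gene_totals("no_collapsing/cells_x_genes.mtx",
#                         "no_collapsing/cells_x_genes.genes.txt")
# pairs = {g: (reads.get(g, 0.0), umis.get(g, 0.0))
#          for g in set(reads) | set(umis)}
```

The resulting `pairs` dictionary holds one (reads, UMIs) point per gene, which can be passed to any plotting library.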

Yenaled · 2023-10-26T13:42:09Z

That can definitely happen.

Let's say you have 3 reads with the exact same UMI.

Read 1 maps to genes A, B
Read 2 maps to genes A, B
Read 3 maps to genes B, C

With UMI collapsing, that UMI is counted because the three reads collapse to gene B, the only gene they all share. With --cm, however, none of those reads is counted because each of them maps to multiple genes. By default, reads that map to multiple genes are always discarded.
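
The three-read example can be reproduced with a small simulation. This is a simplified sketch for a single barcode, not bustools' actual implementation: it assumes that collapsing amounts to intersecting the gene sets of all reads sharing a UMI, and that a read or UMI is counted only when exactly one gene remains. The function names `count_umis` and `count_reads` are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Collapse reads by UMI (gene-set intersection), then count per gene."""
    genes_by_umi = {}
    for umi, genes in reads:
        if umi in genes_by_umi:
            genes_by_umi[umi] &= genes
        else:
            genes_by_umi[umi] = set(genes)

    counts = defaultdict(int)
    for genes in genes_by_umi.values():
        if len(genes) == 1:  # unambiguous after collapsing
            counts[next(iter(genes))] += 1
    return dict(counts)


def count_reads(reads):
    """Count every read on its own, ignoring UMIs (the --cm style)."""
    counts = defaultdict(int)
    for _umi, genes in reads:
        if len(genes) == 1:  # multi-gene reads are discarded
            counts[next(iter(genes))] += 1
    return dict(counts)


reads = [
    ("ATCG", {"A", "B"}),  # read 1
    ("ATCG", {"A", "B"}),  # read 2
    ("ATCG", {"B", "C"}),  # read 3
]
print(count_umis(reads))   # {'B': 1}
print(count_reads(reads))  # {}
```

Under these assumptions gene B ends up with one UMI but zero reads, which is the "more UMIs than reads" situation described above.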

MengjunWu · 2023-11-14T21:26:40Z

Hi,

Following up on this question: how are UMIs collapsed to generate the cell x transcript equivalence class count table when reads are pseudoaligned to transcripts? Do you still collapse UMIs at the gene level first and then count UMIs for individual transcripts, or do you collapse UMIs for each transcript independently?

Many thanks,
Mengjun

Yenaled · 2023-11-14T23:37:35Z

UMIs are always collapsed at the gene level, regardless. The final "collapsed" UMI should belong to a single gene (and the equivalence class would contain multiple transcripts associated with that gene).

For example, if you have:
UMI sequence ATCG: tx 0, tx 1
UMI sequence ATCG: tx 1, tx 2

If tx 0 and tx 1 belong to gene A while tx 2 belongs to gene B, the equivalence class of the collapsed UMI sequence will simply have one item in it: tx 1.

If you have:
UMI sequence ATCG: tx 0, tx 1
UMI sequence ATCG: tx 2

Then we don't count that UMI because it maps to two different genes (i.e. the {tx0,tx1,tx2} equivalence class is NOT counted).
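
Both examples can be checked with a short sketch. It is an illustration under stated assumptions rather than the bustools code: the gene sets of all records sharing a UMI are intersected, the UMI is kept only if exactly one gene remains, and the reported equivalence class is the intersection of the transcript sets. The function name `collapse_umi` and the `t2g` dictionary are hypothetical. The thread does not cover the case where the gene is unique but the transcript intersection is empty, so the sketch simply returns an empty set there.

```python
def collapse_umi(ecs, t2g):
    """Collapse the equivalence classes of records sharing one UMI.

    ecs: list of sets of transcript names (one set per record).
    t2g: dict mapping transcript name -> gene name.
    Returns the collapsed transcript set, or None if the UMI is not counted.
    """
    gene_sets = [{t2g[tx] for tx in ec} for ec in ecs]
    shared_genes = set.intersection(*gene_sets)
    if len(shared_genes) != 1:
        return None  # no single gene is compatible with every record
    return set.intersection(*ecs)


t2g = {"tx0": "A", "tx1": "A", "tx2": "B"}

# First example: both records are compatible with gene A.
print(collapse_umi([{"tx0", "tx1"}, {"tx1", "tx2"}], t2g))  # {'tx1'}

# Second example: gene A only vs. gene B only, so the UMI is dropped.
print(collapse_umi([{"tx0", "tx1"}, {"tx2"}], t2g))  # None
```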

MengjunWu · 2023-11-15T10:11:24Z

Many thanks!
To confirm if I understood correctly:

In your first example: the two UMI sequences are two different reads, i.e.
UMI sequence ATCG: tx 0, tx 1 (read1)
UMI sequence ATCG: tx 1, tx 2 (read2)

While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e.
UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 1)

If, in the second example, the two UMI sequences are from different reads, e.g.
UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 2)
then after collapsing, both UMIs should be kept: the read 1 UMI will have {tx1, tx0} associated with geneA, while the read 2 UMI will have tx2 associated with geneB.

Is this correct? Thanks!

mschilli87 · 2023-11-15T11:11:41Z

@MengjunWu:

> While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e.
> UMI sequence ATCG: tx 0, tx 1 (read 1)
> UMI sequence ATCG: tx 2 (read 1)

I don't think that's necessarily true: Even in the case of

UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 2)

the data would still suggest that there was a molecule (identified by the UMI) that, on the one hand, contains a subsequence compatible with gene A only but, on the other hand, also features a subsequence found exclusively in gene B. Excluding fusion transcripts, this is incompatible with either gene, resulting in both reads getting dropped.

@Yenaled: Please correct me if I am wrong.

MengjunWu · 2023-11-15T12:55:24Z

@mschilli87 Thanks a lot for the alternative scenario! Got it and I think it makes sense.
