Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UMI count per gene #408

Open
tsofiya opened this issue Oct 26, 2023 · 8 comments
Open

UMI count per gene #408

tsofiya opened this issue Oct 26, 2023 · 8 comments

Comments

@tsofiya
Copy link

tsofiya commented Oct 26, 2023

Hi,
I have toy sample with only one barcode, and I used kallisto bus to create gene matrix.
I want to create a plot of reads vs UMIs for each gene.
The out put is, obviously, already collapsed and so I basically get the number of UMIs I had. How can I get the number of reads?
Thank you,
Tsofiya

@Yenaled
Copy link
Collaborator

Yenaled commented Oct 26, 2023

bustools count, by default, collapses the UMIs.
If you want to ignore the UMIs when counting reads, use bustools count with the --cm option.

@tsofiya
Copy link
Author

tsofiya commented Oct 26, 2023

Thank you.
It looks quite good, though sometimes I get more UMIs then reads which is weird.
I use:
bustools count -o no_collapsing/cells_x_genes -e matrix.ec -t transcripts.txt --genecounts output.unfiltered.bus -g kallisto/busIndex/kalisto_t2g.txt -cm
with and without the -cm and compare

@Yenaled
Copy link
Collaborator

Yenaled commented Oct 26, 2023

That can definitely happen.

Let's say you have 3 reads with the exact same UMI.

  • Read 1 maps to genes A, B
  • Read 2 maps to gene A, B
  • Read 3 maps to gene B, C

When doing UMI collapsing, you'll get that UMI counted because it will all be collapsed to gene B. However, when you do --cm, none of those reads will be counted because they all map to multiple genes. By default, things that map to multiple genes are always discarded.

@MengjunWu
Copy link

Hi,

Following this question, I want to ask how do you collapse UMI to generate cell x transcript equivalence class count table if pseudoaligned reads to transcripts? Do you still collapse umi on the gene level first and then count UMI in individual transcript, or you collapse UMI on each transcript independently.

Many thanks,
Mengjun

@Yenaled
Copy link
Collaborator

Yenaled commented Nov 14, 2023

UMIs are always collapsed at gene-level regardless. The final "collapsed" UMI should belong to a single gene (and the equivalence class would contain multiple transcripts associated with that gene).

For example, if you have:
UMI sequence ATCG: tx 0, tx 1
UMI sequence ATCG: tx 1, tx 2

If tx 0 and tx1 belong to gene A while tx 2 belong to gene B, the equivalence class of the collapse UMI sequence will simply have one item in it: tx 1.

If you have:
UMI sequence ATCG: tx 0, tx 1
UMI sequence ATCG: tx 2

Then we don't count that UMI because it maps to two different genes (i.e. the {tx0,tx1,tx2} equivalence class is NOT counted).

@MengjunWu
Copy link

MengjunWu commented Nov 15, 2023

Many thanks!
To confirm if I understood correctly:

In your first example: the two UMI sequences are two different reads, i.e.
UMI sequence ATCG: tx 0, tx 1 (read1)
UMI sequence ATCG: tx 1, tx 2 (read2)

While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e.
UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 1)

If in the second example the two UMI sequences are from different reads e.g.
UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 2)
After collapsing, both UMI should be kept: read1 UMI will have {tx1, tx0} associated with geneA while read2 UMI will have tx2 associated with geneB.

Is this correct? Thanks!

@mschilli87
Copy link

@MengjunWu:

While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e.
UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 1)

I don't thinks that necessarily true: Even in the case of

UMI sequence ATCG: tx 0, tx 1 (read 1)
UMI sequence ATCG: tx 2 (read 2)

the data would still suggest that there was a molecule (identified by the UMI), that on the one hand contains a subsequence that's compatible with gene A only, but on the other hand also features a subsequence that is exclusively found in gene B. Excluding fusion transcripts, this is incompatible with either gene resulting in both reads getting dropped

@Yenaled: Please correct me if I am wrong.

@MengjunWu
Copy link

@mschilli87 Thanks a lot for the alternative scenario! Got it and I think it makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants