UMI count per gene #408
Comments
bustools count, by default, collapses the UMIs.
Thank you.
That can definitely happen. Let's say you have 3 reads with the exact same UMI.
When doing UMI collapsing, you'll get that UMI counted because all of the reads will be collapsed to gene B. However, when you use --cm, none of those reads will be counted because each of them maps to multiple genes. By default, reads that map to multiple genes are always discarded.
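To make the difference concrete, here is a minimal Python sketch of the two counting modes described above. It is a simplified model, not bustools' actual implementation: the function names and the toy reads (three reads sharing one UMI, each compatible with gene B plus one other gene) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def count_umis_collapsed(reads):
    """Collapsed mode: one count per UMI whose reads agree on exactly one gene.

    `reads` is a list of (umi, set_of_compatible_genes) pairs (hypothetical format).
    """
    genes_by_umi = defaultdict(list)
    for umi, genes in reads:
        genes_by_umi[umi].append(set(genes))

    counts = defaultdict(int)
    for umi, gene_sets in genes_by_umi.items():
        # Keep only the genes compatible with every read carrying this UMI.
        shared = set.intersection(*gene_sets)
        if len(shared) == 1:
            counts[next(iter(shared))] += 1
    return dict(counts)


def count_reads_uncollapsed(reads):
    """Uncollapsed mode (in the spirit of --cm): each read is judged on its own,
    and a read compatible with more than one gene is discarded."""
    counts = defaultdict(int)
    for _umi, genes in reads:
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
    return dict(counts)


# Hypothetical toy data: three reads with the same UMI.
toy_reads = [
    ("ATCG", {"A", "B"}),
    ("ATCG", {"B", "C"}),
    ("ATCG", {"B", "D"}),
]

collapsed = count_umis_collapsed(toy_reads)       # gene B gets one UMI
uncollapsed = count_reads_uncollapsed(toy_reads)  # every read is multi-gene, so nothing is counted
print(collapsed, uncollapsed)
```

In this toy case the collapsed count for gene B is higher than the read count, which is the situation described in the comment above.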
Hi, following up on this question: how do you collapse UMIs to generate a cell x transcript equivalence class count table when reads are pseudoaligned to transcripts? Do you still collapse UMIs at the gene level first and then count UMIs for individual transcripts, or do you collapse UMIs for each transcript independently? Many thanks,
UMIs are always collapsed at the gene level regardless. The final "collapsed" UMI should belong to a single gene (and the equivalence class would contain multiple transcripts associated with that gene). For example, if you have: If tx 0 and tx 1 belong to gene A while tx 2 belongs to gene B, the equivalence class of the collapsed UMI sequence will simply have one item in it: tx 1. If you have: Then we don't count that UMI because it maps to two different genes (i.e. the {tx0,tx1,tx2} equivalence class is NOT counted).
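The rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not bustools code: `collapse_umi` and `TX_TO_GENE` are hypothetical names, the input equivalence classes are invented for the example, and the sketch assumes that records sharing a UMI are combined by intersecting their transcript sets before the single-gene check.

```python
from functools import reduce

# Hypothetical transcript-to-gene map matching the comment above:
# tx0 and tx1 belong to gene A, tx2 belongs to gene B.
TX_TO_GENE = {"tx0": "A", "tx1": "A", "tx2": "B"}


def collapse_umi(equivalence_classes, tx_to_gene=TX_TO_GENE):
    """Collapse all records that share one UMI (assumed model, see lead-in).

    `equivalence_classes` is a non-empty list of transcript sets, one per record.
    Returns (gene, collapsed_transcript_set) if the collapsed UMI belongs to
    exactly one gene, otherwise None (the UMI is not counted).
    """
    # Transcripts compatible with every record carrying this UMI.
    shared = reduce(set.intersection, (set(ec) for ec in equivalence_classes))
    genes = {tx_to_gene[tx] for tx in shared}
    if len(genes) != 1:
        return None  # no gene or several genes: discard the UMI
    return next(iter(genes)), shared


# Two records whose transcript sets overlap only in tx1: counted for gene A.
counted = collapse_umi([{"tx0", "tx1"}, {"tx1", "tx2"}])

# One record whose equivalence class spans both genes: not counted.
dropped = collapse_umi([{"tx0", "tx1", "tx2"}])
print(counted, dropped)
```

The gene-level check happens after the transcript sets are combined, so the surviving equivalence class only ever contains transcripts of a single gene.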
Many thanks! In your first example, the two UMI sequences are two different reads, i.e. While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e. If in the second example the two UMI sequences are from different reads, e.g. Is this correct? Thanks!
I don't think that's necessarily true: even in the case of UMI sequence ATCG: tx 0, tx 1 (read 1) the data would still suggest that there was a molecule (identified by the UMI) that on the one hand contains a subsequence that's compatible with gene A only, but on the other hand also features a subsequence that is found exclusively in gene B. Excluding fusion transcripts, this is incompatible with either gene, resulting in both reads getting dropped. @Yenaled: Please correct me if I am wrong.
@mschilli87 Thanks a lot for the alternative scenario! Got it and I think it makes sense.
Hi,
I have a toy sample with only one barcode, and I used kallisto bus to create a gene matrix.
I want to create a plot of reads vs. UMIs for each gene.
The output is, obviously, already collapsed, so I basically get the number of UMIs I had. How can I get the number of reads?
Thank you,
Tsofiya