Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to make alevin-fry --expect cells to be better at guessing the number of cells? #102

Open
ljudevitluka opened this issue Jan 31, 2023 Discussed in #101 · 3 comments

Comments

@ljudevitluka
Copy link

Discussed in #101

Originally posted by ljudevitluka January 31, 2023
Hi all,
I am working on the scRNAseq dataset generated with the 10x Genomics platform. We aimed to collect 10, 000 cells per experiment, and we mapped the reads to the transcriptome (since no genomic reference is available at the moment). Everything seems to be working more or less ok with alevin-fry, but in my samples, I always get the overestimation of the cell number (as per the graph bellow). I am aware we could use --force-cells, but would be more comfortable if the knee is detected by the --expect-cells flag. I tried calling --expect-cells 5000, 8000, and 10000 cells but the results were always overestimated.
My question is: Is there a way to make --expect-cells more sensitive? If I use --force-cells how will this affect my downstream analysis if some of the samples are carrying noise? If there is no automated way, is it better to take a bit of noise and hope it will get filtered out based on the further cell QC (mitoDNA genes content, ribosomal content etc.) or to be on the secure side and take a bit less cell with a risk of loss of the cells that have a lower number of genes expressed?
Thank you very much for your help and opinions :) Hopefully, this question is not out of the scope of discussion.
All the best, Luka
image

@rob-p
Copy link
Contributor

rob-p commented Feb 1, 2023

Hi @ljudevitluka,

Thanks for the question! What is your intended workflow? I think there may be a slight disconnect between the intended usage of expect-cells and your understanding.

In general there are several ways to generate a permit list with alevin-fry:

  1. The knee method
  2. Expect cells
  3. Force cells
  4. Unfiltered permit list
  5. Valid bc permit list

As you suggest, force cells forces a specific number of cells to be quantified (the top N if you pass the argument N, as long as there are at least N distinct barcodes in the input).

The purpose of expect cells is to provide a liberal quantification for a number of cells around the number you specify. The way it actually achieves this is motivated by what Cell Ranger does — taking the frequency distribution up to the number you specify, looking at the 99th percentile, and then including anything that has up to 1/10th that number of barcodes. As you suggest, the idea here is that it is better to quantify more cells, under the assumption that they can later be filtered out if they are of low or dubious quality once you are doing your analysis.

Now, I'd expected the knee method to be most inline with what you seem to suggest from the plot. The purpose of the knee method is to find the cutoff at the knee of the rank plot. This method doesn't take an argument, and build the plot and attempts to find the knee itself. This method usually works well, however it tends to be quite conservative in the number of cells it calls — i.e. it errs on the side of excluding cells from quantification rather than including them.

However, since you are using 10x Chromium technology, what I'd actually recommend as a default pipeline is to use --unfiltered-pl. This allows you to pass in the unfiltered permit list (e.g. the 10x v2 or 10x v3 "whitelist") to alevin-fry. It will then quantify all cells above a nominal count threshold (default 10 reads). This will result in a count matrix with many cells, but it can then be intelligently filtered in downstream analysis using an intelligent algorithm such as DropletUtils::emptyDrops. In addition to filtering by e.g. mitochondrial gene content, this will give your downstream analysis pipeline the most information to distinguish between "high-quality" and "low-quality" cells. That label is highly correlated with read/UMI count, but it's not one-to-one.

Let me know if the above makes sense, or if you have any other questions. Also, looping in @DongzeHE as he may have thoughts as well.

Best,
Rob

@DongzeHE
Copy link
Contributor

DongzeHE commented Feb 1, 2023

Hi all,

I totally agree with what @rob-p said. One thing to add: If you have an expected number of cells in your mind, you can also try to run the DropletUtils::emptyDropsCellRanger by setting n.expected.cells = 10000. This function mimics the behavior of CellRanger's cell-calling strategy. Usually, this function will return more cells than DropletUtils::emptyDrops.

Best,
Dongze

@ljudevitluka
Copy link
Author

Dear @rob-p @DongzeHE,

thank you very much for your help and explanations!
At first, I actually tested all methods 1-4, to find the "best" way to get a list of valid cells out of my data. In the end, choosing --force-cells with setting <20 000> seemed reasonable, as it is more than what we placed in the library prep and less than what both --expect-cells and --knee-distance gave ~70000 calls.

Based on your reply, a good approach would be to use --unfiltered-pl with the 10x v3 "whitelist" settings and proceed with DropletUtils to generate a list of high-quality cells.

We are comparing 4 states: healthy (control) vs. infected vs. vaccinated+infected vs. "placebo"-vaccinated+infected; all conditions in duplicates.
My intended workflow is to Quality filtering per sample > Generate a high-quality cell list per sample> Merge high-quality cells from all samples together > Normalise > Cluster > Identify specific/enriched cell types and if possible do differential expression analysis. Hopefully, this makes sense.

All of the analyses are very new to me, so yes I think my understanding was probably wrong. I was thinking that after mapping the list of cells that are generated with alevin-fry is almost complete, and afterward only mtDNA content and rRNA content are the values used for filtering.

Best,

Luka

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants