
Which kind of normalization for 10X data is preferred for kBET? #79

Open
Smilenone opened this issue Mar 30, 2024 · 3 comments

Comments

@Smilenone

Smilenone commented Mar 30, 2024

Thanks for such a good tool for assessing single-cell RNA-seq batch correction. I have 180k cells across 20 patients and I would like to analyze whether there is a batch effect. I wonder which kind of data is preferred for kBET: raw counts with all genes, log(CPM+1) data with selected highly variable genes, or z-score normalized log(CPM+1) data? Do I have to select highly variable genes, or can I use PCs as input?

@mbuttner
Collaborator

mbuttner commented Apr 1, 2024

Hi @Smilenone
thank you for your appreciation!
On the choice of data normalization as input: when it comes to assessing a batch effect, you essentially want to answer two questions. First, is there a batch effect to begin with (i.e. do we need to correct for it? For patient data, the answer is often yes)? Second, which tool is best suited for the batch correction? I would start with normalized data (log(CPM+1) or scran), because those normalizations have worked well in most cases.
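For reference, here is a minimal base-R sketch of log(CPM+1) normalization; the variable name `counts` and the genes-by-cells layout are illustrative assumptions, not anything kBET prescribes:

```r
# counts: raw UMI count matrix, genes in rows, cells in columns (assumed layout)
lib.size <- colSums(counts)                  # total counts per cell
cpm      <- t(t(counts) / lib.size) * 1e6    # counts per million
logcpm   <- log1p(cpm)                       # log(CPM + 1)
```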
On the technicalities of kBET: before you start, I suggest downsampling your data, because kBET computes the k-nearest-neighbour graph on a dense matrix by default, which does not scale to large data. There are a few tricks that speed up the kBET computation, especially if you do not want to downsample. The slowest step is the k-nearest-neighbour search, and there are certainly more efficient algorithms around than the one used in the kBET package. In that case, I would proceed as follows (see the sketch after this list):

1. Normalize the data (e.g. log(CPM+1)).
2. Select highly variable genes.
3. Compute a PCA (with or without z-score scaling; I usually do not scale).
4. Compute a k-nearest-neighbour graph on the PCA space.
5. Pass this graph alongside the data matrix to kBET, turning off kBET's internal PCA and k-nearest-neighbour steps, and fix the number of nearest neighbours to roughly (number of batches) × 5, so that a sufficient number of cells is expected per batch.
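A sketch of that recipe in R, under stated assumptions: the HVG count (2000), the number of PCs (50), and the kBET argument names `k0`, `knn`, and `do.pca` are my assumptions about the package interface; please verify against `?kBET` before use.

```r
library(FNN)   # fast k-nearest-neighbour search
library(kBET)

# logcpm: genes-by-cells log(CPM+1) matrix (see snippet above)
# batch:  factor of batch labels, one per cell

# 1. + 2. select, say, the 2000 most variable genes
gene.var <- apply(logcpm, 1, var)
hvg      <- order(gene.var, decreasing = TRUE)[1:2000]

# 3. PCA on cells (rows: cells), without z-score scaling
pca <- prcomp(t(logcpm[hvg, ]), center = TRUE, scale. = FALSE)
emb <- pca$x[, 1:50]   # first 50 PCs

# 4. k-nearest-neighbour graph; roughly 5 neighbours per batch
k0  <- 5 * nlevels(batch)
knn <- get.knn(emb, k = k0)

# 5. run kBET on the embedding with precomputed neighbours,
#    skipping kBET's internal PCA and kNN steps
batch.estimate <- kBET(emb, batch, k0 = k0, knn = knn, do.pca = FALSE)
```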

I hope that helps! Please let me know if you have further questions.

@Smilenone
Author

Thanks for your detailed response! I have one more question: should I use the average.pval to evaluate whether there is a batch effect in my data? Does average.pval < 0.05 mean that there is a batch effect? Am I right?

@mbuttner
Collaborator

In general, please be aware that kBET is probably the most sensitive tool when it comes to batch effects: we have seen the p-value become extremely low even for very small batch effects that do not bias the data much. The average rejection rate is therefore the most telling metric; use the p-value comparison between null and actual data as a sanity check.
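As a pointer, here is a hedged sketch of how one might read those quantities off the kBET result object; the field names `summary` and `average.pval` follow my reading of the kBET documentation, so inspect the actual structure first:

```r
# batch.estimate: the list returned by kBET() in the sketch above
str(batch.estimate)          # inspect the actual fields of the result

# field names assumed from the documentation:
batch.estimate$summary       # expected vs. observed rejection-rate statistics
batch.estimate$average.pval  # averaged p-value: use as a sanity check only
```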
