
Input preprocessing and normalization #20

Open
pavsol opened this issue Aug 24, 2022 · 3 comments

@pavsol

pavsol commented Aug 24, 2022

Hello,
I have a question regarding the normalization of the input expression matrix. Is there some recommendation on how to preprocess or normalize the data? In the tutorials, I noticed log and sqrt scaling but without any further comment. Then, is it recommended to use scv.pp.normalize_per_cell(adata) or filter the genes with sc.pp.highly_variable_genes(adata) for example? I would appreciate more information on this if possible.

More precisely, I am working with Seurat-integrated data and use the integrated assay as input. The results seem to make sense so far; however, I noticed differences depending on whether I use the "raw" integrated data or data scaled with sc.pp.scale(adata), as recommended in the Seurat integration vignette. It would also be nice to clarify whether it makes sense to use integrated data at all, and if so, how.

Thank you,
Pavel

@ShobiStassen
Owner

ShobiStassen commented Aug 25, 2022

hi Pavel,

Thanks for your message. Can I double-check what type of single-cell data you are analysing? scRNA-seq?
There is a lot of variation in preferred pre-processing. Most papers analysing a new dataset pick a set of pre-processing steps that they can rationalize and that give sensible results. Even between scanpy and Seurat, some pre-defined pre-processing functions or recommendations apply log-scaling, especially when selecting a subset of Highly Variable Genes (I would say the sqrt scaling is intended to have a similar effect to log scaling in terms of reducing the skewness of the data), while others go straight to PCA without any transformation. So it depends on what kind of results you get with and without a log transform.
As to whether you should run sc.pp.scale before PCA: as far as I know, sc.tl.pca() automatically zero-centers the data, since this is required for PCA to work properly. But you should use pp.scale if you want each gene's variance to contribute equally to the PCs; otherwise the genes with the biggest absolute variance will contribute most to the PCs, which may be what you prefer. Are you using a subset of Highly Variable Genes, or all genes passing some QC? See this discussion about the same question with respect to scaling, and this paper about pre-processing practices and options.
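To illustrate the skewness point (a toy numpy sketch with simulated lognormal "expression" values, not real data): both sqrt and log shrink the long right tail, with log doing so more aggressively.

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavily right-skewed values, e.g. lognormal "counts"
x = rng.lognormal(mean=2.0, sigma=1.0, size=10_000)

def skewness(v):
    # Sample skewness: third standardized moment
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

raw = skewness(x)            # strongly right-skewed
sqrtt = skewness(np.sqrt(x)) # partially reduced
logt = skewness(np.log1p(x)) # close to symmetric
```

So log and sqrt are doing the same kind of job, just to different degrees.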
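To make the scaling point concrete (a toy numpy sketch, not scanpy's actual PCA code): with two correlated genes where one has much larger variance, centering alone lets that gene dominate PC1, while scaling to unit variance balances the loadings.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
g0 = rng.normal(0.0, 10.0, n)                     # high-variance gene
g1 = 0.007 * g0 + rng.normal(0.0, 0.0714, n)      # correlated low-variance gene
X = np.column_stack([g0, g1])

def pc1(M):
    M = M - M.mean(axis=0)                        # zero-centering (what sc.tl.pca does)
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]                                  # loadings of the first PC

v_centered = pc1(X)                 # centered only: PC1 ~ aligned with g0
v_scaled = pc1(X / X.std(axis=0))   # unit variance per gene (like sc.pp.scale): balanced
```

With centering only, almost all of PC1's weight sits on the high-variance gene; after scaling, both genes contribute roughly equally.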

When you say integrated data, are you integrating multiple batches?

@pavsol
Author

pavsol commented Aug 25, 2022

Thank you for the quick response.
Yes, I am working with scRNA-seq data. I was not aware that sc.tl.pca() zero-centers the data internally. I do see some differences in the output when I use sc.pp.scale, but they are rather subtle and probably should not affect downstream analysis dramatically. I will try to explore the effect of scaling further.
Whether to use HVGs is exactly what I am asking about now. Both cases give me quite reasonable results, but there are some differences, and at this point it is difficult to say which is better.
Regarding integration, yes, I am integrating different batches. The resulting expression matrix is log-scaled and batch-corrected. Batch correction, however, produces negative values in a significant proportion of genes/cells, which cannot happen with simple log-scaling.
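To illustrate what I mean about the negative values (a toy numpy example; real batch-correction methods do more, but any mean-centering-like step has this effect):

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(5.0, size=(100, 20)).astype(float)

logged = np.log1p(counts)                 # plain log-scaling of counts: never negative
centered = logged - logged.mean(axis=0)   # centering (as in correction/scaling) introduces negatives
```

So the integrated matrix behaves more like scaled data than like log-transformed counts.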

@ShobiStassen
Owner

@pavsol hi Pavel, did you manage to get some reasonable analysis?
