
Input preprocessing and normalization #20

Open
pavsol opened this issue Aug 24, 2022 · 3 comments

@pavsol

pavsol commented Aug 24, 2022

Hello,
I have a question regarding the normalization of the input expression matrix. Is there some recommendation on how to preprocess or normalize the data? In the tutorials, I noticed log and sqrt scaling but without any further comment. Then, is it recommended to use scv.pp.normalize_per_cell(adata) or filter the genes with sc.pp.highly_variable_genes(adata) for example? I would appreciate more information on this if possible.

More precisely, I am working with Seurat-integrated data and use the integrated assay as input. The results seem to make sense so far; however, I noticed differences depending on whether I use the "raw" integrated data or data scaled with sc.pp.scale(adata), as recommended in the Seurat integration vignette. It would also be nice to clarify whether it makes sense to use integrated data at all, and if so, how.

Thank you,
Pavel

@ShobiStassen
Owner

ShobiStassen commented Aug 25, 2022

hi Pavel,

Thanks for your message. Can I double-check what type of single-cell data you are analysing? scRNA-seq?
There is a lot of variation in preferred pre-processing. Most papers analysing a new dataset pick a set of pre-processing steps that they can rationalize and that give sensible results. Even between scanpy and Seurat, some pre-defined pre-processing functions or recommendations apply log-scaling, especially when selecting a subset of Highly Variable Genes (I would say the sqrt scaling is intended to have a similar effect to log scaling in terms of reducing the skewness of the data), while others go straight to PCA without any transformation. So it depends on what kind of results you get with and without a log transform.
As to whether you should run sc.pp.scale before PCA: as far as I know, sc.tl.pca() automatically zero-centers the data, since this is required for PCA to work properly. But you should use pp.scale if you want each gene's variance to contribute equally to the PCs; otherwise the genes with the biggest absolute variance will contribute most to the PCs, which may be what you prefer. Are you using a subset of Highly Variable Genes, or all genes passing some QC? See this discussion about the same question with respect to scaling, and this paper about pre-processing practices and options.
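To illustrate the skewness point (a toy numpy sketch with simulated lognormal "expression" values, not real data): both sqrt and log shrink the long right tail, with log doing so more aggressively.

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavily right-skewed values, e.g. lognormal "counts"
x = rng.lognormal(mean=2.0, sigma=1.0, size=10_000)

def skewness(v):
    # Sample skewness: third standardized moment
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

raw = skewness(x)            # strongly right-skewed
sqrtt = skewness(np.sqrt(x)) # partially reduced
logt = skewness(np.log1p(x)) # close to symmetric
```

So log and sqrt are doing the same kind of job, just to different degrees.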
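To make the scaling point concrete (a toy numpy sketch, not scanpy's actual PCA code): with two correlated genes where one has much larger variance, centering alone lets that gene dominate PC1, while scaling to unit variance balances the loadings.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
g0 = rng.normal(0.0, 10.0, n)                     # high-variance gene
g1 = 0.007 * g0 + rng.normal(0.0, 0.0714, n)      # correlated low-variance gene
X = np.column_stack([g0, g1])

def pc1(M):
    M = M - M.mean(axis=0)                        # zero-centering (what sc.tl.pca does)
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]                                  # loadings of the first PC

v_centered = pc1(X)                 # centered only: PC1 ~ aligned with g0
v_scaled = pc1(X / X.std(axis=0))   # unit variance per gene (like sc.pp.scale): balanced
```

With centering only, almost all of PC1's weight sits on the high-variance gene; after scaling, both genes contribute roughly equally.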

When you say integrated data, are you integrating multiple batches?

@pavsol
Author

pavsol commented Aug 25, 2022

Thank you for the quick response.
Yes, I am working with scRNA-seq data. I was not aware that sc.tl.pca() zero-centers the data internally. I do see some differences in the output when I use sc.pp.scale, but they are rather subtle and probably should not affect downstream analysis dramatically. I will try to explore the effect of scaling further.
Whether to use HVGs is exactly what I am asking about now. Both cases give me quite reasonable results, but there are some differences, and at this point it is difficult to say which is better.
Regarding integration, yes, I am integrating different batches. The resulting expression matrix is log-scaled and batch-corrected. Batch correction, however, produces negative values in a significant proportion of genes/cells, which cannot happen with simple log-scaling.
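To illustrate what I mean about the negative values (a toy numpy example; real batch-correction methods do more, but any mean-centering-like step has this effect):

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(5.0, size=(100, 20)).astype(float)

logged = np.log1p(counts)                 # plain log-scaling of counts: never negative
centered = logged - logged.mean(axis=0)   # centering (as in correction/scaling) introduces negatives
```

So the integrated matrix behaves more like scaled data than like log-transformed counts.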

@ShobiStassen
Owner

@pavsol hi Pavel, did you manage to get some reasonable analysis?
