Any interest in getting SCTransform up on Scanpy? #1643

atarashansky · 2021-02-10T20:24:32Z

Additional function parameters / changed functionality / changed defaults?
New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
New plotting function: A kind of plot you would like to seein sc.pl?
External tools: Do you know an existing package that should go into sc.external.*?
Other?

I recently ported SCTransform from R into python. Any interest in getting it onto Scanpy?

The original paper is here. It's a variance-stabilizing transformation that overcomes some key drawbacks of previous, similar methods (e.g. overfitting caused by building regression models from individual genes as opposed to groups of similar genes). It also eliminates the need for pseudocounts, log transformations, or library size normalization.

My code is here.

Implementation notes (from the SCTransformPy README):

Poisson regression is done using the statsmodels package and parallelized with multiprocessing.
Improved Sheather & Jones bandwidth calculation is implemented by the KDEpy package.
Estimating the negative binomial dispersion factor, theta, using MLE was translated from the theta.ml function in R.
Pearson residuals are automatically clipped to be non-negative. This ensures that sparsity structure can be preserved in the data. Practically, the results do not change much when allowing for dense, negative values.

Anecdotally, it produces very similar results to the R implementation, though the code itself is still a little rough around the edges. I also have to do more formal quantitative benchmarking to ensure results are similar to those of the original package.

I thought I'd gauge interest here prior to working on making it scanpy-ready.

The text was updated successfully, but these errors were encountered:

ivirshup · 2021-02-11T04:22:38Z

Definitely love to have more contributions! I believe @LuckyMD and @giovp are quite keen on having this in the library.

Excited to see your benchmarks!

Side note: I'd definitely recommend looking into using joblib instead of multiprocessing for parallelization. It's a bit more simple to use, is much better about not oversubscribing your resources, and copying less data.

atarashansky · 2021-02-11T04:25:12Z

Thanks for the suggestion, @ivirshup! I'll take a look into that.

Would this be something that goes into sce.externals?

ivirshup · 2021-02-11T04:29:54Z

I think there's a pretty strong desire on the team to see something like this in the main API, if you'd like to contribute this there.

atarashansky · 2021-02-11T04:31:56Z

Gotcha! I'll prioritize getting the benchmarks up and then I'll need some guidance on how to organize it to fit in scanpy's codebase. Thanks!

LuckyMD · 2021-02-11T07:57:42Z

agreed, this would be super cool!

giovp · 2021-02-11T08:17:46Z

totally agree! would be really cool to have!

fidelram · 2021-02-12T14:18:50Z

I agree that this should not belong to 'external' but to the main API.

Also, I would not be initially concerned about performance. Having the tool first is more important at the moment. We can later see how can be optimized.

atarashansky · 2021-02-12T16:32:26Z

I looked at a few genes and it looks like I'm getting pretty similar residuals compared to the R implementation. Something weird is going on with the genes in the last couple images so I'm currently trying to figure that before generating more thorough benchmarks. (I clip negative values to zero to preserve sparsity structure)

atarashansky · 2021-02-12T17:17:24Z

Think I fixed it.

Similarity of cluster assignments based on the R and python corrected expressions is 0.9 (sklearn.metrics.v_measure_score).

Correlations between the top corrected gene expressions for the top 3000 varying genes between python and R:

Although the correlations are all close to 1.0, if you look at the scatter plot of all non-zero gene expressions, there seems to be some systematic difference in the linear shifting/scaling. Is this worth figuring out?

giovp · 2021-02-13T17:48:42Z

Hi @atarashansky , looks really good!
Have couple of questions and comments:

Is it the dataset in the original publication? Could you try it in a couple of other datasets (like the 3k ?). If you can make a notebook on how to run your implementation I could help with that.
I would also plot the slope and intercept estimate against mean exp, as in original pubblication.

Re parallelization, @ivirshup already mentioned joblib. For reference, I want to share here a framework developed by @michalk8 that uses joblib for parallelizing a function over a collection of elements.

It is already used in 2 scanpy sisters packages: https://github.com/theislab/cellrank and https://github.com/theislab/scvelo , as well as a third one that is coming out next week.

Implementation in theislab/scvelo
https://github.com/theislab/scvelo/blob/1659cc8e00a45fcf87cd80a7013aae5531744613/scvelo/tools/dynamical_model.py#L1172

Implementation in theislab/cellrank
https://github.com/theislab/cellrank/blob/master/cellrank/ul/_parallelize.py

I feel this settings is very similar to the one above, so maybe useful to check it out.

Let me know what you think !

atarashansky · 2021-02-13T20:08:37Z

Here are the fit parameters vs the geometric mean expression applied to the 33k pbmc dataset. Looks pretty similar!

Thanks for pointing me towards the joblib parallelization. I’ll work on applying it here.
Notebook incoming! Stay tuned.

atarashansky · 2021-02-13T20:13:28Z

Here’s the notebook.

giovp · 2021-02-16T08:36:22Z

@atarashansky thanks a lot, looks really good indeed!
I'll have time to have a look on Friday and will get back to you then! and thanks for the notebooks!

giovp · 2021-02-22T17:41:14Z

@atarashansky sorry for getting back to you late! I played around with the implementation, thanks a lot for the notebooks!

few questions before opening the PR:

do you think it's worth it to allow users to add other variables (beside log_umi) ?
do you think it would be useful to add other models other than poisson?
there are other outputs provided by R implementation other than pearson residuals. Do you think it's worth to include them?
testing: how do you think it should be best tested? we thought about saving results from original implementation in R and test against those (as it's done for others seurat re-implementation like highly variable genes)

looking forward to hear what you think! thank you!

gokceneraslan · 2021-02-22T23:40:27Z

This looks super cool! One minor thing is the bandwidth adjustment. Ideally this shouldn't require manual tweaking. Is this the case?

When I was playing around with Nebulosa (https://github.com/powellgenomicslab/Nebulosa) to reimplement it in python, I noticed that both bandwidth algorithms (scott and silverman) in scipy's gaussian_kde are really horrible. Bandwidths estimated by the ks package in R (https://cran.r-project.org/web/packages/ks/ks.pdf, which is what Nebulosa uses) were always superior.

atarashansky · 2021-02-25T20:28:58Z

Sorry for the late reply, the notifications for this thread got sent to my spam folder.

@giovp

I think so! It’s not difficult to extend it to more latent variables. We could allow them to specify any column(s) in the obs DataFrame.
Hmm, I think statsmodels can do regression on lots of different models, but from the source paper it sounds like using Poisson was simplest/fastest and did not affect the results too much when compared to negative binomial regression. I think parameter estimation for other models might be a bit more involved.
I think that would be pretty straightforward. What outputs are you referring to, specifically?
I’ve been testing by computing correlations between the genes from the python and R implementations. You could also compare rank-ordering of cells by variance. Another approach might be to compare the output of downstream analysis methods (like clustering) to see if the results are similar, and compare to the output of unprocessed data as a negative control.

atarashansky · 2021-02-25T20:32:40Z

@gokceneraslan

SCTransform does automatic bandwidth selection using the scott algorithm, which I am implementing with gaussian_kde. Although, I could change this to use bandwidth selection using KDEpy’s implementation, since I use their ISJ bandwidth algorithm for another step (the bandwidth algorithms used at different steps is following the methods write-up in the original paper).

gokceneraslan · 2021-02-26T03:05:20Z

I don't know how "easy" the density estimation problem here is, but I would avoid scott's whenever I can (e.g. https://aakinshin.net/posts/kde-bw/).

giovp · 2021-03-04T10:52:45Z

hi @atarashansky ,
sorryf or late reply again.
I'm not sure about kde issue raised by @gokceneraslan , do you think it's worth to look into it? I think they use the same in original implementation though right?

For the rest, I think the two main points that could be addressed before PR are:

using same nomenclature for inferred params as original implementation (so not like _1 that is done now) and also returning same set of params. This would make it easier for users who are already familiar, and also for us for benchmarking.
allow the creation of a design matrix so that users could pass any covariate (I'd say batch is the most important and probably used in general).

with these two points addressed, I think it's good to start a PR, I'd be happy to review and also help in setting up tests/benchmarks (which like you mentioned, would be probably best to just test against the result of original implementation).

Let me know for anything else, exciting for this!

jlause · 2021-03-04T18:40:54Z

Hi @atarashansky and everyone following this interesting discussion!

I just found this issue after posting quite a related PR yesterday (#1715) that came out of a discussion from the end of last year (berenslab/umi-normalization#1), and wanted to leave a note about that relation here:

In my PR, I implement normalization by analytic Pearson residuals based on a NB offset model, which is an improved/simplified version of the scTransform model that does not need regularization by smoothing anymore... This brings some theoretical advantages and we found it works well in practice (details in this preprint with @dkobak ).

One of the differences remaining between the two is how the overdispersion theta is treated (scTransform: fitted per gene, analytical residuals: fixed to one theta for all genes based on negative controls). I think fixing theta like that makes a lot of sense, but also thought about adding a function that learns a global theta from the data. With some modifications that could be another fruitful use-case of your theta.ml python implementation @atarashansky.

Also, I'm curious about the clipping of the Pearson residuals to [0, sqrt(n/30)] in your method. We also find that clipping is an important step for obtaining sensible analytical residuals, and I recently though a bit about motivating different cutoffs - so I'd be interested to learn what is behind your choice of sqrt(n/30)!

Looking forward to you thoughts on this :)
Jan.

atarashansky · 2021-03-06T04:32:32Z

@jlause Interesting work! It would indeed be nice to avoid having to learn bandwidths altogether.

What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?

With regards to the clipping, I turned my brain off and copied the Seurat implementation as much as possible. sqrt(n/30) was the default parameter used by the SCTransform wrapper in Seurat. I also removed negative values to preserve sparsity structure of the data. Sorry I couldn't provide any insight about this!

dkobak · 2021-03-06T09:56:15Z

Oh, scTransform uses sqrt(n/30) by default? Interesting. If I remember correctly, the paper says they use sqrt(n)...

atarashansky · 2021-03-06T17:46:40Z

If I remember correctly, the SCTransform vst method uses sqrt(n) by default but the SCTransform wrapper in Seurat uses sqrt(n/30).

dkobak · 2021-03-06T21:34:38Z

What would be the procedure for learning global theta from the data? Would you just flatten the expression matrix into one vector?

Regarding this -- yes, that's what we thought to do. Do you think it makes sense? However, for computational efficiency, I would threshold the genes by expression and take e.g. 1000 or even 100 genes with the highest average expression (for the purpose of estimating theta). Lowly-expressed genes don't really constrain theta much, it's highly-expressed ones that do.

hurleyLi · 2021-06-02T23:23:48Z

Hi guys, I've been following this thread and it's been quiet recently :) wondering if there's any updates on incorporating ScTransform on Scanpy. Thanks!! 🙏🏼

Moloch0 · 2023-05-26T15:36:57Z

Found it after I wrote it.😄 Here is my code for those who want to try 'R' SCT.

import anndata2ri
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri
import numpy as np

anndata2ri.activate()
pandas2ri.activate()

def run_sctransform(adata, layer=None, **kwargs):
    if layer:
        mat = adata.layers[layer]
    else:
        mat = adata.X

    # Set names for the input matrix
    cell_names = adata.obs_names
    gene_names = adata.var_names
    r.assign('mat', mat.T)
    r.assign('cell_names', cell_names)
    r.assign('gene_names', gene_names)
    r('colnames(mat) <- cell_names')
    r('rownames(mat) <- gene_names')

    seurat = importr('Seurat')
    r('seurat_obj <- CreateSeuratObject(mat)')

    # Run
    for k, v in kwargs.items():
        r.assign(k, v)
    kwargs_str = ', '.join([f'{k}={k}' for k in kwargs.keys()])
    r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')

    # Extract the SCT data and add it as a new layer in the original anndata object
    sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@data')))
    adata.layers['SCT_data'] = sct_data.T
    sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@counts')))
    adata.layers['SCT_counts'] = sct_data.T
    return adata

adata.layers["data"] = adata.X.copy()

adata = run_sctransform(adata, layer="counts")

R[write to console]: Running SCTransform on assay: RNA
R[write to console]: Place corrected count matrix in counts slot
R[write to console]: Set default assay to SCT

adata
    layers: 'counts', 'data', 'SCT_data', 'SCT_counts'

Please use this code and your data with caution.

cpantea · 2023-06-16T08:19:27Z

Thank you everybody, and particularly @Moloch0, for contributing to this discussion. Your run_sctransform function is exactly what I needed for my analyses.

A little note in case anyone else might run into the same issue: my original AnnData contained a small set of genes/variables present in very few cells, which were filtered out during SCTransform normalisation. This prevented the SCT layers from being added to adata due to dimension mismatch.

To address this, I added a simple subsetting script to the function between normalisation and layer addition, as follows:

    #[...]
    r(f'seurat_obj <- SCTransform(seurat_obj,vst.flavor="v2", {kwargs_str})')

    # Prevent partial SCT output because of default min.genes messing up layer addition
    r('diffDash <- setdiff(rownames(seurat_obj), rownames(mat))')
    r('diffDash <- gsub("-", "_", diffDash)')
    r('diffScore <- setdiff(rownames(mat), rownames(seurat_obj))')
    filtout_genes = svconvert(r('setdiff(diffScore, diffDash)'))
    filtout_indicator = np.in1d(adata.var_names, filtout_genes)
    adata = adata[:, ~filtout_indicator]

    # Extract the SCT data and add it as a new layer in the original anndata object
    #[...]

Hope that comes in handy for anyone else facing this issue!

atarashansky added the Enhancement ✨ label Feb 10, 2021

giovp mentioned this issue Mar 15, 2021

Normalization and gene selection by analytical Pearson residuals #1715

Merged

Hrovatin mentioned this issue Oct 7, 2021

Improve gene expression visualisatio non UMAP #2013

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any interest in getting SCTransform up on Scanpy? #1643

Any interest in getting SCTransform up on Scanpy? #1643

atarashansky commented Feb 10, 2021

ivirshup commented Feb 11, 2021 •

edited

atarashansky commented Feb 11, 2021

ivirshup commented Feb 11, 2021

atarashansky commented Feb 11, 2021

LuckyMD commented Feb 11, 2021

giovp commented Feb 11, 2021

fidelram commented Feb 12, 2021

atarashansky commented Feb 12, 2021 •

edited

atarashansky commented Feb 12, 2021 •

edited

giovp commented Feb 13, 2021

atarashansky commented Feb 13, 2021

atarashansky commented Feb 13, 2021

giovp commented Feb 16, 2021

giovp commented Feb 22, 2021

gokceneraslan commented Feb 22, 2021 •

edited

atarashansky commented Feb 25, 2021

atarashansky commented Feb 25, 2021

gokceneraslan commented Feb 26, 2021

giovp commented Mar 4, 2021

jlause commented Mar 4, 2021

atarashansky commented Mar 6, 2021

dkobak commented Mar 6, 2021

atarashansky commented Mar 6, 2021

dkobak commented Mar 6, 2021

hurleyLi commented Jun 2, 2021

Moloch0 commented May 26, 2023

cpantea commented Jun 16, 2023

Any interest in getting SCTransform up on Scanpy? #1643

Any interest in getting SCTransform up on Scanpy? #1643

Comments

atarashansky commented Feb 10, 2021

ivirshup commented Feb 11, 2021 • edited

atarashansky commented Feb 11, 2021

ivirshup commented Feb 11, 2021

atarashansky commented Feb 11, 2021

LuckyMD commented Feb 11, 2021

giovp commented Feb 11, 2021

fidelram commented Feb 12, 2021

atarashansky commented Feb 12, 2021 • edited

atarashansky commented Feb 12, 2021 • edited

giovp commented Feb 13, 2021

atarashansky commented Feb 13, 2021

atarashansky commented Feb 13, 2021

giovp commented Feb 16, 2021

giovp commented Feb 22, 2021

gokceneraslan commented Feb 22, 2021 • edited

atarashansky commented Feb 25, 2021

atarashansky commented Feb 25, 2021

gokceneraslan commented Feb 26, 2021

giovp commented Mar 4, 2021

jlause commented Mar 4, 2021

atarashansky commented Mar 6, 2021

dkobak commented Mar 6, 2021

atarashansky commented Mar 6, 2021

dkobak commented Mar 6, 2021

hurleyLi commented Jun 2, 2021

Moloch0 commented May 26, 2023

cpantea commented Jun 16, 2023

ivirshup commented Feb 11, 2021 •

edited

atarashansky commented Feb 12, 2021 •

edited

atarashansky commented Feb 12, 2021 •

edited

gokceneraslan commented Feb 22, 2021 •

edited