add normalization method to scanpy? #1

Open
giovp opened this issue Dec 12, 2020 · 22 comments
Comments

@giovp

giovp commented Dec 12, 2020

hi!
I recently read the paper and found the method really convincing and effective!
Would you be interested in submitting a PR to scanpy? I had a look at the code and it seems pretty straightforward to re-implement there. This is essentially the function needed, right?

def pearson_residuals(counts, theta):
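Roughly, as I understand it, that function computes something like the following (a dense sketch only; the default theta and the clipping value are my assumptions based on the paper):

import numpy as np

def pearson_residuals_sketch(counts, theta=100):
    # counts: cells x genes array of UMI counts (dense here for simplicity)
    counts = np.asarray(counts, dtype=float)
    n_c = counts.sum(axis=1, keepdims=True)                  # per-cell sequencing depth
    p_g = counts.sum(axis=0, keepdims=True) / counts.sum()   # per-gene fraction of all counts
    mu = n_c @ p_g                                           # expected counts under the null model (rank-1)
    z = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    clip = np.sqrt(counts.shape[0])                          # clip residuals at sqrt(n_cells)
    return np.clip(z, -clip, clip)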

If you are interested and willing, I would suggest loosely following sc.pp.normalize_total for the implementation:
https://github.com/theislab/scanpy/blob/5533b644e796379fd146bf8e659fd49f92f718cd/scanpy/preprocessing/_normalization.py#L28-L202
You can also have a look at the docs.
I would be very happy to assist/help out in case you are interested!

Thank you!
Giovanni

@dkobak

dkobak commented Dec 12, 2020

Hi, Giovanni,

thanks for reaching out! We are planning to submit a PR to scanpy, perhaps even next week. Are you one of the scanpy developers? I have some questions:

  1. Should it be a new separate function, or something like flavor='pearson_residuals' as an option in an existing function?

  2. Normalizing with Pearson residuals makes a sparse matrix into a dense one. So running it on a large dataset where UMI counts are stored as a sparse matrix would require a lot of RAM to store the dense result. Of course, if one selects something like 1000 most variable genes in advance, then it's not an issue. Does scanpy do most-variable-gene selection by default?

  3. Pearson residuals can be used for most variable gene selection and/or for data normalization. But these are two different things. How should we implement it in scanpy?

@giovp
Author

giovp commented Dec 13, 2020

Hi @dkobak ,

thank you for the prompt reply! All valid questions and it's indeed best to discuss them upfront. Pinging @ivirshup @flying-sheep and @LuckyMD to join discussion.

So a standard normalization workflow in scanpy now looks like this:

sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

As you pointed out, the Pearson residuals step could come in before or after that, e.g.:

sc.pp.pearson_residual(adata)
sc.pp.highly_variable_genes(adata)

# or
# some normalization
sc.pp.highly_variable_genes(adata)
sc.pp.pearson_residual(adata)

It should also be pointed out that flavor="seurat_v3" in sc.pp.highly_variable_genes accepts raw counts, and so could also be an alternative way to filter out genes before pearson_residuals (see the sketch below).
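For example (a rough sketch; sc.pp.pearson_residual is the hypothetical new function discussed above):

sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000)  # works on raw counts
adata = adata[:, adata.var.highly_variable]  # subset before densifying
sc.pp.pearson_residual(adata)  # hypothetical function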

For these reasons, my naive view on this is to have a separate function that would give more flexibility to the users, as long as they know what they are doing. So

Should it be a new separate function, or something like flavor='pearson_residuals' as an option in an existing function?

yes

Normalizing with Pearson residuals makes a sparse matrix into a dense one. So running it on a large dataset where UMI counts are stored as a sparse matrix would require a lot of RAM to store the dense result. Of course, if one selects something like 1000 most variable genes in advance, then it's not an issue. Does scanpy do most-variable-gene selection by default?

No, it does not, but it's flexible enough to do it with or without normalized counts. Re the dense matrix: it's OK as long as there is a warning to the user.

Pearson residuals can be used for most variable gene selection and/or for data normalization. But these are two different things. How should we implement it in scanpy?

for the reasons above, I would implement it in a separate function.

Looking forward to hearing what you and the others think!

@dkobak

dkobak commented Dec 13, 2020

Thanks @giovp. This is very helpful.

What is the meaning of "total" in normalize_total? Does it mean normalize by the "total" read/UMI counts per cell? If so, then I think Pearson residuals normalization should indeed be a separate function, also because it replaces both normalize_total and log1p. So I imagine a function like

sc.pp.pearson_residuals(adata)

For highly variable genes I would suggest adding a flavor="pearson_residuals", i.e.

sc.pp.highly_variable_genes(adata, flavor="pearson_residuals", n_top_genes=2000)

Despite what you said, I am a bit concerned about Pearson residuals resulting in a dense matrix. So I would suggest that sc.pp.pearson_residuals() has an argument highly_variable_genes set to True by default, and also n_top_genes=1000 (or some other default value). Then this function would run highly_variable_genes(flavor="pearson_residuals") first. This could be switched off by highly_variable_genes=False.

I am not sure if it would make sense to allow pearson_residuals() to use highly variable genes previously selected by highly_variable_genes() with any flavor. If yes, then I am not sure what the best way to organize the interface would be.

In addition, what do you think about having sc.pp.sqrt() in addition to log1p? I would let it have several flavors: flavor="freeman-tukey", "anscombe", "raw" (perhaps Freeman-Tukey by default). The square-root transform is approximately variance-stabilizing for Poisson counts, whereas log1p is a bit of a hack.
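A minimal sketch of what such a function could compute (sc.pp.sqrt and the flavor names are just the proposal above, not an existing API):

import numpy as np

def sqrt_transform(X, flavor="freeman-tukey"):
    # X: dense array of UMI counts
    if flavor == "freeman-tukey":
        return np.sqrt(X) + np.sqrt(X + 1)  # Freeman-Tukey transform
    elif flavor == "anscombe":
        return 2 * np.sqrt(X + 3 / 8)       # Anscombe transform
    elif flavor == "raw":
        return np.sqrt(X)
    raise ValueError(f"Unknown flavor: {flavor}")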

@LuckyMD

LuckyMD commented Dec 13, 2020

Hi @dkobak, @giovp,

I think Pearson residuals would have to be a separate function, and it would require a bit of separate preprocessing infrastructure, as it wouldn't fit into the mix-and-match workflow scanpy currently has for pre-processing (which is currently not great). This is mainly because Pearson residuals wouldn't really work with scanpy's sc.pp.highly_variable_genes function, which relies on mean-binning, and the residuals would have mean 0 (or very close) by design, I guess. The method's included highly variable gene selection would work much better here.

One thing I've been wondering about models that output residuals:
Would it be possible to not use residuals, but keep the intercept as well to preserve some concept of relative gene expression magnitude? This would also allow us to use the standard (but in this case obviously not recommended) scanpy sc.pp.highly_variable_genes function, and would enable users to pick and mix functions again.

Regarding filtering to highly variable genes... this is something we have been removing gradually from Scanpy. When Scanpy initially filtered out highly variable genes for further processing, we stored the full feature set in adata.raw. We've since used the adata.var['highly_variable'] field to avoid filtering and to enable users to query the expression of all genes later for plotting. I believe @ivirshup has been meaning to slowly deprecate adata.raw, as currently all layers have to have the same number of features.

How about implementing this as two scanpy functions (for normalization and HVG selection) without filtering but adding a warning about scaling in the docs? We would also need to have a test to see if genes have mean 0 expression before standard HVG functionality is used within scanpy.

Regarding alternative forms of variance stabilization... I'm not sure what the best way forward there would be. I believe you showed sqrt-transform not to be the best idea, and it's typically not used. Maybe there's such a thing as providing too many options? On the other hand, it's easy to implement and some people might find it useful... I haven't really seen it outside of pre-processing papers tbh.

@dkobak

dkobak commented Dec 13, 2020

Thanks for joining in @LuckyMD.

So if I understood correctly, you agree with having two new functions

sc.pp.pearson_residuals(adata)
sc.pp.highly_variable_genes(flavor="pearson_residuals")

right?

Would it be possible to not use residuals, but keep the intercept as well to preserve some concept of relative gene expression magnitude?

Interesting question. The problem is that we do sequencing-depth normalization as part of computing the residuals. So there is no fixed "intercept" for each gene. Instead the "intercept" is different for each cell (because it depends on the cell's sequencing depth).

[screenshot of the residual formula from the preprint]

I guess one could add some intercept, e.g. p_g * median(n_c) for each gene after computing the residuals. @jlause and I have not thought about this, but I guess one could consider doing it. I am not sure about the effect it would have downstream -- depends on how downstream scipy functions handle a non-centered data matrix (e.g. for PCA/t-SNE/UMAP/etc. one would want it to be centered eventually). We could have it as an option (perhaps even set to True by default, if it fits the scanpy pipeline better...). I don't immediately see many downsides, apart from this feeling a bit arbitrary.
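To illustrate (a hypothetical sketch, not something we have tested):

import numpy as np

def add_gene_intercepts(z, counts):
    # z: cells x genes Pearson residuals (dense); counts: matching raw UMI counts
    p_g = counts.sum(axis=0) / counts.sum()   # per-gene fraction of total counts
    n_c = counts.sum(axis=1)                  # per-cell sequencing depth
    return z + p_g[None, :] * np.median(n_c)  # shift each gene by p_g * median(n_c)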

Regarding filtering to highly variable genes... this is something we have been removing gradually from Scanpy.

I see what you are saying. But when dealing with, say, 1M cells, a dense data matrix of 1M times 25k genes takes a LOT of space (roughly 100 GB in float32). I would definitely like to have the possibility to run the analysis on my laptop, which would require selecting, let's say, 1k highly variable genes before computing dense residuals. Can we have it at least as an option (turned off by default), without breaking scanpy's logic? In this case, it'd be nice to have adata.raw still available to query any gene if one wants. Is there any way this can work after you phase out adata.raw @ivirshup ?

I believe you showed sqrt-transform not to be the best idea, and it's typically not used. Maybe there's such a thing as providing too many options?

We do argue it's not the best idea, but it's not worse than log1p (rather better). I am not sure about this one, but my preference would be to have such a function... As you say, "it's easy to implement and some people might find it useful". It's not used very often, that's true, but that's because everybody is following tradition and using log1p...

@LuckyMD

LuckyMD commented Dec 14, 2020

So if I understood correctly, you agree with having two new functions

Partly... I guess the second function would be an addition to sc.pp.highly_variable_genes(). And I would maybe call the first function sc.pp.normalize_pr? Extending the HVG function would mean this has to live in scanpy core. I would like to hear what @ivirshup thinks about this, as normally we would put functions like this in scanpy.external. Although I assume that maintenance of this function would not be problematic, as it seems quite simple if I saw correctly.

I am not sure about the effect it would have downstream -- depends on how downstream scipy functions handle non-centered data matrix

sc.pp.pca() automatically centers the data. This would be mainly for showing gene expression levels on a UMAP and any other games that people might want to play with the gene expression data. Quite a lot of biologists I collaborate with aren't so comfortable with showing negative expression values in a plot.

It would be interesting to compare your suggestion with expression levels per gene that come out of other normalization methods that retain expression magnitude (like scran pooling).

In this case, it'd be nice to have adata.raw still available to query any gene if one wants. Is there any way this can work after you phased our adata.raw

adata.raw hasn't been phased out yet, but I think Isaac is looking for arguments why it shouldn't be ;). Maybe this is one... To be honest, I would think that anyone can subset the gene space to HVGs themselves if they want to run this, no?

The second issue is that you'd have to use a non-Pearson-residuals normalization for the first HVG selection, and then run Pearson residuals normalization on this HVG set to know which HVG selection you might want to do based on the variance of residuals, no?

It's not used very often, that's true, but that's because everybody is following tradition and using log1p...

That's probably true. I guess I use sc.pp.log1p not only for variance stabilization, but also to come closer to satisfying the normality assumption of several downstream methods. Would you suggest another transform for this?

@ivirshup

Hey @dkobak. I haven't actually had a chance to fully read the preprint yet, so I might be missing out on some details of the method.

I am not sure about the effect it would have downstream -- depends on how downstream scipy functions handle non-centered data matrix ... But when dealing with say 1M cells, dense data matrix of 1M times 25k genes takes A LOT of space

So, what are the uses of the dense residual matrix? Do you still want it around after you've computed a PCA or other decomposition on top of it? To me, it seems like there are more downstream uses of the full feature set than a reduced one.

A possible solution here would be to not actually create a dense matrix. This is currently what we do with our default PCA for sparse data by using a scipy LinearOperator to implicitly center the matrix. Perhaps a similar approach could be taken here?

LinearOperators let you implicitly center

Here's an example of how we mean center the expression matrix for PCA

import numpy as np
from scipy.sparse import spmatrix
from scipy.sparse.linalg import LinearOperator

def _implicitly_center(X: spmatrix, mu: np.ndarray) -> LinearOperator:
    # Wrap X as a LinearOperator that behaves like X with mu subtracted from every row,
    # without ever forming the dense centered matrix.
    mu = mu[None, :]
    XH = X.T.conj(copy=False)
    _ones = np.ones(X.shape[0])[None, :].dot
    return LinearOperator(
        matvec=lambda x: X.dot(x) - mu.dot(x),
        dtype=X.dtype,
        matmat=lambda x: X.dot(x) - mu.dot(x),
        shape=X.shape,
        rmatvec=lambda x: XH.dot(x) - mu.T.dot(_ones(x)),
        rmatmat=lambda x: XH.dot(x) - mu.T.dot(_ones(x)),
    )

Another option would be to just store the residuals for the variable genes. A simple encoding would be a csc matrix where you just don't include entries for the other variables and loadings for missing variables in varm["PCs"] could be NaN.
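A toy illustration of that encoding (names are made up, not an existing API):

import numpy as np
from scipy import sparse

def embed_hvg_residuals(residuals_hvg, hvg_idx, n_genes):
    # residuals_hvg: dense cells x n_hvg residual matrix; hvg_idx: column indices of the HVGs
    n_cells, n_hvg = residuals_hvg.shape
    rows = np.tile(np.arange(n_cells), n_hvg)        # row index for each stored value
    cols = np.repeat(np.asarray(hvg_idx), n_cells)   # gene index for each stored value
    data = residuals_hvg.ravel(order="F")            # column-major order to match rows/cols above
    # non-HVG columns simply have no stored entries
    return sparse.csc_matrix((data, (rows, cols)), shape=(n_cells, n_genes))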

@dkobak

dkobak commented Dec 14, 2020

Replying to both comments above:

normally we would put functions like this in scanpy.external

That's up to you guys. We can make normalize_pearson_res and highly_variable_genes_pearson_res and put both into scanpy.external if this is what you prefer.

The second issue is that you'd have to use a non-pearson residuals normalization for the first HVG selection and then run normalization on this HVG set with pearson residuals to then know which HVG selection you might want to do based on variance of residuals, no?

Actually no! HVG selection with Pearson residuals does not require a lot of memory. Residuals can be computed per-gene (or in some batches, e.g. 1000 genes at a time), then we find their variance and don't need to store the residuals. So one can easily find variances of Pearson residuals without forming the entire Pearson residual matrix in memory.

So one would be able to run highly_variable_genes_pearson_res without any memory issues.
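A rough sketch of that batchwise computation (illustrative only; clipping of residuals omitted):

import numpy as np

def pearson_residual_variances(counts, theta=100, batch_size=1000):
    # counts: cells x genes sparse matrix of raw UMI counts
    n_c = np.asarray(counts.sum(axis=1)).ravel()              # per-cell depth
    p_g = np.asarray(counts.sum(axis=0)).ravel() / n_c.sum()  # per-gene fraction
    variances = np.zeros(counts.shape[1])
    for start in range(0, counts.shape[1], batch_size):
        stop = min(start + batch_size, counts.shape[1])
        X = counts[:, start:stop].toarray()                   # densify only this gene batch
        mu = n_c[:, None] * p_g[None, start:stop]
        z = (X - mu) / np.sqrt(mu + mu ** 2 / theta)
        variances[start:stop] = z.var(axis=0)                 # keep only per-gene variances
    return variances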

I guess I use sc.pp.log1p not only for variance stabilization, but also to be closer to satisfy the normal assumption of several downstream methods. Would you suggest another transform for this?

To be honest, I think sqrt makes more sense than log1p for any downstream use, at least for UMI data.


So, what are the uses of the dense residual matrix? Do you still want it around after you've computed a PCA or other decomposition on top of it? To me, it seems like there are more downstream uses of the full feature set than a reduced one.

Hmm. Maybe one does not want to have it around after PCA. When you run t-SNE/UMAP/clustering/etc in scanpy, are you using the PCA object? If yes, then one approach would be to actually combine normalize_pearson_res with PCA in one function like pca_pearson_res. And leave the original counts as they are (unnormalized). Sorry - I guess I am not super familiar with recommended scanpy workflows.

A possible solution here would be to not actually create a dense matrix. This is currently what we do with our default PCA for sparse data by using a scipy LinearOperator to implicitly center the matrix. Perhaps a similar approach could be taken here?

This won't work, I think. The reason the Pearson residual matrix is dense is not that the gene means are subtracted. The "mean" that is subtracted from each UMI count depends on both cell and gene, so the matrix of means is itself dense.

Another option would be to just store the residuals for the variable genes. A simple encoding would be a csc matrix where you just don't include entries for the other variables and loadings for missing variables in varm["PCs"] could be NaN.

You mean store the Pearson residuals not in adata.X but in a separate matrix? One could do that, but then downstream functions won't know how to use it, no? Not sure I fully understand this suggestion.

@LuckyMD

LuckyMD commented Dec 14, 2020

Actually no! HVG selection with Pearson residuals does not require a lot of memory. Residuals can be computed per-gene (or in some batches, e.g. 1000 genes at a time), then we find their variance and don't need to store the residuals. So one can easily find variances of Pearson residuals without forming the entire Pearson residual matrix in memory.

Great! Then I would vote for this as an extension to sc.pp.highly_variable_genes(flavor='pearson_residuals'), as you initially suggested. There are more important votes than mine though ;).

To be honest, I think sqrt makes more sense than log1p for any downstream use, at least for UMI data.

Any particular reason for this?

@ivirshup

When you run t-SNE/UMAP/clustering/etc in scanpy, are you using the PCA object?

I'd say the recommended workflow is UMAP on the pca, then clustering on the neighbor network from UMAP. But this is pretty commonly the case in single cell analysis.

If yes, then one approach would be to actually combine normalize_pearson_res with PCA in one function like pca_pearson_res.

👍, but you'd know better if there were more downstream applications for the residuals.

This won't work I think. The reason Pearson residual matrix is dense is not because the gene means are subtracted. The "mean" that is subtracted from each UMI count depends on both cell and gene, so the matrix of means is itself dense.

So, I think the issue is not that the matrix of means is dense. This is fairly easy to deal with while never making a dense matrix. The LinearOperator lets us write something that does matrix operations, and pass that to things like svd solvers. For example, if the residuals were just X - mu, we could write this as X - (X.mean(axis=1) @ X.mean(axis=0)). We can use the associative property with the mu term to avoid creating a dense matrix. Here's what a LinearOperator for this looks like:

Full code
import numpy as np
from scipy import sparse

from scipy.sparse.linalg import LinearOperator, svds, aslinearoperator
from sklearn.utils import check_random_state
from sklearn.utils.extmath import svd_flip
from scanpy.pp._utils import _get_mean_var


def operator_from_vector_product(col, row):
    assert len(col.shape) == 2 and col.shape[1] == 1
    assert len(row.shape) == 2 and row.shape[0] == 1
    return LinearOperator(
        matvec=lambda x: col @ (row @ x),
        matmat=lambda x: col @ (row @ x),
        rmatvec=lambda x: row.T @ (col.T @ x),
        rmatmat=lambda x: row.T @ (col.T @ x),
        dtype=col.dtype,
        shape=(col.shape[0], row.shape[1]),
    )

# In practice this probably should have another centering step for the residuals
def pca_from_centered(X, X_centered, npcs=50, random_state=None):
    random_state = check_random_state(random_state)
    np.random.set_state(random_state.get_state())
    random_init = np.random.rand(np.min(X.shape))

    u, s, v = svds(X_centered, solver="arpack", k=npcs, v0=random_init)
    u, v = svd_flip(u, v)
    idx = np.argsort(-s)
    v = v[idx, :]

    X_pca = (u * s)[:, idx]
    ev = s[idx] ** 2 / (X.shape[0] - 1)

    total_var = _get_mean_var(X)[1].sum()
    ev_ratio = ev / total_var

    output = {
        'X_pca': X_pca,
        'variance': ev,
        'variance_ratio': ev_ratio,
        'components': v,
    }
    return output
Example
# Test data
X = sparse.random(10000, 1000, format="csr", density=0.2)

mu = operator_from_vector_product(X.mean(axis=1).A, X.mean(axis=0).A)
Z = aslinearoperator(X) - mu

%memit
# peak memory: 342.61 MiB, increment: 0.07 MiB

%%memit
pca_from_centered(X, Z, random_state=42)
# peak memory: 376.68 MiB, increment: 34.05 MiB

%%memit
pca_from_centered(X, X.toarray() - (X.mean(axis=1) @ X.mean(axis=0)), random_state=42)
# peak memory: 589.89 MiB, increment: 213.20 MiB

# These results are very similar. You can compare the top few components to check.
# There's some weird behaviour with `svds` and dense arrays involving the number 
# of singular values, but I forget exactly what it is.

I just don't know if mu_{c, g} / sqrt(mu_{c, g} + mu_{c, g} ** 2 / theta) can be expressed as efficiently. I couldn't figure it out today.

You mean store the Pearson residuals not in adata.X but in a separate matrix?

Yeah, as something like adata.layers["pearson_residuals"].

One could do that, but then downstream functions won't know how to use it, no?

Here I was thinking downstream tools with special behavior for pearson residuals would look for some "pearson_residuals" key in uns, then a "layers_key" within that.

You could look at the output of sc.pp.neighbors for a similar case.

@dkobak

dkobak commented Dec 18, 2020

Very good point about the rank-1 μ matrix being representable this way as a LinearOperator. That is pretty cool. However, my intuition is that it won't work with the denominator, because it's not linear anymore.

Okay, based on all of the above, my suggestion would be:

  1. We implement either sc.pp.highly_variable_genes(flavor='pearson_residuals') or sc.pp.highly_variable_genes_pearson_res in scanpy.external (what's your preference?)

BTW, what does flavor='seurat_v3' do? How is it different from 'seurat'?

  2. We implement sc.pp.pca_pearson_res that would use only highly variable genes by default, as sc.pp.pca currently does.

  3. We can also optionally implement sc.pp.normalize_pearson_res that would transform the entire count matrix into Pearson residuals, just in case somebody wants to use this matrix for something else other than PCA. So this would give access to the residual values. This function could have an optional argument use_highly_variable=False, and if it's set to True then it kicks out all non-highly-variable genes to save memory. It could also have inplace=True by default, but with inplace=False one would get the Pearson residuals back in another adata object.

Actually, this way sc.pp.pca_pearson_res could run sc.pp.normalize_pearson_res(use_highly_variable=True, inplace=False) first, and then run PCA on that. But the recommended Pearson residual pipeline would not use this function directly and would simply be

sc.pp.highly_variable_genes(flavor='pearson_residuals')
sc.pp.pca_pearson_res()

(@jlause We should implement some checks so that if the data look like read counts and not like UMIs, e.g. values too large, then you get a warning. We are currently working on Pearson residuals for read counts, but it's work in progress.)

@ivirshup What do you think?

@LuckyMD

LuckyMD commented Dec 18, 2020

Just a quick comment... your suggested pipeline would omit a separate normalization function that users might expect to run before sc.pp.highly_variable_genes(), so it might be unintuitive for users expecting to follow the standard idea of scRNA-seq pre-processing. Thus, if the idea is to merge the function with PCA, I might consider doing it all in one go and instead selling it as a one-stop pre-processing function?

@dkobak

dkobak commented Dec 18, 2020

@LuckyMD Hmm. We could simply call this function normalize_pca_pearson_res() rather than pca_pearson_res(), because indeed normalization happens inside. So this might reduce confusion.

Then the application pipeline would be

sc.pp.highly_variable_genes(adata, flavor='pearson_residuals')
sc.pp.normalize_pca_pearson_res(adata)

It may still be confusing for some because normalize() comes after highly_variable_genes(), but I think maybe it's okay?

We could make normalize_pca_pearson_res run highly_variable_genes within by default, but this deviates from how sc.pp.pca() works so I would rather not do it.

(@jlause One thing we should do is to check if everything that is passed to Pearson residuals functions consists of integers. If not, fire a warning.)

@jlause
Contributor

jlause commented Dec 18, 2020

Hey all,

thanks for all this input! Regarding @LuckyMD's point about the 'one-stop pre-processing': I like the idea of having a complete workflow like HVG selection -> transform counts to Pearson residuals -> PCA. Because this is already three steps, would it make sense to bundle this as another scanpy.pp.recipe_pearson_residuals? Like the other recipes, it could have a few hyperparameters (like n_top_genes and n_PCs).

Additionally, we could implement the two core functionalities we propose as independent functions for flexible usage, as @dkobak suggested above:

  • expand sc.pp.highly_variable_genes(adata) with new flavor='pearson_residuals', performing just the gene selection based on Pearson residuals in a memory-friendly (batchwise) way.
  • add sc.pp.normalize_pearson_res(adata), which transforms the input counts to pearson residuals.

Within the recipe, we would then just call these functions in the appropriate order and add PCA on top.

@ivirshup @LuckyMD @giovp What do you think about this setup?

@jlause
Contributor

jlause commented Dec 18, 2020

Regarding the practical implementation of sc.pp.normalize_pearson_res(adata), @dkobak proposed to integrate usage of previously computed gene selections:

This function could have an optional argument use_highly_variable=False and if it's set to True then it kicks out all non-highly-var genes to save memory. It could also have inplace=True by default but with inplace=False one would get Pearson residuals back in another adata object.

From earlier posts I understood that filtering to HVGs can be applied in two ways:

  1. the user calls sc.pp.highly_variable_genes to set adata.var.highly_variable. Then, any downstream method called with use_highly_variable=True will only use HVGs internally, without changing the size of adata.X (e.g. in sc.pp.pca)
  2. the user subsets adata manually by adata = adata[:, adata.var.highly_variable] and then calls the downstream function. Size of adata.X changes.

I think what @dkobak proposed is a mixture:

  1. the user calls sc.pp.highly_variable_genes to set adata.var.highly_variable. The downstream function then does the subsetting internally. Size of adata.X changes.

What is the preferred way?

To me, it seems the easiest could be to not integrate any use_highly_variable=True/False functionality into sc.pp.normalize_pearson_res(adata) at all, and expect that users who want to do that just use method (2) and, if needed, save the input data matrix to adata.raw to have it around for later plotting etc. Then we would not need to save data matrices of different sizes in the same adata object.

@dkobak

dkobak commented Dec 18, 2020

@jlause Clarification:

The downstream function then does the subsetting internally. Size of adata.X changes.

What I meant is that this would ONLY happen when the user calls normalize_pearson_res with use_highly_variable=True, which would not be the default option. So with default arguments, everything would work as in your first way, as is standard in scanpy.

But you may be right that the second way is enough and we don't need to implement the use_highly_variable argument for normalize_pearson_res...

One thing, however: there is a slight difference between computing Pearson residuals on the subsetted matrix vs. on the full matrix but outputting only a subset of genes, because the sequencing depths would be different... Not sure it matters much.

@LuckyMD

LuckyMD commented Dec 18, 2020

I agree fully with @jlause. The proposed three functions (sc.pp.highly_variable_genes(adata, flavor='pearson_residuals'), sc.pp.normalize_pearson_res(adata), and sc.pp.recipe_pearson_residuals) fit fully within what a user would normally expect to find. The recipes we have so far are mostly normalization recipes, but I think it's fine for users to see this as an integrated strategy.

I would also prefer normalization functions that don't change the dimensions of the input object as a design principle.

One thing, however: there is a slight difference between computing Pearson residuals on the subsetted matrix vs. on the full matrix but outputing only a subset of genes. Because the seq depths would be different... Not sure it matters much.

This is the only aspect that might be problematic. We could get around this by adding an adata.obs.total_counts_raw column in sc.pp.filter_cells, so that any function can use this if it exists. But that would not cover all cases, of course.
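For example (just a sketch of the idea; the column name is hypothetical and here the totals are stored manually, whereas the suggestion is for sc.pp.filter_cells to do it itself):

import numpy as np

# remember the unfiltered per-cell totals before any gene filtering
adata.obs["total_counts_raw"] = np.asarray(adata.X.sum(axis=1)).ravel()
sc.pp.filter_genes(adata, min_cells=3)  # downstream functions could then fall back to the stored totals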

@dkobak

dkobak commented Dec 18, 2020

@jlause @LuckyMD Okay, so I am on board with sc.pp.highly_variable_genes(adata, flavor='pearson_residuals') and with sc.pp.normalize_pearson_residuals(adata) that normalizes all genes and does not change the dimensions (and does not allow it even optionally). If the user wants to subset the dimensions, they would need to do so manually.

I am less sure about sc.pp.recipe_pearson_residuals() vs. sc.pp.normalize_pca_pearson_residuals(). The former function would include HVG selection, Pearson residuals of HVG genes, and then PCA. The latter would do Pearson residuals of HVG genes (given that they were computed previously) and then PCA. The usage would be e.g.

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.highly_variable_genes(adata, flavor='pearson_residuals')
sc.pp.normalize_pca_pearson_residuals(adata)

vs.

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.recipe_pearson_residuals(adata)

compared to the standard pipeline

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)

Right? I think the first option is actually closer to what the default scanpy approach is. Otherwise you could combine normalize_total, log1p, highly_variable_genes, and pca into some "recipe" as well. You don't do it (at least currently you don't), and the user has to run these four functions manually. If so, I don't quite see why the Pearson residuals pipeline should necessarily be packaged into one function. So my current preference is rather the first approach above.

@jlause
Contributor

jlause commented Dec 19, 2020

@dkobak @LuckyMD Regarding sc.pp.recipe_pearson_residuals() vs. sc.pp.normalize_pca_pearson_residuals():

First of all: I thought having the pipeline as a recipe might be appropriate because it would ship the 'standard approach' we suggest in the paper, including all steps from raw to PCA, providing only a few degrees of freedom (I saw other recipes are doing something similar).

The pseudocode for sc.pp.recipe_pearson_residuals() would probably look like this:

def recipe_pearson_residuals(adata, n_top_genes=1000, n_pcs=50):

    sc.pp.filter_genes(adata, min_cells=5)
    sc.pp.filter_cells(adata, min_genes=200)

    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor='pearson_residuals')
    adata = adata[:, adata.var.highly_variable]
    sc.pp.normalize_pearson_residuals(adata)

    sc.pp.pca(adata, n_comps=n_pcs)

The version suggested by @dkobak would look like this:

def normalize_pca_pearson_residuals(adata, n_pcs=50):

    if 'highly_variable' in adata.var:
        adata = adata[:, adata.var.highly_variable]

    sc.pp.normalize_pearson_residuals(adata)
    sc.pp.pca(adata, n_comps=n_pcs)

and would allow the user to choose initial filtering and HVG selection.

In both bundles, one could additionally make sure that the adata.X remains unchanged in size along the way (except the initial gene/cell filtering in the recipe), and is still available afterwards.

@LuckyMD @ivirshup @giovp : What do you prefer?

@dkobak

dkobak commented Dec 19, 2020

I would use something like

adata_for_pca = adata[:, adata.var.highly_variable]
sc.pp.normalize_pearson_residuals(adata_for_pca)
sc.pp.pca(adata_for_pca, n_comps=n_pcs)
adata.obsm['X_pca'] = adata_for_pca.obsm['X_pca']

in either bundle.

@LuckyMD

LuckyMD commented Jan 4, 2021

@dkobak @jlause Happy new year! Just got back to this discussion.

Otherwise you could combine normalize_total, log1p, highly_variable_genes, and pca into some "recipe" as well.

I guess the analogy is to the old sc.pp.recipe_zheng17() function, and I think there's one more which includes sc.pp.normalize_total, sc.pp.log1p and sc.pp.scale. Adding PCA is new for a recipe, true.

For me, both function concepts are otherwise completely fine. The only minor issue with expecting a prior HVG selection is that any HVG selection approach in Scanpy apart from the one proposed here would expect normalized data (and logged... I think). Thus, I assume most people that use this recipe would also perform the HVG selection beforehand, as there is currently not really an alternative. I would, however, agree with keeping cell and gene filtering out of the recipe.

I wonder if keeping a subset of the adata object that was used for normalization is the most memory-efficient thing to do. We could instead just keep a mask in adata.var, and the residual data in adata.obsm or adata.raw? I'm not sure if we should store in adata.raw in a function by default though. That would depend on what the plans for that slot are... @ivirshup.

@dkobak

dkobak commented Jan 4, 2021

Thus, I assume most people that use this recipe would also perform the HVG selection before

I think we should make highly_variable_genes(flavor='pearson_residuals') as well as normalize_pearson_residuals() check whether the input data are integers and, if not, exit with an error. That should handle this potential problem, IMHO.
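Something as simple as this check would probably do (a sketch; the exact placement and error type are open):

import numpy as np
from scipy import sparse

def _check_integer_counts(X):
    # raise if the matrix does not look like raw integer (UMI) counts
    data = X.data if sparse.issparse(X) else np.asarray(X)
    if not np.all(np.mod(data, 1) == 0):
        raise ValueError("Pearson residual functions expect raw integer (UMI) counts.")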

Looking forward to @ivirshup's thoughts on this.

Happy New Year to all!
