
Intuition for noise parameter #237

Open
binyaminZ opened this issue May 25, 2022 · 5 comments

@binyaminZ

Hi,
Thanks for this useful tool! I have done some extensive analysis and found that BASiCS very efficiently removes several types of extrinsic noise, as validated on at least three independent datasets.
However, I am not able to follow the math of your algorithm, so I am having a hard time parsing the statistical meaning of delta. I do understand that the mean parameter, mu, would be the equivalent of TPM or something like this, but how would you describe delta? It seems to behave like CV, or maybe like CV^2, but I'm not sure what the mathematical difference between delta and CV is.
Let me explain our motivation: we are interested in comparing super-Poissonian behaviour between samples, and we are looking to convert the delta values into something that is independent of the mean, like the Fano factor (variance / mean). Would it make sense to just divide delta by mu? Or how would you address this in general? The regression solution is not good for us, because our treatment causes a global change in noise, and the regression approach cancels out this effect.

Best,
Binyamin

@alanocallaghan
Collaborator

A short explanation of delta is that it is the over-dispersion: the amount of variation observed above Poisson noise for a given cell+gene combination.1 In BASiCS, counts are modelled as following a negative binomial distribution. This is essentially a super-Poisson model, with the level of "super-Poisson-ness" controlled by the over-dispersion parameter delta.

That is, we model counts $X$ for gene $i$ in cell $j$ as having a mean that depends on a gene-specific mean level $\mu_i$ and a cell-specific normalisation factor $\nu_j$:

$$ X_{ij} \sim NB(\nu_j \mu_i, \delta_{i}^{-1}) $$

If we ignore the cell-specific factor $\nu_j$ for a minute and consider it constant, then the variance of this distribution is the Poisson variance (i.e., the mean, given by $\nu_j \mu_i$) plus an over-dispersion term quantified by $\delta_i$, which multiplies the square of the mean:

$$ \text{Var}(X_{ij}) = \nu_j \mu_i + \delta_{i}(\nu_j \mu_i)^2 $$
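A quick simulation makes this concrete (a numpy sketch with arbitrary illustrative values for mu and delta; the thread's own code is R, but the same check works in any language): a negative binomial with size parameter $1/\delta$ has variance $m + \delta m^2$, and hence a Fano factor of $1 + \delta m$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, delta = 50.0, 0.4        # hypothetical gene mean and over-dispersion
size = 1.0 / delta           # NB "size" r = 1/delta
p = size / (size + mu)       # numpy's NB(n, p) has mean n(1-p)/p

x = rng.negative_binomial(size, p, 200_000)

print(x.mean())              # close to mu
print(x.var())               # close to mu + delta * mu**2
print(x.var() / x.mean())    # Fano factor, close to 1 + delta * mu
```

Note that the super-Poisson term dominates for well-expressed genes: here the Poisson part of the variance is 50 while the over-dispersion part is 1000.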

I hope this is sufficient intuition for the interpretation of delta.

As for how to account for mean dependence: can you maybe explain in more detail why the global change in noise makes a regression approach unsuitable? If you estimate the mean/variance trend independently in the two groups you are investigating, a global shift shouldn't be too much of a problem, although interpretation may be tricky if you have very large-scale shifts in expression patterns. In general, most approaches that remove mean dependence from measures of variability are at least somewhat similar: they usually involve some sort of regression model, because ratios of variance or SD to mean generally remain highly correlated with the mean in sequencing data (single-cell data in particular).
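To illustrate what such a regression step does (a generic numpy sketch on simulated data; this is not the BASiCS regression, which uses a more flexible basis-function model): fit a trend of log-variability against log-mean and keep the residuals as the mean-independent measure.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data: per-gene means and a variability measure (here a
# Fano-like quantity) that trends with the mean, as is typical in scRNA-seq
mean = rng.lognormal(2, 1, 500)
fano = 1 + 0.3 * mean * rng.lognormal(0, 0.3, 500)

# simple linear trend in log-log space; the residuals are the
# "residual over-dispersion" left after removing the mean trend
slope, intercept = np.polyfit(np.log(mean), np.log(fano), 1)
residual = np.log(fano) - (intercept + slope * np.log(mean))

# least-squares residuals are uncorrelated with the regressor
print(np.corrcoef(np.log(mean), residual)[0, 1])   # ~ 0
```

Note that because the fitted trend includes an intercept, any global offset in variability is absorbed into the fit and removed from the residuals, which is exactly the behaviour you are objecting to.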

Footnotes

  1. The 2015 and 2016 papers cover this in more detail; in truth, the variance of the counts also depends on the cell- and batch-specific factors, but as an intuition I think this is okay.

@binyaminZ
Author

We treat cells with IdU, causing a global increase in noise without altering mean expression levels. See our recent paper. Regarding mean dependence, we see that delta is negatively correlated with the mean (mu). We are interested in measuring noise amplification in a mean-independent manner and are looking for a metric that resembles the Fano factor, as below (from this paper):
[image: definition of the Fano factor from the cited paper]
If I understand correctly, using the Fano factor is conceptually distinct from the regression methods you mention. Is it possible to use delta to "produce" a Fano-like measure?

@binyaminZ
Author

Hi,
To follow up on this discussion: do you have any idea why there is a negative relation between mean and over-dispersion, which you describe as "the confounding that is typically observed between mean and over-dispersion for scRNA-seq datasets (Eling et al. 2018)"? I don't think such a relation is expected for the true absolute counts (e.g. as measured by other means, like smFISH).
Of note, both papers you mention there (Brennecke et al. 2013 and Antolovic et al. 2017) discuss the mean-variance (or CV^2) dependence, but not the mean-overdispersion dependence.

@alanocallaghan
Collaborator

Hi, sorry for the delay getting back to you.

To follow up on the mean-dependence issue: you can see below that CV^2 and over-dispersion are similar measures, and that in (sc)RNA-seq data the Fano factor isn't typically mean-independent. That's why scRNA-seq pipelines for selecting HVGs generally1 involve a step where a variability measure (Fano/CV^2/variance/over-dispersion) is regressed against the mean.
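Concretely, under the NB model from earlier in the thread (ignoring $\nu_j$ and writing $m = \mu_i$, $\delta = \delta_i$ for brevity), the standard negative binomial variance $m + \delta m^2$ gives:

$$ \mathrm{CV}^2 = \frac{\mathrm{Var}(X)}{\mathrm{E}(X)^2} = \frac{m + \delta m^2}{m^2} = \frac{1}{m} + \delta, \qquad \mathrm{Fano} = \frac{\mathrm{Var}(X)}{\mathrm{E}(X)} = 1 + \delta m $$

So for well-expressed genes, CV^2 is close to the over-dispersion itself, while the model-implied Fano factor grows with the mean; a delta-derived Fano-like quantity would scale as $1 + \delta m$, rather than $\delta / m$.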

The question of whether and why mean and overdispersion are negatively correlated in scRNAseq is a complicated one - there are papers in favour of the idea2 and against it3. I don't feel qualified to provide a final answer in this thread unfortunately.

I'm not really familiar with smFISH data to be honest. Aspects of the data-generating process are different so perhaps the same doesn't hold there, though I don't think that really impacts how scRNAseq data should be treated.

library("scRNAseq")
library("matrixStats")
library("glmGamPoi")

data <- ZeiselBrainData()
## restrict to a cell type just to avoid composition effects
data <- data[, data$level1class == "pyramidal CA1"]
## arbitrary cutoff just to thin the data a bit
data <- data[rowMeans(counts(data)) > 1, ]

mean <- rowMeans(counts(data))
var <- rowVars(counts(data))
cv2 <- var / (mean^2)
fano <- var / mean

plot(mean, var, log = "xy")

plot(mean, cv2, log = "xy")

plot(mean, fano, log = "xy")

## estimate gamma-poisson glm to get overdispersion estimates
## these will be different to the BASiCS estimates but a good stand-in
fit <- glm_gp(data)

plot(fit$overdispersions, cv2, log = "xy")

plot(fit$overdispersions, var, log = "xy")

plot(fit$overdispersions, fano, log = "xy")

Footnotes

  1. such as http://bioconductor.org/books/3.15/OSCA.advanced/more-hvgs.html

  2. such as https://link.springer.com/article/10.1186/s13059-021-02584-9

  3. such as https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02451-7

@catavallejos
Owner

Hi @binyaminZ, just a couple of things to add to Alan's response:

  1. The trend that is typically observed between mean and over-dispersion also holds for other technologies. In particular, bulk RNA-seq DE tools such as DESeq2 make use of this relationship to stabilise inference for over-dispersion parameters. In terms of smFISH, my experience is rather limited. However, when analysing the data presented here, we saw a similar trend (in fact, Grün et al. showed that CV estimates derived from their scRNAseq and smFISH data were similar).

  2. In terms of preserving a global increase in variability, one option could be to adapt the definition of $\epsilon_i$ such that a global offset effect is not removed (this is me thinking out loud, rather than a tried & tested approach!). Currently, we define it as the difference between $\log(\delta_i)$ and what is predicted by the overall trend $f(\mu_i)$. The latter is given by:

$$f(\mu_i ) = \alpha_0 + \alpha_1 \log(\mu_i) + \sum_{l=1}^L \beta_l g_l(\log(\mu_i)),$$

where $g_l(\cdot)$ represent Gaussian radial basis functions. Then, $\epsilon_i$ is calculated as:

$$\epsilon_i = \log(\delta_i) - \left[\alpha_0 + \alpha_1 \log(\mu_i) + \sum_{l=1}^L \beta_l g_l(\log(\mu_i)) \right]$$

If there is a global increase in variability, I imagine this would be captured by $\alpha_0$. Therefore, it may be interesting to test the following as a variability metric in your context:

$$\epsilon_i = \log(\delta_i) - \left[\alpha_1 \log(\mu_i) + \sum_{l=1}^L \beta_l g_l(\log(\mu_i)) \right]$$
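As a toy illustration of the difference between the two definitions (hypothetical numbers, sketched in numpy; in practice these quantities would come from the fitted BASiCS regression):

```python
import numpy as np

# hypothetical fitted quantities from a BASiCS-style regression:
# log_delta: per-gene over-dispersion estimates on the log scale
# trend:     fitted trend f(mu_i), including the intercept alpha0
log_delta = np.array([0.5, -0.2, 1.1, 0.3])
trend = np.array([0.6, 0.1, 0.9, 0.4])
alpha0 = 0.25                 # fitted global intercept

# standard residual over-dispersion: removes the whole trend,
# including any global offset in variability
epsilon = log_delta - trend

# modified metric: add alpha0 back, so a genome-wide shift in
# variability is preserved rather than regressed out
epsilon_offset = log_delta - (trend - alpha0)

print(epsilon_offset - epsilon)   # every gene shifts by alpha0
```

Under this definition a treatment that raises variability globally would move all genes' $\epsilon_i$ values up together, rather than being absorbed into the intercept.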
