
[WIP] Mp/dispersion smoothing #145

Open · wants to merge 59 commits into base: development

Conversation

@picciama (Collaborator) commented May 27, 2022

This branch contains the dispersion-smoothing functionality:

  • write a wrapper around the training procedure in the training_procedure function
  • check for the missing optional import of scikit-fda
  • implement the sctransform-like scale-param dispersion-smoothing procedure (a rough sketch follows after these lists)
  • implement the final mean-model refit after the dispersion-smoothing procedure
  • write a unit test for dispersion smoothing using sctransform with test data -> test for deviation from the true scale param
  • check dask array support in the sctransform code
  • check whether the exponentiation of the scale param is always correct

Optional TODOs:

  • implement the DESeq2 approach (doesn't smooth outliers, so it may not be applicable here) + unit test
  • implement the edgeR approach (will be moved to the edgePy package eventually)
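
To illustrate the sctransform-like smoothing item above: sctransform regularizes per-gene parameter estimates by kernel-smoothing them against each gene's log geometric mean. A minimal sketch of that idea, using statsmodels' kernel regression as a stand-in smoother (the optional scikit-fda import above hints that its smoothers are the intended tool; smooth_scale_params and all details below are illustrative assumptions, not this PR's code):

    import numpy as np
    from statsmodels.nonparametric.kernel_regression import KernelReg

    def smooth_scale_params(genes_log_gmean, scale_param):
        # Fit the per-gene scale (dispersion) estimates as a smooth function
        # of each gene's log geometric mean; smoothing on the log10 scale
        # keeps the regularized values positive, as in sctransform.
        kr = KernelReg(endog=np.log10(scale_param), exog=genes_log_gmean, var_type="c")
        smoothed_log10, _ = kr.fit(genes_log_gmean)
        return 10.0 ** smoothed_log10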

@picciama picciama added the enhancement New feature or request label May 27, 2022
@picciama picciama self-assigned this May 27, 2022
@ilan-gold (Collaborator) left a comment

Not sure what the status of the unfinished tasks is, but the PR as it stands looks good. I would say unit tests are a must, but that's up to David at the end of the day.

batchglm/train/numpy/base_glm/estimator.py, comment on lines 124 to 125:
if isinstance(genes_log_gmean, dask.array.core.Array):
genes_log_gmean = genes_log_gmean.compute()
Collaborator:

why?

@picciama (Collaborator, Author):
Because I had trouble with this downstream. It was easier to convert to a dense NumPy array beforehand. I will check again to see if I can delay this further.
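
A minimal sketch of what delaying the conversion might look like, materializing only at the point where a dense array is strictly required (ensure_dense is a hypothetical helper, not part of this PR):

    import dask.array as da
    import numpy as np

    def ensure_dense(arr):
        # Materialize a lazy dask array only where downstream code actually
        # needs a dense ndarray, instead of converting eagerly upfront.
        return arr.compute() if isinstance(arr, da.Array) else np.asarray(arr)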

genes_log_gmean = genes_log_gmean.compute()

# specify which kind of regularization is performed
scale_param = np.exp(self.model_container.theta_scale[0]) # TODO check if this is always correct
Collaborator:

why an exponentiation here? is this meant to be link-function specific?

@picciama (Collaborator, Author):
I'm not sure how to resolve this (thus the TODO). Indeed, I think this depends on the noise model. I will have to look at that again.
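
One hedged way to resolve the TODO: treat the exponentiation as the inverse link function of the scale parameter and dispatch on the noise model instead of hard-coding np.exp. The mapping and helper below are illustrative assumptions, not batchglm's API:

    import numpy as np

    # Illustrative assumption: each noise model uses a log link for its scale
    # parameter, so exp happens to be the inverse link for all of them today,
    # but dispatching makes the noise-model dependency explicit.
    INVERSE_SCALE_LINK = {
        "nb": np.exp,    # negative binomial: dispersion modeled on the log scale
        "norm": np.exp,  # normal: standard deviation modeled on the log scale
        "beta": np.exp,  # beta: samplesize parameter modeled on the log scale
    }

    def scale_param_from_theta(theta_scale_row, noise_model):
        # Map the fitted scale coefficients back to the natural scale.
        return INVERSE_SCALE_LINK[noise_model](theta_scale_row)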

@Zethson (Member) commented Jun 15, 2022

Hi,

"implement edgeR approach" this is actually a very important use-case for batchGLM. We are planning to use batchGLM/diffxpy for a pure Python implementation of MILO and we promised to be able to 1:1 replace

    dge = edgeR.DGEList(counts=count_mat[keep_nhoods,:][:,keep_smp], lib_size=lib_size[keep_smp])
    dge = edgeR.calcNormFactors(dge, method="TMM")
    dge = edgeR.estimateDisp(dge, model)
    fit = edgeR.glmQLFit(dge, model, robust=True)

eventually. I would kindly ask you to also strongly consider implementing this. Having the edgeR and DESeq2 approaches implemented here will also greatly boost the impact. I have no doubt about this.

@picciama (Collaborator, Author) commented:
> I would kindly ask you to also strongly consider implementing this. Having the edgeR and DESeq2 approaches implemented here will also greatly boost the impact. I have no doubt about this.

Most definitely. I already had a look at the edgeR source code. It shouldn't be too complicated to transfer this to batchGLM. I will start implementing this tomorrow but cannot give a time estimate at this point.

I think the main part would be to port estimateDisp, i.e. replace the edgeR GLM-fitting procedure with batchGLM while using trend.method="locfit".
I will do this first and see which of the arguments the function accepts are needed for this configuration. Once it's implemented, I'll see what else needs to be transferred to Python. Let me know if you have any specific dataset in mind that's well suited for testing the batchGLM procedure. I'll create a Jupyter notebook in the batchglm_tutorials repo, and we could maybe do some fancy rpy2 stuff to directly compare against edgeR.
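
For reference, the trend step of estimateDisp amounts to smoothing gene-wise dispersions as a function of average log-CPM. A very rough sketch of that idea, with statsmodels' lowess standing in for locfit (edgeR additionally shrinks gene-wise values toward the trend via empirical Bayes, which is omitted here; all names below are assumptions):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def trended_dispersion(avg_log_cpm, genewise_disp, frac=0.3):
        # Smooth gene-wise dispersion estimates against average log-CPM to
        # obtain a dispersion trend; smoothing on the log scale keeps the
        # trended values positive.
        smoothed_log = lowess(
            np.log(genewise_disp), avg_log_cpm, frac=frac, return_sorted=False
        )
        return np.exp(smoothed_log)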

@Zethson (Member) commented Jun 20, 2022

Amazing @picciama

> Let me know if you have any specific dataset in mind that's well suited for testing the batchGLM procedure. I'll create a Jupyter notebook in the batchglm_tutorials repo, and we could maybe do some fancy rpy2 stuff to directly compare against edgeR.

Would it be too crazy to just compare, for example, the DE results of the original edgeR implementation and the new Python version for a small dataset/simulation? I know that the edgeR model does a lot, but this might be the eventual goal?
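
A minimal sketch of what such a comparison could look like, assuming both tools' result tables are already loaded into aligned per-gene arrays (all variable names are hypothetical):

    import numpy as np
    from scipy import stats

    def compare_de(logfc_r, logfc_py, pval_r, pval_py):
        # Agreement between the two implementations on the same simulated
        # data: Pearson correlation of log fold changes and Spearman
        # correlation of -log10 p-values across genes.
        return {
            "logFC_pearson": stats.pearsonr(logfc_r, logfc_py)[0],
            "neglogp_spearman": stats.spearmanr(
                -np.log10(pval_r), -np.log10(pval_py)
            )[0],
        }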

Thank you!
