
[WIP] Mp/dispersion smoothing #145

Open · wants to merge 59 commits into base: development

Conversation

@picciama (Collaborator) commented May 27, 2022

This branch contains the dispersion-smoothing functionality:

  • write a wrapper around the training procedure in the training_procedure function
  • check for the missing optional import of scikit-fda
  • implement the sctransform-like scale-param dispersion-smoothing procedure (a rough sketch follows after these lists)
  • implement the final mean-model refit after the dispersion-smoothing procedure
  • write a unit test for dispersion smoothing using sctransform with test data -> test for deviation from the true scale param
  • check dask array support in the sctransform code
  • check whether the exponentiation of the scale param is always correct

Optional TODOs:

  • implement the DESeq2 approach (doesn't smooth outliers, so it may not be applicable here) + unit test
  • implement the edgeR approach (will be moved to the edgePy package eventually)
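
To illustrate the sctransform-like smoothing item above: sctransform regularizes per-gene parameter estimates by kernel-smoothing them against each gene's log geometric mean. A minimal sketch of that idea, using statsmodels' kernel regression as a stand-in smoother (the optional scikit-fda import above hints that its smoothers are the intended tool; smooth_scale_params and all details below are illustrative assumptions, not this PR's code):

    import numpy as np
    from statsmodels.nonparametric.kernel_regression import KernelReg

    def smooth_scale_params(genes_log_gmean, scale_param):
        # Fit the per-gene scale (dispersion) estimates as a smooth function
        # of each gene's log geometric mean; smoothing on the log10 scale
        # keeps the regularized values positive, as in sctransform.
        kr = KernelReg(endog=np.log10(scale_param), exog=genes_log_gmean, var_type="c")
        smoothed_log10, _ = kr.fit(genes_log_gmean)
        return 10.0 ** smoothed_log10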

@picciama picciama added the enhancement New feature or request label May 27, 2022
@picciama picciama self-assigned this May 27, 2022
@ilan-gold (Collaborator) left a comment

Not sure what the status of the unfinished tasks is, but the PR as it stands looks good. I would say unit tests are a must, but that's up to David at the end of the day.

batchglm/train/numpy/base_glm/estimator.py, comment on lines 124 to 125:
if isinstance(genes_log_gmean, dask.array.core.Array):
genes_log_gmean = genes_log_gmean.compute()
Collaborator:

why?

@picciama (Collaborator, Author):
Because I had trouble with this downstream. It was easier to convert to a dense NumPy array beforehand. I will check again to see if I can delay this further.
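
A minimal sketch of what delaying the conversion might look like, materializing only at the point where a dense array is strictly required (ensure_dense is a hypothetical helper, not part of this PR):

    import dask.array as da
    import numpy as np

    def ensure_dense(arr):
        # Materialize a lazy dask array only where downstream code actually
        # needs a dense ndarray, instead of converting eagerly upfront.
        return arr.compute() if isinstance(arr, da.Array) else np.asarray(arr)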

genes_log_gmean = genes_log_gmean.compute()

# specify which kind of regularization is performed
scale_param = np.exp(self.model_container.theta_scale[0]) # TODO check if this is always correct
Collaborator:

why an exponentiation here? is this meant to be link-function specific?

@picciama (Collaborator, Author):
I'm not sure how to resolve this (thus the TODO). Indeed, I think this depends on the noise model. I will have to look at that again.
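
One hedged way to resolve the TODO: treat the exponentiation as the inverse link function of the scale parameter and dispatch on the noise model instead of hard-coding np.exp. The mapping and helper below are illustrative assumptions, not batchglm's API:

    import numpy as np

    # Illustrative assumption: each noise model uses a log link for its scale
    # parameter, so exp happens to be the inverse link for all of them today,
    # but dispatching makes the noise-model dependency explicit.
    INVERSE_SCALE_LINK = {
        "nb": np.exp,    # negative binomial: dispersion modeled on the log scale
        "norm": np.exp,  # normal: standard deviation modeled on the log scale
        "beta": np.exp,  # beta: samplesize parameter modeled on the log scale
    }

    def scale_param_from_theta(theta_scale_row, noise_model):
        # Map the fitted scale coefficients back to the natural scale.
        return INVERSE_SCALE_LINK[noise_model](theta_scale_row)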

@Zethson (Member) commented Jun 15, 2022

Hi,

"implement edgeR approach" this is actually a very important use-case for batchGLM. We are planning to use batchGLM/diffxpy for a pure Python implementation of MILO and we promised to be able to 1:1 replace

    dge = edgeR.DGEList(counts=count_mat[keep_nhoods,:][:,keep_smp], lib_size=lib_size[keep_smp])
    dge = edgeR.calcNormFactors(dge, method="TMM")
    dge = edgeR.estimateDisp(dge, model)
    fit = edgeR.glmQLFit(dge, model, robust=True)

eventually. I would kindly ask you to also strongly consider implementing this. Having the edgeR and DESeq2 approaches implemented here will also greatly boost the impact. I have no doubt about this.

@picciama (Collaborator, Author) commented:
> I would kindly ask you to also strongly consider implementing this. Having the edgeR and DESeq2 approaches implemented here will also greatly boost the impact. I have no doubt about this.

Most definitely. I already had a look at the edgeR source code. It shouldn't be too complicated to transfer this to batchGLM. I will start implementing this tomorrow but cannot give a time estimate at this point.

I think the main part would be to port estimateDisp, i.e. replace the edgeR GLM-fitting procedure with batchGLM while using trend.method="locfit".
I will do this first and see which of the arguments the function accepts are needed for this configuration. Once it's implemented, I'll see what else needs to be transferred to Python. Let me know if you have any specific dataset in mind that's well suited for testing the batchGLM procedure. I'll create a Jupyter notebook in the batchglm_tutorials repo, and we could maybe do some fancy rpy2 stuff to directly compare against edgeR.
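
For reference, the trend step of estimateDisp amounts to smoothing gene-wise dispersions as a function of average log-CPM. A very rough sketch of that idea, with statsmodels' lowess standing in for locfit (edgeR additionally shrinks gene-wise values toward the trend via empirical Bayes, which is omitted here; all names below are assumptions):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def trended_dispersion(avg_log_cpm, genewise_disp, frac=0.3):
        # Smooth gene-wise dispersion estimates against average log-CPM to
        # obtain a dispersion trend; smoothing on the log scale keeps the
        # trended values positive.
        smoothed_log = lowess(
            np.log(genewise_disp), avg_log_cpm, frac=frac, return_sorted=False
        )
        return np.exp(smoothed_log)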

@Zethson (Member) commented Jun 20, 2022

Amazing @picciama

> Let me know if you have any specific dataset in mind that's well suited for testing the batchGLM procedure. I'll create a Jupyter notebook in the batchglm_tutorials repo, and we could maybe do some fancy rpy2 stuff to directly compare against edgeR.

Would it be too crazy to just compare, for example, the DE results of the original edgeR implementation and the new Python version for a small dataset/simulation? I know that the edgeR model does a lot, but this might be the eventual goal?
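
A minimal sketch of what such a comparison could look like, assuming both tools' result tables are already loaded into aligned per-gene arrays (all variable names are hypothetical):

    import numpy as np
    from scipy import stats

    def compare_de(logfc_r, logfc_py, pval_r, pval_py):
        # Agreement between the two implementations on the same simulated
        # data: Pearson correlation of log fold changes and Spearman
        # correlation of -log10 p-values across genes.
        return {
            "logFC_pearson": stats.pearsonr(logfc_r, logfc_py)[0],
            "neglogp_spearman": stats.spearmanr(
                -np.log10(pval_r), -np.log10(pval_py)
            )[0],
        }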

Thank you!
