Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix : enable vst be used "blindly" and to fit its own dispersion #268

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

laudmt
Copy link

@laudmt laudmt commented Apr 9, 2024

Reference Issue or PRs

Fix #263

What does your PR implement? Be specific.

Proposed solution :

  • A self.fit_type is defined at class initialisation and will by default set the fit_type for DEA and VST.
  • If needed, a fit type can be passed to deseq2() and vst() in order to launch the two analysis with a separate fit_type. It will set self.fit_type to the user one.
  • self.vst_fit() and self.vst_transform() will call internally the self.fit_type.
  • vst_transform() should always be called after vst_fit() or deseq2(). Raise an exception if needed trend_coef have not been computed by deseq2() or vst_fit()
  • Add unit tests

@BorisMuzellec
Copy link
Collaborator

Thanks for the PR @laudmt!

I think it does the trick for VST, but I have a little concern regarding the fact that providing a fit_type other than None overwrites a DeseqDataSet's trend_fit_type.

Here's an example of problematic / ambiguous behavior that could occur:

dds = DeseqDataSet(counts=counts, metadata=metadata, trend_fit_type="parametric")
# Start by computing VST
dds.vst(fit_type="mean")
# Then perform DEA
dds.deseq2() # what trend_fit_type is used to compute the curve ?

In the above example, dds.deseq2() would use trend_fit_type = "mean" (because it was overwritten by vst), but the user probably intended to fit VST with a curve and then run DEA with a parametric curve.

I think we should find a way to make the choices of curve fit type for vst() and deseq2() completely independent.

@laudmt
Copy link
Author

laudmt commented Apr 10, 2024

Thanks for the PR @laudmt!

I think it does the trick for VST, but I have a little concern regarding the fact that providing a fit_type other than None overwrites a DeseqDataSet's trend_fit_type.

Here's an example of problematic / ambiguous behavior that could occur:

dds = DeseqDataSet(counts=counts, metadata=metadata, trend_fit_type="parametric")
# Start by computing VST
dds.vst(fit_type="mean")
# Then perform DEA
dds.deseq2() # what trend_fit_type is used to compute the curve ?

In the above example, dds.deseq2() would use trend_fit_type = "mean" (because it was overwritten by vst), but the user probably intended to fit VST with a curve and then run DEA with a parametric curve.

I think we should find a way to make the choices of curve fit type for vst() and deseq2() completely independent.

Indeed thank you, I was missing this use case.
I will then store two different fit type one for dea and one for vst and enable self.fit_X methods to take a fit_type in argument.

pydeseq2/dds.py Outdated
Comment on lines 461 to 462
self.current_fit_type = fit_type
print(f"Using {self.current_fit_type} fit type.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.current_fit_type doesn't seem to be used elsewhere, so the deseq2() fit_type argument doesn't have any effect

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, forgot to set it to fit_type.

pydeseq2/dds.py Outdated
Comment on lines 322 to 323
self.fit_type = fit_type
print(f"fit type used : {self.fit_type}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set a new fit_type in VST, then I think we should either set it back to the one provided at initialization, or make it clear that it will also change the fit type used in deseq2()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the user should be able, after he launches vst with its own fit_type, to launch deseq2 and expect the fit_type he provided at initialisation?

Is there a risk if deseq2 and vst have their own separate fit_type (lets say self.vst_fit_type and self.dea_fit_type) and since they both share all the other attributes filled during the fit, that vst will compute stuff that deseq2 will have access to it but is not supposed to?

In that case, should they not share any of the attributes (varm, obs, design_factors etc) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late answer @laudmt!

If I understand correctly, the user should be able, after he launches vst with its own fit_type, to launch deseq2 and expect the fit_type he provided at initialisation?

I think that this is indeed what should happen by default, i.e. if no other fit_type is provided when calling deseq2().

Is there a risk if deseq2 and vst have their own separate fit_type (lets say self.vst_fit_type and self.dea_fit_type) and since they both share all the other attributes filled during the fit, that vst will compute stuff that deseq2 will have access to it but is not supposed to?

In that case, should they not share any of the attributes (varm, obs, design_factors etc) ?

Currently, this is what happens. vst() and deseq2() populate the same fields (varm["genewise_dispersions"] and uns["trend_coeffs"] in particular), but they recompute and overwrite them every time (except in the case when we call vst(use_design=True) after calling deseq2() with the same fit type, in which case we can re-use the computation).

In principle, the only issue I see with this is if a user calls e.g. vst(use_design=False) after deseq2(). In that case, varm["genewise_dispersions"] will be overwritten, and if the user wants to perform some analysis of the gene-wise dispersions (e.g. if they call plot_dispersions()), then they will access the wrong values (not the ones used in the DEA).

Perhaps a cleaner thing to do would be to prefix any parameter stored by vst() with "vst_"? (varm["vst_genewise_dispersions"], uns["vst_trend_coeffs"], etc.). This would create some memory overhead but would prevent any confusion. What do you think?

pydeseq2/dds.py Outdated Show resolved Hide resolved
pydeseq2/dds.py Outdated Show resolved Hide resolved
@BorisMuzellec
Copy link
Collaborator

Hi @laudmt, I've pushed a commit in which I implemented the solution I suggested earlier, i.e. setting two separate fit_type and vst_fit_type attributes and duplicating fields (with or without a "vst_" prefix) to avoid data leakage.

Could you have a look when you have time and tell me if that would work for you :) ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] The vst() function does not work when the algorithm switches to a mean-based dispersion trend.
2 participants