MAINT: use pydantic for parameter validation #514
base: develop
Conversation
Hi @DerWeh -- I like what you're doing here and plan to merge it once it fully replaces the existing checks and passes the tests. A few thoughts:

We can add some classes to formalize the DP vs regular EBM class partition. Someday, I think we might support multi-output EBMs, and one strange type of EBM might be one that has both classification and regression outputs. To create one of these strange beasts, you'd need to directly create an EBMModel object, so I like having the ability to make and use these classes directly. I can't think of any good reason, though, to mix private and non-private EBMs, so the private vs regular partition can be a first split in the class hierarchy below the EBMModel class.

You probably already saw that I had a test for check_estimator, but disabled it. I think we can probably now increase our minimum scikit-learn version to allow for this check. We didn't pass all the checks, since EBMs are more permissive in what they accept than scikit-learn, and I'd like to preserve our user-friendliness in that respect, so I expect we'll want to continue disabling some of the scikit-learn checks.
Sure enough. I typically orient myself by NEP 29, but it is a good idea to be more conservative (after all, NumPy is very stable these days, such that using an older NumPy version is typically sufficient).
Out of curiosity: what advantage do you expect? I don't really see the synergy for multi-output in EBMs compared to deep networks (or deep trees). Can we expect an advantage during the boosting phase?
As soon as it is merged (and when I find time), I can continue working on …
I think you're right that there will be very little technical benefit. Mostly, I think of it as potentially a nice packaging solution for grouping related predictions, although I might reject the idea if it adds anything more than a little complexity.

EBMs do have some benefits in terms of the compactness of the model representation. For example, SoftAtHome, which wrote Ebm2Onnx (https://github.com/interpretml/ebm2onnx), uses EBMs inside their routers, and I also saw a paper where some company was evaluating EBMs to be used inside pacemakers (!!!). Sharing the feature definitions and bin cuts might be nice in those applications, both for the compactness of the model and because it could lower the computation requirements of predictions and speed them up.

Probably more likely than boosting multi-output models together would be constructing two separate EBMs and then merging them together as a post-process step, similarly to how merge_ebms works today. But who knows, maybe someone really will want some super weird objective that includes both classification and regression. I'd like that, even if just for the entertainment value of seeing someone use an EBM this way.
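For reference, a minimal sketch of that post-process merging with interpret's `merge_ebms` (the dataset and the split into halves are made up for illustration):

```python
from sklearn.datasets import make_classification
from interpret.glassbox import ExplainableBoostingClassifier, merge_ebms

X, y = make_classification(n_samples=200, random_state=0)

# Train two separate EBMs, e.g. on different slices of the data ...
ebm_a = ExplainableBoostingClassifier().fit(X[:100], y[:100])
ebm_b = ExplainableBoostingClassifier().fit(X[100:], y[100:])

# ... and combine them afterwards instead of boosting them together.
merged = merge_ebms([ebm_a, ebm_b])
```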
We don't allow multi-class interactions in the default API, entirely because of the difficulties in visualizing them. You can construct multi-class EBMs with interactions today if you use the measure_interactions function manually. Here's a slightly related example showing how to use it: https://interpret.ml/docs/python/examples/custom-interactions.html
Hi @DerWeh -- I've been thinking through this question the last few days, and here's what I came up with in terms of a hierarchy:

EBMBase -> abstract base class for all EBM models
…

Thoughts?
Yes, the hierarchy sounds very reasonable. Though I am not quite sure whether making EBMBase an abstract base class has any benefit; see the discussion: https://stackoverflow.com/questions/3570796/why-use-abstract-base-classes-in-python. But it doesn't really seem very important to me, so I am fine with anything. However, now that EBMs have the checks for … What is your opinion, @paulbkoch?
You're probably right that it has very little technical benefit in python. I was thinking of it mainly as a form of documentation to indicate that the EBMBase class should not be instantiated directly and then just using duck typing as we currently are. I haven't written as much python as you have (seemingly) and have never made an abstract class in python, but I'm assuming adding the abstract part is fairly lightweight.
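For illustration (not from the PR): making a class abstract in Python is indeed lightweight. Inheriting from ABC and marking a method abstract is enough to prevent direct instantiation:

```python
from abc import ABC, abstractmethod


class EBMBase(ABC):
    """Sketch: shared EBM functionality; not meant to be instantiated directly."""

    @abstractmethod
    def fit(self, X, y):
        """Implemented by the concrete subclasses."""


# EBMBase() now raises TypeError; subclasses that implement fit() work as usual.
```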
Yeah, that does seem a bit ugly. I didn't realize pydantic was unable to use deferred checks. Even if we get the clone function working, though, it's not clear whether the scikit-learn requirement to defer checks will break anything else, now or in the future. Maybe we should instead move back to option #2, @DerWeh, where we move the checks into a separate _validate function? That would at least move them out of the way, so that someone reading the fit function would only need to see them if they were specifically interested in the checks. And we'd get to keep the more descriptive error messages too.
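A minimal sketch of that option (the class name and the specific checks are illustrative, not taken from the PR):

```python
class EBMModel:
    def __init__(self, max_bins=32, learning_rate=0.01):
        # scikit-learn convention: __init__ only stores parameters verbatim
        self.max_bins = max_bins
        self.learning_rate = learning_rate

    def _validate(self):
        """All parameter checks live here, with descriptive error messages."""
        if self.max_bins < 2:
            raise ValueError(f"max_bins={self.max_bins}, but at least 2 bins are required")
        if self.learning_rate <= 0:
            raise ValueError(f"learning_rate={self.learning_rate}, but must be positive")

    def fit(self, X, y):
        self._validate()  # checks are deferred to fit, keeping clone() happy
        # ... the actual fitting logic stays readable ...
        return self
```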
I agree, it should be lightweight, so I never did a benchmark (or in fact ever used it). For documentation purposes, the most important part is probably not making the class public.
Sure, having a separate `_validate` function sounds good to me.

Another question is whether I should keep the 'Yoda conditions' that are currently used for type checks (example: …).

Anyway, I'll play around a bit more in case I find a reasonable (elegant) way to use …

For documentation purposes: I just gave `__sklearn_clone__` a try out of curiosity, and it is straightforward:

```python
import copy

from sklearn.base import clone


def __sklearn_clone__(self, *, safe=True):
    """Customized implementation of clone. See :func:`sklearn.base.clone` for details."""
    klass = self.__class__
    new_object_params = self.get_params(deep=False)
    for name, param in new_object_params.items():
        new_object_params[name] = clone(param, safe=False)
    new_object = klass(**new_object_params)
    try:
        new_object._metadata_request = copy.deepcopy(self._metadata_request)
    except AttributeError:
        pass
    # _sklearn_output_config is used by `set_output` to configure the output
    # container of an estimator.
    if hasattr(self, "_sklearn_output_config"):
        new_object._sklearn_output_config = copy.deepcopy(self._sklearn_output_config)
    return new_object
```
Avoiding boilerplate through dataclasses sounds like a good idea. I guess this implies having the class hierarchy, since we conditionally assign instance attributes based on whether the class is differentially private or not.
If you want to change the ordering, go for it! I do use Yoda conditionals in C++ to combat assignment bugs, since "if(a = 5)" compiles but "if(5 = a)" does not. I don't care about the order in python, though, since assignment isn't legal inside a conditional in python.
Sounds good. Thanks again for investigating all these options!
We use a regular Python dataclass in combination with deferred parameter validation using pydantic's `TypeAdapter`. This is necessary to be compatible with scikit-learn's `clone` function.
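A minimal sketch of what this pattern could look like (the class and field names are made up for illustration, not taken from the PR):

```python
from dataclasses import dataclass, fields
from typing import Optional

from pydantic import TypeAdapter


@dataclass
class TinyEBM:
    # A plain dataclass keeps __init__ free of side effects, so
    # sklearn.base.clone can copy the parameters verbatim.
    learning_rate: float = 0.01
    max_rounds: Optional[int] = 300

    def fit(self, X, y):
        # Validation is deferred until fit: pydantic re-validates the
        # already-assigned parameters against the field annotations.
        TypeAdapter(type(self)).validate_python(
            {f.name: getattr(self, f.name) for f in fields(self)}
        )
        # ... fitting logic ...
        return self
```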
When we are not interested in an accurate result, we can drastically reduce the parameters to cut the fit time. On my old laptop, this reduces the test time from 8 minutes to 1 minute.
Ok, the parameter validation seems to be working now. I managed to find a way based on pydantic's TypeAdapter. The DP versions are currently broken, as I focused only on the non-private versions. I am currently looking into breaking up the …
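For context, a tiny example of the TypeAdapter mechanism (the constraint shown is hypothetical, not one of the actual EBM checks):

```python
from typing import Annotated

from pydantic import Field, TypeAdapter

# A constrained hyperparameter type: a strictly positive float.
LearningRate = Annotated[float, Field(gt=0)]

adapter = TypeAdapter(LearningRate)
adapter.validate_python(0.01)  # returns 0.01
adapter.validate_python(-1.0)  # raises pydantic.ValidationError
```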
```python
noise_scale_binning_: float
noise_scale_boosting_: float
objective: Literal["log_loss"] = "log_loss"
```
objective can be any string, as above.
```python
@@ -3259,103 +2932,17 @@ class DPExplainableBoostingRegressor(EBMModel, RegressorMixin, ExplainerMixin):
    noise_scale_boosting\_ : float
        The noise scale during boosting.
    """
    objective: Literal["rmse"] = "rmse"
```
objective can be any string, as above.
@paulbkoch I'll answer your comments in detail as soon as I find the time. For now, I can just give a more general remark: the big question is whether we use such a monkey-patch approach as an intermediate step, followed by additional clean-up PRs, or whether we go for a total revamp from the start. Currently, I went for the first option, as the PR is already a huge change and I am an outsider to the project. I don't really want to mess things up, and incremental merges might be easier to review as the changes are more traceable.
Thanks for the background @DerWeh. We can split it up into 2 PRs if that's easier. The drawback I see is that it temporarily puts the project in a state where we shouldn't put out a release, but I think that's fine in the short term since I get the sense this is a hole we can plug quickly. It shouldn't break any existing code if anyone clones the repo, so that's ok too.
Hi @DerWeh -- I just noticed something a bit concerning. I downloaded the documentation produced from the PR, and the documentation for the ExplainableBoostingClassifier and ExplainableBoostingRegressor is missing. Not sure what the problem is yet. The docs are available to download here: …

We use Jupyter Book to make our documentation, which sits on top of Sphinx. The ExplainableBoostingClassifier is documented here: …

I'll have a look.
On the documentation, I think we might just need to add the sphinx-pydantic extension (https://sphinx-pydantic.readthedocs.io/en/latest/) here: interpret/docs/interpret/_config.yml, line 22 (d5ec3fa).
Hi @DerWeh -- I've resolved the ShapTree test that was failing on the develop branch. It was due to a change in the SHAP API.
```python
feature_names: Optional[Sequence[Union[None, str]]] = None,
feature_types: Optional[
    Sequence[Union[None, str, Sequence[str], Sequence[float]]]
] = None,
# Preprocessor
max_bins: int = 32,
# Stages
exclude: Optional[Sequence[Union[int, str, Sequence[Union[int, str]]]]] = None,
# Ensemble
validation_size: Optional[Union[int, float]] = 0,
outer_bags: int = 1,
# Boosting
learning_rate: float = 0.01,
max_rounds: Optional[int] = 300,
# Trees
max_leaves: int = 3,
# Overall
n_jobs: Optional[int] = -2,
random_state: Optional[int] = None,
# Differential Privacy
epsilon: float = 1.0,
delta: float = 1e-5,
composition: str = "gdp",
bin_budget_frac: float = 0.1,
privacy_bounds: Optional[
    Union[np.ndarray, Mapping[Union[int, str], Tuple[float, float]]]
] = None,
privacy_target_min: Optional[float] = None,
```
Is there any way to preserve the parameter ordering of the __init__ functions if some of the attributes are defined in the EBMBase, EBMModel, and ExplainableBoostingClassifier/ExplainableBoostingRegressor classes, while still using the dataclass decorators? It looks like, by default, the parameters will be grouped by inheritance ordering, so all the EBMBase attributes would come first, then all the EBMModel attributes, etc.
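For illustration, a minimal demo of the behavior in question (the class names here are made up), plus the kw_only escape hatch available since Python 3.10, which at least stops the ordering from affecting call sites:

```python
import inspect
from dataclasses import dataclass


@dataclass
class Base:
    feature_names: object = None
    max_bins: int = 32


@dataclass
class Child(Base):
    learning_rate: float = 0.01


# Fields are ordered base-class first, regardless of how we'd like
# the __init__ signature to read:
print(inspect.signature(Child))
# (feature_names: object = None, max_bins: int = 32, learning_rate: float = 0.01)

# With @dataclass(kw_only=True) (Python 3.10+) all parameters become
# keyword-only, so the grouping no longer affects positional call sites.
```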
Sorry, I never really used notebooks to generate documentation, and I am struggling a bit with the Azure pipeline. Where does the pinning come from? Can I just update the package?
Yes, just update the pinning if necessary. I think this is pinned because Jupyter Book changes format frequently and pinning preserves documentation consistency, but we'll need to update these periodically.
If we dropped python 3.8 support, we could eliminate the typing_extensions dependency, right? Seems like a good tradeoff.
I've updated the develop branch pipelines to use python 3.9 for building the docs (and other build tasks).
Sorry for the wait, I didn't manage to spend much time on it last week. Indeed, we only use `Annotated` from `typing_extensions`. However, the import is hard-coded, so I expect that …

Just googled a little: if you want to still support Python 3.8 and keep the dependencies as slim as possible, there is an option to make dependencies specific to a Python version:

```python
setup(
    ...,
    install_requires=[
        "typing_extensions;python_version<'3.9'",
    ],
)
```

Of course, this would complicate the source code:

```python
import sys
...
if sys.version_info < (3, 9):
    from typing_extensions import Annotated
else:
    from typing import Annotated
```

Your thoughts, @paulbkoch? I hope I can resolve the conflicts with interpret 0.6.0 in the next days and continue with this PR.
Hi @DerWeh -- I'm not too fussy on this, but of the options given, I'd have a preference towards keeping things simple and dropping support for 3.8. Anyone who needs Python 3.8 can just stick with interpret 0.6.0 until support for 3.8 officially ends in a few months.
This is a first proof of concept of using pydantic for the validation of parameters. I used Python 3.10 syntax as it is more concise, and used `pydantic=2.6.2` without putting much thought into it.

Currently, the `fit` method is extremely messy due to all the checks. Of course, a refactoring putting these into a dedicated method would already simplify things. I think using `pydantic` might be a more elegant solution. This draft is mostly for discussion of whether this is a good idea or not. I tried to be faithful to the checks that were implemented.

A disadvantage is that `pydantic` errors are harder to read. On the plus side, the type annotations are much clearer than having to search through the `fit` method. It's probably also much easier to maintain the validation.

Currently, I break the private versions of the classifier. Attributes were set depending on whether the estimator is private or not. Do you think it would be better to define a superclass with the common arguments and subclasses with the additional arguments?

Another question is how to best proceed with overriding the default arguments. Currently, there is a lot of redundancy.

If we want to follow through with this idea, I think some sklearn checks should be enabled again using `parametrize_with_checks` or `check_estimator`.
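For reference, a hedged sketch of what re-enabling those checks could look like with pytest (which individual checks to skip would still need to be decided):

```python
from sklearn.utils.estimator_checks import parametrize_with_checks

from interpret.glassbox import (
    ExplainableBoostingClassifier,
    ExplainableBoostingRegressor,
)


@parametrize_with_checks(
    [ExplainableBoostingClassifier(), ExplainableBoostingRegressor()]
)
def test_sklearn_compatible_estimator(estimator, check):
    # Runs each applicable scikit-learn estimator check; checks where
    # EBMs are intentionally more permissive could be skipped here.
    check(estimator)
```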